Home > Forums > Software User Forums > ANSYS > CFX

CFX performance scaling on multicore local server


July 18, 2016, 16:35   #1
New Member
Sachin Aggarwal
Join Date: Aug 2014
Posts: 4
Hi,

I have been running a frozen rotor problem on a local server using 19 cores in parallel with Intel Local Parallel MPI. My company recently purchased a new server so we could run 32 cores and make full use of our HPC licenses. When I run the same simulation on the new server with 32 cores it runs slower, but when I run it with the same 19 cores it runs a little faster than on the old server. Are there any settings I am missing? My simulation is large enough (14+ million elements) that, in my understanding, it should not have any multi-threading issues. Can anyone give me some insight or guidance on this?

I really appreciate the help. Thank you,

Sachin Aggarwal

July 18, 2016, 20:31   #2
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,703
Effective multi-processor simulations require good interconnects, memory buses and much more.

If you have 32 cores in a single machine, it needs to be carefully designed for multi-threaded operation or it will not give you the performance you expect.

Also check that you have not crippled the machine:
* The BIOS is current
* Motherboard, hard drive, ethernet and other drivers are correct and current
* Firmware is current in the hard drive and other gizmos
* You have not run out of memory
* You are not sharing the machine with other users
* Your antivirus or other background processes are not causing problems

July 19, 2016, 14:32   #3
New Member
Sachin Aggarwal
Join Date: Aug 2014
Posts: 4
Hi Glenn,

Thank you for your reply.
We completed the installation of this machine just last week, so the hardware and software were configured by a Dell technician himself. The server has 192 GB of RAM, and only 12-15 GB is in use while the simulation is running. Virtualization is enabled on the server; I cannot say whether that affects the solution time or not. I am not sure about antivirus, but I will check with IT. The server has 44 cores in an FC630 blade configuration installed in an FX2 blade chassis.

I hope this helps.

Other than this, I ran a scaling study on my set-up, increasing the number of cores in steps of 4, and found that 28 cores is the most time-efficient, not 32. Any thoughts on that?
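As an aside, a core-count sweep like this is easier to interpret as speedup and parallel efficiency rather than raw wall-clock time. A minimal sketch of that calculation, where the timing numbers are made-up placeholders to be replaced with your own solver wall-clock times:

```python
# Compute speedup and parallel efficiency from a core-count sweep.
# The wall-clock times below are hypothetical placeholders, not real data;
# substitute the CFX wall-clock times from your own runs.
timings = {4: 1000.0, 8: 520.0, 16: 290.0, 28: 180.0, 32: 195.0}  # cores -> seconds

base_cores = min(timings)
base_time = timings[base_cores]

for cores, t in sorted(timings.items()):
    speedup = base_time / t                    # relative to the smallest run
    efficiency = speedup * base_cores / cores  # 1.0 = perfect linear scaling
    print(f"{cores:3d} cores: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

With numbers like these, efficiency at 32 cores drops below that at 28, which is exactly the signature of a bottleneck rather than a partitioning problem.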

Thank you very much for your help.

Regards,

Sachin Aggarwal

July 19, 2016, 19:45   #4
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,703
If 28 cores runs better than 32, it suggests there is a bottleneck in your system which prevents it from scaling efficiently to larger numbers of cores.

Do not assume that because it was installed by a technician and is the latest hardware, it is suitable for large multiprocessor simulation. Most large multiprocessor systems are designed as servers and web servers, and those have very different demands from multiprocessor simulations.

Also - make sure your simulation is suitable for lots of partitions. How many nodes per core? What physics are you modelling?

Here are some examples of things which have caught me out in the past on multiprocessor simulations:
1) A workstation straight from the vendor (Dell) ran at half the speed I expected based on spec.org results. I found the BIOS did not support the CPU; when I upgraded to the latest BIOS it supported the CPU and the speed doubled to the expected value.
2) A high-end workstation straight from the vendor ran a different simulation package at a fraction of the expected speed. It turned out the motherboard was unsuitable for multi-processor operation because the FSB was not fast enough for the memory throughput, despite the machine having the best CPU and lots of memory. We had to downgrade the machine to a CAD workstation and buy more suitable machines, this time after checking the technical details of the workstation carefully.
3) How is the CPU-to-memory and CPU-to-CPU interconnect done on this machine?

July 20, 2016, 02:31   #5
Senior Member
Maxim
Join Date: Aug 2015
Location: Germany
Posts: 415
Quote:
Originally Posted by ghorrocks View Post
3) How is the CPU to memory and CPU to CPU interconnect done on this machine?
This is a key point. I don't know much about HPC hardware, but as far as my understanding goes, any bottleneck can slow the whole thing down. So if your upgrade went like "I already have those 19 cores (why is this such an odd number?) of an older CPU/RAM generation and we just add some more of the newer generation", that isn't really helpful.

My CFD hardware guy showed me benchmarks of multi-core CPUs where, apparently, two 4-core Xeons on one mainboard (with the same amount of RAM each) are faster than one 8-core Xeon with the same total RAM. So is your 32-core cluster four computers, each with two 4-core Xeons on one mainboard, plus InfiniBand for the connection?
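The usual explanation for that benchmark result is memory bandwidth per core: two sockets bring two sets of memory channels, so each core gets twice the bandwidth. A back-of-the-envelope sketch, where the channel count and per-channel bandwidth are illustrative assumptions rather than specs for any particular Xeon:

```python
# Rough memory bandwidth per core for two hypothetical configurations.
# 4 channels per socket and ~17 GB/s per channel are illustrative
# assumptions, not specifications of any particular CPU.
def bandwidth_per_core(sockets, cores_per_socket,
                       channels_per_socket=4, gb_per_channel=17.0):
    total_bw = sockets * channels_per_socket * gb_per_channel
    return total_bw / (sockets * cores_per_socket)

two_quad = bandwidth_per_core(sockets=2, cores_per_socket=4)  # 2 x 4-core
one_octo = bandwidth_per_core(sockets=1, cores_per_socket=8)  # 1 x 8-core

print(f"2 x 4-core: {two_quad:.1f} GB/s per core")  # twice the per-core bandwidth
print(f"1 x 8-core: {one_octo:.1f} GB/s per core")
```

Since CFD solvers are typically memory-bandwidth bound, the per-core bandwidth figure often predicts relative performance better than core count does.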

Your question might also be suitable for the hardware section of this forum - that is where the hardware guys are hiding.

July 21, 2016, 16:14   #6
New Member
Sachin Aggarwal
Join Date: Aug 2014
Posts: 4
Quote:
Originally Posted by ghorrocks View Post
If 28 cores is running better than 32 it suggests there is a bottleneck in your system which is preventing it running efficiently to larger number of cores. [...]
Hi Glenn,

Thank you for your reply.

I am working with my IT department to figure out the answers to your questions. They told me the BIOS is up to date, and they will also look into the FSB, the motherboard and the interconnect. When I get the answers I will let you know.

About the problem itself: I am simulating a high-speed wind turbine with a frozen rotor interface. The model has 14+ million elements and 8+ million nodes. As far as I know, the rule of thumb is 50-100k nodes per core, which makes me believe I should be able to use up to 80 cores without any loss of performance. I am using rotational periodicity to halve the problem size, and the default MeTiS partitioner to partition the model.
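The arithmetic behind that 80-core estimate, sketched out (the 50-100k nodes/core band is the rule of thumb quoted above, not a hard limit):

```python
# Apply the 50k-100k nodes-per-core rule of thumb to an 8M-node mesh.
nodes = 8_000_000
min_nodes_per_core = 50_000
max_nodes_per_core = 100_000

# Staying at or above ~100k nodes/core is the conservative end of the rule:
max_cores_conservative = nodes // max_nodes_per_core  # 80 cores
# At ~50k nodes/core the partitions get small and communication starts to dominate:
max_cores_aggressive = nodes // min_nodes_per_core    # 160 cores

print(f"Conservative upper limit: {max_cores_conservative} cores")
print(f"Aggressive upper limit: {max_cores_aggressive} cores")
```

Note the rule only says the partitions are big enough; it does not guarantee the hardware can feed 80 cores with memory bandwidth, which is the separate issue discussed above.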

Thank You,

Sachin Aggarwal

July 21, 2016, 16:18   #7
New Member
Sachin Aggarwal
Join Date: Aug 2014
Posts: 4
Quote:
Originally Posted by -Maxim- View Post
This is a key point. I don't know much about HPC hardware but as far as my understanding goes, any bottleneck can slow the whole thing down. [...]
Hi Maxim,

We did not add new CPUs alongside the old ones; we replaced the machine entirely. The old server was a 20-core machine, and I was using 19 of the 20 cores because using all 20 bogged it down. The new machine has 44 cores in total, and my intention was to use 32 of them. I hope this clears things up.

Thank You,

Sachin Aggarwal

July 21, 2016, 18:19   #8
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,703
You do not appear to be modelling any physics that cause multiprocessor issues.

Can you show a graph of simulation speed versus number of cores? Also, how does your simulation speed compare to the spec.org result for your machine?


Tags
cfx 17.1, intel local parallel mpi, multi-cores, solve time

