CFD Online Discussion Forums - how well does CFX parallel scale?

how well does CFX parallel scale?

Hi,

We're trying to run CFX parallel jobs on our cluster, and as near as I can tell, CFX seems to be scaling terribly.

As an example, each of our nodes has 4 CPUs with 12 cores each. When running CFX with 4 processes on a node, each process utilizes only 25% of the core it's running on. At 8 processes it's only 8% per core, and at 48 processes it's reported down at 1 or 2% per process! This is with no other processes competing for CPU cycles.

Can anyone comment on their experience with how well the CFX parallel solver scales, either on a single node with multiple cores or across nodes?

Sorry, wrote the reply below assuming you are running distributed parallel, but on a second read of your question I think you are running local parallel. Oh well, I will leave the reply up anyway just in case it is relevant. I will tack the local parallel comments on the bottom.

It is not as simple as you seem to imply. When you are trying to pipe 12 cores' worth of computations down a single network pipe you are going to have a major bottleneck unless you have a super-duper network. What network are you running? If you are on Ethernet then forget it; it will never work. You will need InfiniBand, Myrinet or one of the other high-end networks to get decent scaling out of this.

And it does not stop there - you need a motherboard with a fast pipe from CPU to the network adapter. I have seen differences of 4x between motherboards with the same CPU. A quality motherboard is essential. And then there is the network switch as well.

It is highly unlikely your poor scale-up is due to CFX. CFX has run out to thousands of CPUs with good speedups, so something is wrong with your setup. And if you have got all the software set up correctly then it probably means you need to buy a really expensive network.

********

If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise for the CPU setup: it should detect that and optimise the setup for it.

The speed up you are reporting is pretty bad and something is wrong. CFX has good parallel performance so it is unlikely to simply be CFX.

Can you describe the simulation you are using to get these performance and speedup figures?

configuration issue is good news

Hi ghorrocks,

Thanks for your reply, and from what you've said, it's probably a configuration issue and not a CFX scalability issue, which is good news because it means the performance can be improved:)

We would like to run distributed parallel, but we're first just trying to get CFX up and running smoothly with local parallel jobs.

Quick hardware rundown:

We have a new cluster with 318 nodes. Our interconnect is 40 Gb/s InfiniBand and each node has:
- 4 x AMD Opteron 6238 (Interlagos) 12-core 2.6 GHz processors (48 cores total)
- 128GB RAM

...so I'm confident we have sufficient hardware specs to get decent performance, so now to the configuration...

The simulation is dead simple; just running the StaticMixer.def file that comes in the examples dir.

Quote:

If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise for the CPU setup: it should detect that and optimise the setup for it.

Ah! Those are some of the clues I'm looking for! Are there different executables for different CPU architectures (Intel vs AMD)? We installed the Linux x64 executables, and I'm only seeing one 'cfx5solve' executable in the 'CFX/bin' dir. How can I determine if the executable has correctly detected the AMD arch?

A lot of the software we use we compile from source for our specific architecture, but in the ANSYS case, it seems it installed pre-existing Linux x64 executables.

I've looked through quite a bit of ANSYS documentation, but I'll keep digging. I am new to the software suite so if I've missed this information somewhere any pointers are appreciated!

If you are just using the static mixer tutorial then it is not big enough to get a good scale up with 48 cores. I would remesh the static mixer to a far finer mesh, something which will have a solver time of a few minutes on 48 cores (so an hour or more when run serial).

I am at home at the moment so cannot direct you where to look for the executable. Check the documentation to see if it gives you a pointer. Also if your CPU or OS is really new it might not be recognised.

One of our users supplied us with a larger mesh that has about 7 million cells, so we'll do some tests with it and see what the wall time is for serial vs some parallel runs. While the tests are running, I'll continue digging through docs for more information about optimizing the runs for a specific arch.

Also, what time are you comparing? You should compare solver wall times. Do not compare total simulation time as that contains the setup and packup stuff and that is not parallel, so will distort the results.
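To make the comparison above concrete, here is a minimal Python sketch of the usual speedup and parallel-efficiency figures computed from solver wall times only. The function names and the example timings are made up for illustration; they are not measurements from this thread.

```python
def speedup(serial_wall_s, parallel_wall_s):
    """Speedup = serial solver wall time / parallel solver wall time."""
    if serial_wall_s <= 0 or parallel_wall_s <= 0:
        raise ValueError("wall times must be positive")
    return serial_wall_s / parallel_wall_s


def efficiency(serial_wall_s, parallel_wall_s, n_procs):
    """Parallel efficiency = speedup / number of processes (1.0 is ideal)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(serial_wall_s, parallel_wall_s) / n_procs


if __name__ == "__main__":
    # Hypothetical numbers: 3600 s serial, 120 s on 48 processes.
    print(speedup(3600.0, 120.0))
    print(efficiency(3600.0, 120.0, 48))
```

Feed it solver wall times only, not total simulation time, for the reason given above.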

Hi,

Just wanted to report back that we identified and solved the issue. It wasn't an issue with CFX itself, but rather a configuration issue with our job scheduler, SLURM. SLURM was confining all the processes to one core, hence the drop-off in CPU utilization per process.
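As a rough sanity check on that explanation, the sketch below estimates the CPU share each process can get when N compute-bound processes are squeezed onto a limited number of cores. It assumes the processes share the available cores evenly, which is a simplification; the function name is hypothetical.

```python
def expected_cpu_share(n_procs, cores_available):
    """Approximate per-process CPU utilisation in percent.

    Assumes compute-bound processes that share the available cores evenly.
    """
    if n_procs < 1 or cores_available < 1:
        raise ValueError("n_procs and cores_available must be at least 1")
    return min(100.0, 100.0 * cores_available / n_procs)


if __name__ == "__main__":
    # All processes confined to a single core, as in the misconfigured jobs.
    for n in (4, 8, 48):
        print(n, round(expected_cpu_share(n, 1), 1))
```

With one core available this gives 25% for 4 processes and about 2% for 48, which is in the same ballpark as the figures in the first post; the 8-process case comes out at 12.5% rather than the reported 8%, so treat it as an approximation only.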

Typically, SLURM likes to launch all the processes itself via srun, but since CFX spawns its own processes when running in parallel, we needed to explicitly tell SLURM to allocate the resources (in this case the number of cores) that CFX will launch processes for, even though SLURM itself isn't launching the processes. Once we had this configured correctly we got good performance from CFX, since it was able to utilize a core per process!

So, thanks for helping narrow down possible contributors to the problem :-)
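For anyone hitting the same thing, one way to spot this kind of confinement is to check, from inside the job, how many cores the job's processes are actually allowed to run on before starting the solver. Below is a small Python sketch that assumes a Linux node, where child processes inherit the affinity mask of the process that launches them. The helper names are hypothetical, and the thread does not say which SLURM options were changed, so none are shown here.

```python
import os


def allowed_cores():
    """Number of cores this process may run on.

    Uses the Linux affinity mask when available, else falls back to the
    total core count reported by the OS.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def check_allocation(n_procs, cores=None):
    """Compare the planned number of solver processes with the allowed cores."""
    cores = allowed_cores() if cores is None else cores
    if n_procs > cores:
        return f"oversubscribed: {n_procs} processes on {cores} allowed core(s)"
    return f"ok: {n_procs} processes on {cores} allowed core(s)"


if __name__ == "__main__":
    # 48 is the per-node core count mentioned earlier in the thread.
    print(check_allocation(48))
```

Run it from the batch script just before the solver is launched; if it reports one allowed core for a 48-process run, the scheduler allocation is the first thing to look at.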

Hi danieru!
Just a little hint that we got from CFX support concerning CFX scaling on multiprocessors. To get a good speed-up, ANSYS recommends something like 250,000 cells per core. With fewer you will lose time due to the inter-core communication, and with a lot more each core will simply be "overloaded" ;)
So for your case of about 7 million cells you should use 28 cores to see a speed-up.
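The arithmetic behind that figure, as a short Python sketch. The 250,000 cells per core default is the rule of thumb quoted above, not a hard limit, and the function name is made up for illustration.

```python
def suggested_cores(total_cells, cells_per_core=250_000):
    """Core count implied by a cells-per-core rule of thumb."""
    if total_cells < 1 or cells_per_core < 1:
        raise ValueError("total_cells and cells_per_core must be at least 1")
    return max(1, round(total_cells / cells_per_core))


if __name__ == "__main__":
    print(suggested_cores(7_000_000))  # 7,000,000 / 250,000 = 28
```

Passing a different `cells_per_core` shows how sensitive the answer is to the rule of thumb you pick.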

Hej monkey1,

That's great info! We'll pass that on to CFX users on our cluster. Thanks for taking a moment to post:-)

We've seen almost linear scaling up to 256 cores with 50 000 hexas per core, so I would say that the optimum number of cells/core is both problem and cluster dependent.

The CFX manual talks about a minimum of 30,000 nodes per partition for tetrahedral meshes and 75,000 nodes per partition for hexahedral meshes, but the actual numbers could be either lower or higher.
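Turned around, those minimums give an upper bound on how many partitions a given mesh supports. A minimal Python sketch follows, using the two figures quoted above; the dictionary, the function name and the 3-million-node example mesh are all hypothetical. Note that the guideline counts mesh nodes, not cells.

```python
# Minimum mesh nodes per partition, as quoted above from the CFX manual.
MIN_NODES_PER_PARTITION = {"tet": 30_000, "hex": 75_000}


def max_partitions(mesh_nodes, element_type):
    """Largest partition count that keeps each partition above the minimum."""
    if mesh_nodes < 1:
        raise ValueError("mesh_nodes must be at least 1")
    minimum = MIN_NODES_PER_PARTITION[element_type]
    return max(1, mesh_nodes // minimum)


if __name__ == "__main__":
    # Hypothetical mesh with 3 million nodes.
    print(max_partitions(3_000_000, "tet"))
    print(max_partitions(3_000_000, "hex"))
```

As said above, the real optimum depends on the problem and the cluster, so treat the result as a starting point for scaling tests rather than a limit.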