how well does CFX parallel scale?

February 20, 2012, 02:26
how well does CFX parallel scale?
#1
New Member
Daniel Petersen
Join Date: Feb 2012
Posts: 6
Rep Power: 14
danieru is on a distinguished road
Hi,

We're trying to run CFX parallel jobs on our cluster, and as near as I can tell, CFX seems to be scaling terribly.

As an example, each of our nodes has 4 CPUs with 12 cores each. When running CFX with 4 processes on a node, each process utilizes only 25% of the core it's running on. At 8 processes it's only 8% per core, and at 48 processes it's reported at 1 or 2% per process! This is with no other processes competing for CPU cycles.

Can anyone comment on their experience with how well the CFX parallel solver scales, both on a single node with multiple cores and across nodes?

February 20, 2012, 05:03
#2
Super Moderator
Glenn Horrocks
Join Date: Mar 2009
Location: Sydney, Australia
Posts: 17,697
Rep Power: 143
ghorrocks is just really nice
Sorry, wrote the reply below assuming you are running distributed parallel, but on a second read of your question I think you are running local parallel. Oh well, I will leave the reply up anyway just in case it is relevant. I will tack the local parallel comments on the bottom.

It is not as simple as you seem to imply. When you are trying to pipe 12 cores' worth of computations down a single network pipe you are going to have a major bottleneck unless you have a super-duper network. What network are you running? If you are on Ethernet then forget it, it will never work. You will need InfiniBand, Myrinet or one of the other high-end networks to get decent scaling out of this.

And it does not stop there - you need a motherboard with a fast pipe from CPU to the network adapter. I have seen differences of 4x between motherboards with the same CPU. A quality motherboard is essential. And then there is the network switch as well.

It is highly unlikely your poor scale-up is due to CFX. CFX has run out to thousands of CPUs with good speedups, so something is wrong with your setup. And if you have got all the software set up correctly then it probably means you need to buy a really expensive network.

********

If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise the CPU setup, it should detect that and optimise the setup for it.

The speed up you are reporting is pretty bad and something is wrong. CFX has good parallel performance so it is unlikely to simply be CFX.

Can you describe the simulation you are using to get these performance and speedup figures?

February 20, 2012, 05:40
configuration issue is good news
#3
New Member
Daniel Petersen (danieru)
Hi ghorrocks,

Thanks for your reply. From what you've said, it's probably a configuration issue and not a CFX scalability issue, which is good news because it means the performance can be improved.

We would like to run distributed parallel, but we're first just trying to get CFX up and running smoothly with local parallel jobs.

Quick hardware rundown:

We have a new cluster with 318 nodes. Our interconnect is 40 Gb/s InfiniBand and each node has:
- 4 x AMD Opteron 6238 (Interlagos) 12-core 2.6 GHz processors (48 cores total)
- 128 GB RAM

...so I'm confident we have sufficient hardware specs to get decent performance, so now to the configuration...

The simulation is dead simple; just running the StaticMixer.def file that comes in the examples dir.

Quote:
If you are getting this performance on local parallel then check that you are running the correct parallel solver for your CPU. If your solver has not detected the CPU correctly it will run with the default solver, which is really slow. Likewise the CPU setup, it should detect that and optimise the setup for it.
Ah! Those are some of the clues I'm looking for! Are there different executables for different CPU architectures (Intel vs AMD)? We installed the Linux x64 executables, and I'm only seeing one 'cfx5solve' executable in the 'CFX/bin' dir. How can I determine whether the executable has correctly detected the AMD architecture?

A lot of the software we use is compiled from source for our specific architecture, but in the ANSYS case it seems it installed prebuilt Linux x64 executables.

I've looked through quite a bit of ANSYS documentation, but I'll keep digging. I am new to the software suite so if I've missed this information somewhere any pointers are appreciated!

February 20, 2012, 05:45
#4
Super Moderator
Glenn Horrocks (ghorrocks)
If you are just using the static mixer tutorial then it is not big enough to get a good scale up with 48 cores. I would remesh the static mixer to a far finer mesh, something which will have a solver time of a few minutes on 48 cores (so an hour or more when run serial).

I am at home at the moment so cannot direct you where to look for the executable. Check the documentation to see if it gives you a pointer. Also if your CPU or OS is really new it might not be recognised.

February 20, 2012, 10:08
#5
New Member
Daniel Petersen (danieru)
One of our users supplied us with a larger mesh that has approximately 7 million cells, so we'll do some tests with it and see what the wall time is for serial vs. some parallel runs. While the tests are running, I'll continue digging through the docs for more information about optimizing the runs for a specific architecture.

February 20, 2012, 16:53
#6
Super Moderator
Glenn Horrocks (ghorrocks)
Also, what time are you comparing? You should compare solver wall times. Do not compare total simulation time as that contains the setup and packup stuff and that is not parallel, so will distort the results.
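
To make the comparison concrete, here is a minimal Python sketch of how speedup and parallel efficiency could be computed from solver wall times. The function names and the timings in the example are made up for illustration; they are not measurements from this thread and not output from CFX.

```python
def speedup(serial_wall_s, parallel_wall_s):
    """Speedup = serial solver wall time / parallel solver wall time."""
    return serial_wall_s / parallel_wall_s


def efficiency(serial_wall_s, parallel_wall_s, n_cores):
    """Parallel efficiency = speedup / number of cores (1.0 is ideal)."""
    return speedup(serial_wall_s, parallel_wall_s) / n_cores


# Hypothetical solver wall times in seconds (setup and packup time excluded).
serial = 3600.0
parallel_48 = 100.0

print("speedup:   ", speedup(serial, parallel_48))
print("efficiency:", efficiency(serial, parallel_48, 48))
```

Using total simulation time instead of solver wall time in these formulas would add the same non-parallel overhead to every run and make the scaling look worse than it is.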

February 27, 2012, 05:34
issue solved
#7
New Member
Daniel Petersen (danieru)
Hi,

Just wanted to report back that we identified and solved the issue. It wasn't an issue with CFX itself, but rather a configuration issue with our job scheduler, SLURM. SLURM was confining all the processes to one core, hence the drop-off in CPU utilization per process.

Typically, SLURM likes to launch all the processes itself via srun, but since CFX spawns its own processes when running in parallel, we needed to explicitly tell SLURM to allocate the resources (in this case the number of cores) that CFX will launch processes for, even though SLURM itself isn't launching the processes. Once we had this configured correctly we got good performance from CFX, since it was able to utilize a core per process!

So, thanks for helping narrow down possible contributors to the problem :-)
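
For anyone chasing a similar symptom, the sketch below (Python, not part of the original post) shows one way to check how many cores a process is actually allowed to run on, and what per-process utilization to expect when several busy processes are squeezed onto too few cores. It assumes Linux, where os.sched_getaffinity is available, and falls back to os.cpu_count elsewhere; the helper names are hypothetical. With everything confined to one core, N processes get roughly 1/N of it each, which is about 25% for 4 processes and about 2% for 48, in line with the figures in post #1.

```python
import os


def allowed_core_count():
    """Number of CPU cores this process is permitted to run on.

    On Linux this reflects affinity masks set by a scheduler or cpuset;
    on other platforms it falls back to the total core count.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def expected_share(n_processes, n_allowed_cores):
    """Best-case fraction of a core each busy process can get."""
    return min(1.0, n_allowed_cores / n_processes)


print("cores available to this process:", allowed_core_count())
for n in (4, 8, 48):
    share = expected_share(n, 1)
    print(f"{n} processes confined to 1 core: {share:.1%} of a core each")
```

Running the first function from inside the batch job, rather than from a login shell, shows what the scheduler actually granted to the job.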

February 29, 2012, 07:53
#8
Senior Member
Join Date: Jul 2011
Location: Berlin, Germany
Posts: 173
Rep Power: 14
monkey1 is on a distinguished road
Hi danieru!
Just a little hint that we got from CFX support concerning CFX scaling on multiprocessors. To get a good speed-up, ANSYS recommends something like 250,000 cells per core. With fewer you will lose time due to inter-core communication, and with a lot more each core will simply be "overloaded".
So for your case of about 7 million cells you should use 28 cores to see a speed-up.
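
The arithmetic behind that figure, as a small Python sketch. The function name is hypothetical, and the 250,000 cells-per-core target is only the rule of thumb quoted above, so it is left as an adjustable parameter.

```python
import math


def suggested_cores(total_cells, cells_per_core=250_000):
    """Core count that gives roughly cells_per_core cells on each core."""
    return max(1, math.ceil(total_cells / cells_per_core))


# About 7 million cells at 250,000 cells per core.
print(suggested_cores(7_000_000))
```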

February 29, 2012, 08:30
#9
New Member
Daniel Petersen (danieru)
Quote:
Originally Posted by monkey1
Hi danieru!
Just a little hint that we got from CFX support concerning CFX scaling on multiprocessors. To get a good speed-up, ANSYS recommends something like 250,000 cells per core. With fewer you will lose time due to inter-core communication, and with a lot more each core will simply be "overloaded".
So for your case of about 7 million cells you should use 28 cores to see a speed-up.
Hi monkey1,

That's great info! We'll pass that on to CFX users on our cluster. Thanks for taking a moment to post:-)

February 29, 2012, 09:17
#10
Far
Super Moderator
Sijal
Join Date: Mar 2009
Location: Islamabad
Posts: 4,553
Blog Entries: 6
Rep Power: 54
Far has a spectacular aura about
Good info. Thanks.

February 29, 2012, 09:20
#11
Senior Member
Lance
Join Date: Mar 2009
Posts: 669
Rep Power: 22
Lance is on a distinguished road
We've seen almost linear scaling up to 256 cores with 50,000 hexas per core, so I would say that the optimum number of cells per core is both problem- and cluster-dependent.

The CFX manual talks about a minimum of 30,000 nodes per partition for tetrahedral meshes and 75,000 nodes per partition for hexahedral meshes, but actual numbers could be either lower or higher.
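
As a rough illustration of how those manual minimums translate into an upper bound on partition count, here is a short Python sketch. The function and dictionary names are hypothetical, the mesh sizes in the example are invented, and the thresholds are simply the two figures quoted above. Note that they are in mesh nodes per partition, not cells.

```python
# Minimum mesh nodes per partition, as quoted from the CFX manual above.
MIN_NODES_PER_PARTITION = {"tet": 30_000, "hex": 75_000}


def max_partitions(mesh_nodes, element_type):
    """Largest partition count that keeps every partition at or above
    the quoted minimum number of mesh nodes."""
    minimum = MIN_NODES_PER_PARTITION[element_type]
    return max(1, mesh_nodes // minimum)


# Invented example meshes, for illustration only.
print(max_partitions(3_000_000, "tet"))
print(max_partitions(7_500_000, "hex"))
```

Since actual numbers can be lower or higher, a result like this is a starting point for a scaling test, not a limit.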

All times are GMT -4.