Diagnosing runtime slow-down across clusters

Old   October 23, 2019, 10:23
Default Diagnosing runtime slow-down across clusters
  #1
jmt
New Member
 
Julian
Join Date: Sep 2019
Posts: 5
Rep Power: 2
jmt is on a distinguished road
Good morning,

I have been running CONVERGE on a local cluster and recently moved over to my university's cluster and noticed a large slow-down for similar simulation sizes. I know that it can be difficult to pinpoint the problem, but I'm hoping to get some advice on what to try.

I compared a non-reacting gas-turbine case on one node with 36 processors on both clusters. The wall-clock time per time-step is about an order of magnitude higher on the university cluster (~120 sec per time-step versus ~13 sec per time-step on the local cluster).

Local cluster:

* CONVERGE 2.4.27 MPICH, Linux64
* Jobs submitted directly via `mpirun`
* AMD Opteron(tm) Processor 6380
* 2500 MHz, 2048 KB cache size

University cluster:

* CONVERGE 2.4.28, OpenMPI, Linux64
* Jobs managed by SLURM
* Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz



Please let me know if I can provide more information. Thanks.

Old   October 28, 2019, 10:08
Default
  #2
New Member
 
azandian's Avatar
 
Arash Zandian
Join Date: Oct 2019
Location: Convergent Science, Madison WI
Posts: 5
Rep Power: 2
azandian is on a distinguished road
Julian,
As you said, a direct comparison is not possible at this point because you are running different versions with different executables on different machines, but here are some suggestions:

- For optimum speedup, it is best to use the full node, i.e. all the processors on a node.

- Try running your case interactively by SSHing into the university node and running from there using mpirun. Remember to load all the required modules before running.

- Check the last few lines of the logfile to see the details of your simulation run time. A cluster with faster processors would give a shorter runtime for solving the transport equations; however, the runtime is also affected by the bandwidth and capabilities of your inter-node connection. A high-speed interconnect would give better speedup than Ethernet for inter-processor communication.

- Check your load balance, either from the logfile or from metis_map.out, to make sure that all the processors have some cells to work with. If one rank has too many cells and another has 0 cells, you are not taking advantage of the full computational power of the node. In that case, either change your parallel_scale (make it -1 in inputs.in) or fix your base_grid and embed_scales so that the load is distributed properly among the processors.

- Finally, specifying the number of processors per node generally gives a better outcome. You may specify it in your job submission script if you are using SLURM, or you can add it as a flag to your mpirun (-ppn <No. processors per node>) to make sure that the node uses all its processors for the computation.
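
The load-balance check suggested above can be sketched in a few lines of Python. This is only an illustration: the function name `load_balance_report` and the input list are hypothetical, and it assumes you have already extracted one cell count per rank from your own logfile or metis_map.out (the layout of those files is not shown in this thread, so no parsing is attempted here).

```python
def load_balance_report(cells_per_rank):
    """Summarize how evenly cells are spread over the MPI ranks.

    cells_per_rank: list of cell counts, one entry per rank
    (hypothetical input; fill it from your own run output).
    """
    if not cells_per_rank:
        raise ValueError("need at least one rank")
    average = sum(cells_per_rank) / len(cells_per_rank)
    # Ranks with zero cells do no work and waste a processor.
    idle_ranks = [rank for rank, n in enumerate(cells_per_rank) if n == 0]
    # Busiest rank relative to the average; 1.0 means perfectly balanced.
    imbalance = max(cells_per_rank) / average if average > 0 else float("inf")
    return {"idle_ranks": idle_ranks, "imbalance": imbalance}


# Made-up counts for a 4-rank run, for illustration only.
report = load_balance_report([1200, 1180, 0, 2420])
print("idle ranks:", report["idle_ranks"])
print("imbalance: %.2f" % report["imbalance"])
```

An imbalance well above 1 or any idle rank would point toward the parallel_scale, base_grid, or embed_scales changes mentioned in the list above.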

Please contact CONVERGE support team at support@convergecfd.com if you have further questions or concerns.
__________________
Arash Zandian,
Research engineer, Applications

(608) 230-1580
convergecfd.com

Old   October 31, 2019, 07:02
Default
  #3
jmt
New Member
 
Julian
Join Date: Sep 2019
Posts: 5
Rep Power: 2
jmt is on a distinguished road
Thank you very much. I will report back if I have success!

Old   November 20, 2019, 09:47
Default
  #4
jmt
New Member
 
Julian
Join Date: Sep 2019
Posts: 5
Rep Power: 2
jmt is on a distinguished road
Arash,

Thanks to your help, I solved my problem. This may depend on my HPC setup, but I changed my SLURM submission script as shown below to obtain better performance.

Now I am running simulations of similar size about four times faster on the university cluster. To run on 108 procs, I replaced the two options in the original script (first block) with the single option in the modified script (second block).
Code:
...
--ntasks=3
--cpus-per-task=36
...
Modified:
Code:
...
--ntasks=108
...

Old   November 20, 2019, 10:05
Default
  #5
Member
 
Join Date: Jun 2016
Posts: 42
Rep Power: 5
EDE16 is on a distinguished road
Hi, this interests me.

Can you tell me whether you were already running 108 procs with the original script, or whether you only ran 108 as a consequence of the new script, and that is why you saw the speed increase?

I have experienced numerous issues related to running on clusters and the MPI used. After changing from Open MPI to PMPI from IBM, it worked much better, but I am keen to know if this can be improved further.

Not sure if it matters, but I use Sun Grid Engine (SGE) rather than Slurm. Does anyone have experience with both and can recommend Slurm over SGE for any reason?

Thanks

Old   November 20, 2019, 10:23
Default
  #6
jmt
New Member
 
Julian
Join Date: Sep 2019
Posts: 5
Rep Power: 2
jmt is on a distinguished road
I was only running on three procs with the original script. The cluster support team had suggested that the original options would result in a simulation using ntasks × cpus-per-task processors, but that was not the case. This was all with OpenMPI.
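
The difference between the two scripts can be sketched as below. This assumes the usual SLURM behavior: `--ntasks` sets the number of tasks (one MPI rank each under a typical launcher), while `--cpus-per-task` only reserves cores for each task, which a non-threaded MPI rank does not use. `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` are the environment variables SLURM exports inside a job; the helper name `slurm_layout` is hypothetical, and how a given site and MPI build behave may differ.

```python
import os


def slurm_layout(env=None):
    """Return (mpi_ranks, cores_per_rank, cores_reserved) for a SLURM job.

    Reads SLURM_NTASKS and SLURM_CPUS_PER_TASK; a missing variable is
    treated as 1. Pass a dict to test without being inside a job.
    """
    env = os.environ if env is None else env
    ranks = int(env.get("SLURM_NTASKS", "1"))
    cores_per_rank = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    return ranks, cores_per_rank, ranks * cores_per_rank


# Original script: 108 cores reserved, but only 3 MPI ranks started.
original = slurm_layout({"SLURM_NTASKS": "3", "SLURM_CPUS_PER_TASK": "36"})
# Modified script: 108 MPI ranks, one core each.
modified = slurm_layout({"SLURM_NTASKS": "108"})
print(original)  # (3, 36, 108)
print(modified)  # (108, 1, 108)
```

Calling `slurm_layout()` with no argument at the top of a job is a quick way to confirm how many ranks will actually be launched before the solver starts.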