Diagnosing runtime slow-down across clusters

jmt · October 23, 2019, 09:23

Good morning,

I have been running CONVERGE on a local cluster and recently moved over to my university's cluster and noticed a large slow-down for similar simulation sizes. I know that it can be difficult to pinpoint the problem, but I'm hoping to get some advice on what to try.

I compared a non-reacting gas-turbine case on one node with 36 processors on both the local and university clusters, and the wallclock time per time-step is an order of magnitude slower on the university cluster (~13 sec per time-step on the local cluster versus ~120 sec per time-step on the university cluster).

Local cluster:

* CONVERGE 2.4.27, MPICH, Linux64
* Jobs submitted directly via `mpirun`
* AMD Opteron(tm) Processor 6380
* 2500 MHz, 2048 KB cache size

University cluster:

* CONVERGE 2.4.28, OpenMPI, Linux64
* Jobs managed by SLURM
* Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Please let me know if I can provide more information. Thanks.

azandian · October 28, 2019, 09:08

Julian,

As you said, a direct comparison is not possible at this point because you are running different versions with different executables on different machines, but here are some suggestions:

- It is best to use the full capacity of a node, i.e. all of its processors, for optimum speedup.

- Try running your case interactively by SSHing into the university node and running from there using mpirun. Remember to load all the required modules before running.

- Check the last few lines of the log file to see the details of your simulation run time. A cluster with faster processors should solve the transport equations in less time; however, the run time is also affected by the bandwidth and capabilities of the connection between nodes. A high-speed interconnect gives better speedup than Ethernet for inter-processor communication.

- Check your load balance, either from the log file or from metis_map.out, to make sure that all the processors have cells to work with. If one rank has too many cells and another has 0 cells, you are not taking advantage of the full computational power of the node. In that case, either change your parallel_scale (set it to -1 in inputs.in) or adjust your base_grid and embed_scales so that the load is distributed properly among the processors.

- Finally, specifying the number of processors per node generally gives a better outcome. You can specify it in your job submission script if you are using SLURM, or add it as a flag to your mpirun command (-ppn <No. processors per node>), to make sure that the node uses all its processors for the computation.
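
As a minimal sketch of the last suggestion, a SLURM batch script can request every core on a node as a separate MPI rank. The value 36 matches the node described earlier in this thread. The executable name `converge` and the `CONVERGE_EXE` variable are placeholders, not names taken from CONVERGE documentation, and the script only prints the launch line so it can be inspected before a real run.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=36            # total MPI ranks
#SBATCH --ntasks-per-node=36   # one rank per core (assumes 36 cores per node)

# SLURM exports SLURM_NTASKS inside a job; default to 36 when run outside SLURM.
NP="${SLURM_NTASKS:-36}"

# Placeholder for the solver executable; adjust to your installation.
CONVERGE_EXE="${CONVERGE_EXE:-converge}"

# Build the launch line and print it for inspection.
LAUNCH_CMD="mpirun -np ${NP} ${CONVERGE_EXE}"
echo "${LAUNCH_CMD}"

# Once the required modules are loaded, replace the echo with the actual launch:
# ${LAUNCH_CMD}
```

Printing the command first makes it easy to confirm that the rank count is what you expect before spending cluster time.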

Please contact the CONVERGE support team at support@convergecfd.com if you have further questions or concerns.

jmt · October 31, 2019, 06:02

Thank you very much. I will report back if I have success!

jmt · November 20, 2019, 08:47

Arash,

Thanks to your help, I solved my problem. This may be specific to my HPC setup, but I changed my SLURM submission script as shown below and got better performance.

Simulations of similar size now run about four times faster on the uni cluster. To run on 108 procs, I replaced the two options in the original script with the single option in the modified script.

Original:

Code:

...
--ntasks=3
--cpus-per-task=36
...

Modified:

Code:

...
--ntasks=108
...
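
One way to catch this kind of mismatch early is a guard near the top of the job script that compares the rank count SLURM actually granted with the count you intended. This is only a sketch: `check_rank_count` is a hypothetical helper, and it relies on `SLURM_NTASKS`, which SLURM sets inside a job to the value of `--ntasks`.

```shell
# Hypothetical helper: fail early if the job has fewer MPI ranks than intended.
# $1 = intended number of ranks
# $2 = granted ranks (defaults to SLURM_NTASKS, or 1 if that is unset)
check_rank_count() {
    local expected="$1"
    local granted="${2:-${SLURM_NTASKS:-1}}"
    if [ "${granted}" -lt "${expected}" ]; then
        echo "WARNING: expected ${expected} MPI ranks but the job has ${granted}" >&2
        return 1
    fi
    echo "OK: ${granted} MPI ranks"
}

# With the original options (--ntasks=3) the guard trips; with --ntasks=108 it passes.
check_rank_count 108 3 || echo "rank count too low, not launching"
check_rank_count 108 108
```

Inside a real job the second argument would be omitted, so the helper reads the granted count from the environment.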

EDE16 · November 20, 2019, 09:05

Hi, this interests me.

Can you tell me whether you were running 108 procs with the original script, or whether you only got 108 as a consequence of the new script, and that is why you saw the speed increase?

I have experienced numerous issues related to running on clusters and the MPI used. After changing from Open MPI to PMPI from IBM, it worked much better, but I am keen to know if this can be improved yet again.

Not sure if it matters, but I use Sun Grid Engine (SGE) rather than Slurm. Does anyone have experience of both and can recommend Slurm over SGE for any reason?

Thanks

jmt · November 20, 2019, 09:23

With the original script, I was only running on three procs. The cluster support team had suggested that the original options would run the simulation on $\text{ntasks} \times \text{cpus-per-task}$ processors, but that was not the case. This was all with OpenMPI.
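
The arithmetic behind this can be sketched as follows. In SLURM, `--ntasks` sets the number of tasks (the MPI ranks in a pure-MPI run), while `--cpus-per-task` reserves cores for each task, which only helps if each rank runs several threads. The helper `summarize_request` is hypothetical, and the idle-core count assumes one thread per rank, which is consistent with what I observed but is still an assumption.

```shell
# Hypothetical helper: show what a SLURM request means for a pure-MPI solver.
# $1 = --ntasks, $2 = --cpus-per-task (SLURM's default is 1)
summarize_request() {
    local ntasks="$1"
    local cpus_per_task="${2:-1}"
    local cores=$(( ntasks * cpus_per_task ))
    # Assumes each rank uses exactly one core; the rest of its reservation sits idle.
    echo "ranks=${ntasks} cores_allocated=${cores} cores_idle=$(( cores - ntasks ))"
}

summarize_request 3 36   # original script: ranks=3 cores_allocated=108 cores_idle=105
summarize_request 108    # modified script: ranks=108 cores_allocated=108 cores_idle=0
```

Both requests reserve 108 cores, but only the second one starts 108 MPI ranks.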
