# Optimum way for running simulation in parallel


 March 8, 2021, 23:02 Optimum way for running simulation in parallel #1 Senior Member   krishna kant Join Date: Feb 2016 Location: Hyderabad, India Posts: 133 Rep Power: 10

Hello All,

I am running a multiphase flow simulation in parallel and there is a huge difference between the execution time and the clock time. I want to know the possible reasons for it. I am attaching an instance of my log data and my system info here.

Code:
```
PIMPLE: iteration 1
Selected 0 split points out of a possible 0.
Number of isoAdvector surface cells = 0
isoAdvection: Before conservative bounding: min(alpha) = 0, max(alpha) = 1 + -1
isoAdvection: After conservative bounding: min(alpha) = 0, max(alpha) = 1 + -1
isoAdvection: time consumption = 1%
Phase-1 volume fraction = 0  Min(alpha.water) = 0  1 - Max(alpha.water) = 1
solve the reinitialization equation
Interpolation routine for interface normal
Curvature Calculation
Creating isoSurface
Interpolating Curvature from iso-surface to cell centers
smoothSolver: Solving for Ux, Initial residual = 0.000593322427, Final residual = 1.70711359e-09, No Iterations 3
smoothSolver: Solving for Uy, Initial residual = 0.00260626766, Final residual = 6.52222249e-09, No Iterations 3
smoothSolver: Solving for Uz, Initial residual = 0.000199399075, Final residual = 1.1910404e-09, No Iterations 3
GAMG: Solving for p_rgh, Initial residual = 0.00639449018, Final residual = 3.29043924e-05, No Iterations 3
time step continuity errors : sum local = 3.53346848e-09, global = 4.19805384e-11, cumulative = 3.24236402e-08
GAMG: Solving for p_rgh, Initial residual = 0.000340716515, Final residual = 3.28157441e-06, No Iterations 3
time step continuity errors : sum local = 3.52358875e-10, global = -6.40922139e-12, cumulative = 3.2417231e-08
GAMG: Solving for p_rgh, Initial residual = 4.49729204e-05, Final residual = 7.00905977e-09, No Iterations 15
time step continuity errors : sum local = 7.51373052e-13, global = -1.42068391e-14, cumulative = 3.24172168e-08
smoothSolver: Solving for k, Initial residual = 0.000333755776, Final residual = 6.7287304e-07, No Iterations 1
ExecutionTime = 24298.66 s  ClockTime = 121287 s
```

Code:
```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               1200.000
BogoMIPS:              4589.05
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
```

I am running 10 simulations, each using 4 processors. The grid size is approx 22K cells for a 2D case.

 March 9, 2021, 06:00 #2 New Member   Icaro Amorim de Carvalho Join Date: Dec 2020 Posts: 24 Rep Power: 5

Hi Krishna,

I sometimes get confused by the output of `lscpu`, so I hope I am not saying something wrong. I would suggest you try running these 10 simulations with 2 processors each and compare the CPU time with the clock time. I say this because I suspect you actually have only 20 physical cores, and the way you are running now, you are also using virtual (hyper-threaded) cores, which OpenFOAM does not take advantage of. Hope that helps.
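To quantify the comparison Icaro suggests, the ratio of ExecutionTime (CPU time) to ClockTime (wall time) can be pulled straight from the log. A minimal sketch, using the ExecutionTime line from post #1 as sample data (in practice you would `grep ExecutionTime log.* | tail -1` on your own log file):

```shell
# Parse an OpenFOAM ExecutionTime/ClockTime line and compute their ratio.
# Sample line copied from the log in post #1.
line='ExecutionTime = 24298.66 s  ClockTime = 121287 s'
exec_t=$(echo "$line" | awk '{print $3}')
clock_t=$(echo "$line" | awk '{print $7}')
# A ratio near 1.0 means the solver is CPU-bound; much less than 1.0 means
# time is being lost outside computation (I/O, MPI waits, oversubscription).
efficiency=$(awk -v e="$exec_t" -v c="$clock_t" 'BEGIN {printf "%.2f", e/c}')
echo "efficiency = $efficiency"   # prints "efficiency = 0.20" for this log
```

A ratio of 0.20 means the process got the CPU only a fifth of the time it was nominally running, which is exactly the symptom being discussed.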

 March 9, 2021, 12:02 #3 Senior Member   Domenico Lahaye Join Date: Dec 2013 Posts: 736 Blog Entries: 1 Rep Power: 17

ClockTime might be higher due to reading/writing files and due to communication between processors.

 March 10, 2021, 10:56 #4 Senior Member   Klaus Join Date: Mar 2009 Posts: 261 Rep Power: 22

- Do I understand correctly that you have two nodes, each with two sockets, and each socket with 10 physical cores?
- Make sure you switch off SMT/Hyper-Threading and use only physical cores!
- How fast is the link between your two nodes (InfiniBand or something slower)?
- Maybe you use too many cores for your small test case and waste time on "unnecessary" communication (see discussion: MPIRun How many processors).
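The first two points above can be checked from the shell. A read-only sketch, assuming a Linux box with util-linux `lscpu` available (the SMT disable step is shown only as a comment because it needs root and a reasonably recent kernel):

```shell
# Derive the physical-core count from lscpu topology fields.
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
cores=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
physical=$((sockets * cores))
echo "physical cores: $physical"
# On the lscpu output in post #1 this would be 2 sockets x 10 cores = 20,
# i.e. half of the 40 logical CPUs the scheduler advertises.
# To disable SMT system-wide (root, kernel >= 4.19):
#   echo off | sudo tee /sys/devices/system/cpu/smt/control
```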

 March 10, 2021, 13:32 #5 Senior Member   Join Date: Apr 2020 Location: UK Posts: 670 Rep Power: 14

Have you tried running `top`? Just type it at the command line and it will tell you how busy the processors are, which also lets you check Domenico's suggestion. For example, if all is working smoothly, the processes for each run should be steaming away at 100% CPU. If they are always far below 100%, there is probably a bottleneck in the communication, or you are overloading the cores. If they sit at 100% for a while, then drop to something small before returning to 100%, there may be a disk-writing bottleneck, etc.
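For logging rather than watching interactively, the same information can be captured non-interactively with `ps`; a sketch (the column names are standard procps, nothing OpenFOAM-specific):

```shell
# Snapshot the busiest processes: %CPU plus the core (PSR) each process
# last ran on. Healthy parallel solver ranks sit near 100% on distinct
# cores; values stuck around 40-50% suggest oversubscription or waits.
ps -eo pid,psr,%cpu,comm --sort=-%cpu | head -8
```

Running this in a loop (e.g. under `watch` or with `sleep` in a script) gives a time history of the 100%-then-drop pattern described above.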

 April 11, 2021, 03:33 #6 Senior Member   krishna kant Join Date: Feb 2016 Location: Hyderabad, India Posts: 133 Rep Power: 10

Hello All,

Thank you for all the suggestions, and I apologize for my late reply; I was so involved with ICLASS those days that I missed the reply notifications in my email. I am now running another simulation with 1.125M cells, again using 4 processors each. The problem persists, and it seems to be one of multithreading and CPU utilization. The `top` command gives me this:

Code:
```
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16526 Rajesh    20   0 1822m 1.1g 7932 R 45.6  3.7  959:58.71 interFlowvAMR1
15585 Rajesh    20   0 1830m 993m 7788 R 43.8  3.1  945:34.24 interFlowvAMR1
15892 Rajesh    20   0 1797m 976m 8216 R 43.8  3.1  949:38.86 interFlowvAMR1
15893 Rajesh    20   0 1809m 1.0g 7600 R 43.8  3.2  945:08.07 interFlowvAMR1
16527 Rajesh    20   0 1813m 1.1g 8092 R 43.8  3.7  960:44.33 interFlowvAMR1
16524 Rajesh    20   0 1824m 1.2g 8120 R 42.0  3.7  958:22.57 interFlowvAMR1
15588 Rajesh    20   0 1826m 965m 7760 R 40.1  3.0  944:16.15 interFlowvAMR1
15894 Rajesh    20   0 1813m 1.0g 7844 R 40.1  3.2  947:49.50 interFlowvAMR1
16852 Rajesh    20   0 1815m 1.2g 7792 R 40.1  3.8  956:07.44 interFlowvAMR1
15586 Rajesh    20   0 1821m 1.0g 7824 R 38.3  3.3  946:58.00 interFlowvAMR1
16219 Rajesh    20   0 1808m 1.1g 8196 R 38.3  3.6  953:12.18 interFlowvAMR1
16221 Rajesh    20   0 1826m 1.1g 8180 R 38.3  3.6  954:00.33 interFlowvAMR1
15891 Rajesh    20   0 1817m 1.0g 8192 R 36.5  3.3  948:09.53 interFlowvAMR1
16218 Rajesh    20   0 1830m 1.1g 8220 R 36.5  3.4  947:10.34 interFlowvAMR1
16525 Rajesh    20   0 1795m 1.1g 8144 R 36.5  3.7  958:52.91 interFlowvAMR1
15587 Rajesh    20   0 1824m 1.1g 7600 R 34.7  3.5  945:53.87 interFlowvAMR1
16220 Rajesh    20   0 1830m 1.1g 8032 R 34.7  3.4  948:53.46 interFlowvAMR1
16851 Rajesh    20   0 1862m 1.3g 7760 R 34.7  4.0  955:12.77 interFlowvAMR1
16853 Rajesh    20   0 1830m 1.2g 7568 R 34.7  3.9  958:54.66 interFlowvAMR1
16854 Rajesh    20   0 1831m 1.2g 7772 R 31.0  3.9  956:42.15 interFlowvAMR1
```

CPU utilization is only around 40%, even though I am using only 20 CPUs. So I checked for multithreading with Code: `grep -i 'ht' /proc/cpuinfo` and got the following (the same flags line is printed once per logical CPU; the `ht` flag confirms Hyper-Threading is enabled):

Code:
```
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm invpcid_single ssbd pti retpoline ibrs ibpb tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm cqm_llc cqm_occup_llc flush_l1d
```

 April 11, 2021, 03:48 #7 Senior Member   krishna kant Join Date: Feb 2016 Location: Hyderabad, India Posts: 133 Rep Power: 10

Is there any command to switch off multithreading in OpenFOAM? It is using virtual CPUs even when I try to use only physical CPUs by limiting the run to 20 CPUs.
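A note on this last question: OpenFOAM itself has no switch for this; core placement is decided by the MPI launcher. A sketch assuming Open MPI (flag names differ for MPICH and Intel MPI); `interFlowvAMR1` is the solver name from the `top` output above, and the case is assumed to be already decomposed:

```shell
# Bind each MPI rank to its own physical core instead of letting the
# scheduler migrate ranks onto hyper-threaded siblings (Open MPI syntax).
mpirun --bind-to core --map-by core -np 4 interFlowvAMR1 -parallel

# Or restrict the run to an explicit core set; 0-9 are the NUMA node0
# physical cores in the lscpu output from post #1, so all ranks stay on
# one socket and avoid cross-socket traffic.
mpirun --cpu-set 0-9 --bind-to core -np 4 interFlowvAMR1 -parallel
```

With binding in place, `top` (press `f` to enable the `P` last-used-CPU column) should show each rank pinned to a distinct physical core at close to 100% CPU.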