
Scaling of parallel computation? Solver/thread count combinations?



February 2, 2017, 11:01 | #1 | tdof (Member)
Hi,

I'm currently looking into the parallel scaling of OpenFOAM 4.0 and OpenFOAM Extend 3.1 on my local machine (i7 6800k, 6C/12T @ 4 GHz; 32 GB DDR4-2666 quad-channel; Windows 7 Ultimate 64-bit; OpenFOAM running in Linux VMs under VirtualBox), using 4, 8 and 12 threads. I've read a bit about parallel scaling, but I've noticed behaviour that is strange, at least to me. After reading this

https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje

and this PDF

http://www.dtic.mil/get-tr-doc/pdf?AD=ADA612337

I was quite confident that I'd get a nice, approximately linear speedup on my little processor, but that wasn't the case at all. I started with a laminar Hagen-Poiseuille pipe flow with about 144k cells and pisoFoam. Using 12 threads gave the slowest simulation, 8 threads were a little faster, and 4 threads were somewhere in between. I figured the case was too small to profit from 12 subdomains, so I tested a lid-driven cavity flow at Re = 1000, again with pisoFoam, on 1.0E6 cells, i.e. roughly 83.3E3 cells per thread. Interestingly, 12 threads were again the slowest, 8 threads were the fastest, and 4 threads were in between. In OF Extend, 4 threads were actually the fastest. I've read the following here in the forum:

Quote:
The multigrid solvers (GAMG) are quick (in terms of walltime) but do not scale well in parallel at all. They require around 100k cells/process for the parallel efficiency to be acceptable.
The conjugate gradient solvers (PCG) are slow in terms of walltime but scale extremely well in parallel. As low as 10k cells/process can be effective.
I've tried GAMG as well as PCG/PBiCG for pressure and velocity, and mixtures of both. The diagrams from the Vilje cluster even show superlinear speedup with up to 100 parallel processes, so why am I not getting at least an approximately linear speedup with only 12 threads? I've tested the simple and scotch decomposition methods as well as renumberMesh, but saw no difference. Could the virtual machines be the reason, or am I missing something? Scalability apparently depends on the solver as well, but I can't imagine the overhead being that bad at such a small level of parallelization.
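For context, the dictionary entries I'm varying look roughly like this (a sketch with illustrative, untuned values; my actual tolerances may differ):

Code:
// system/decomposeParDict (illustrative)
numberOfSubdomains 12;
method             scotch;  // alternative: simple, with simpleCoeffs { n (3 2 2); }

// system/fvSolution -- the two pressure-solver variants tested
p
{
    solver          GAMG;        // multigrid variant
    smoother        GaussSeidel;
    tolerance       1e-06;
    relTol          0.01;
}
// ...or, alternatively...
p
{
    solver          PCG;         // conjugate-gradient variant
    preconditioner  DIC;
    tolerance       1e-06;
    relTol          0.01;
}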



Cavity, 1M cells, GAMG/GAMG solving for p/U (OpenFOAM 4.0):
12 threads: 726 s walltime
8 threads: 576 s
4 threads: 691 s

Cavity, 1M cells, GAMG/GAMG solving for p/U (OF Extend 3.1):
12 threads: 1044 s walltime
8 threads: 613 s
4 threads: 592 s

The laminar pipe-flow case scales about as badly. What is the cause? I'd appreciate any help. Oh, I forgot: I use OpenMPI and start the cases with "mpirun -np <num_of_processes> foamJob pisoFoam -parallel", which should be correct.
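For reference, the two conventional invocations are to either call mpirun directly or let the foamJob wrapper start MPI itself; as far as I know, foamJob already invokes mpirun when passed -parallel, so combining the two as above should not be necessary (a sketch, with the process count as a placeholder):

Code:
# Option 1: call mpirun directly on the solver
mpirun -np 6 pisoFoam -parallel > log.pisoFoam 2>&1

# Option 2: let foamJob start MPI; it reads numberOfSubdomains
# from system/decomposeParDict
foamJob -parallel pisoFoam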

February 2, 2017, 12:03 | #2 | khedar (Senior Member)
1. Can you share the walltime for 1 thread?
2. Maybe it's because of the virtual machines?
3. Maybe your processor's cache is smaller than that of the one quoted in the study.

February 3, 2017, 03:48 | #3 | tdof (Member)
1. 1681 s
2. Probably; I'll try to run some benchmarks on a native Linux machine.
3. The cache size per core is actually the same: 20 MB for 8 cores on the Xeon E5 2687W and 15 MB for 6 cores on the i7 6800k, i.e. 2.5 MB per core in both cases.

February 3, 2017, 09:03 | #4 | tomf (Tom Fahner, Senior Member, Breda, Netherlands)
Hi,

Since you only have 6 physical cores, you cannot expect any improvement from using more than 6 processes (read the section on hyperthreading in the PDF). The virtualisation may also hurt a bit. I would advise running on 1, 2, 4 and 6 cores; a sketch of such a sweep follows below. For large enough cases (100k+ cells) I would expect 6 cores to be fastest. However, you have 4 memory channels, so scaling may already drop below linear beyond 4 processes, because 6 cores are then trying to reach memory over only 4 channels.
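A minimal sketch of the sweep (pisoFoam and the log names are placeholders for your own case; foamDictionary ships with OpenFOAM 4.x, for older versions edit decomposeParDict by hand):

Code:
#!/bin/sh
# Run the same case on 1, 2, 4 and 6 cores and print each wall time.
# Clean old time/processor directories between runs if results pile up.
for n in 1 2 4 6
do
    if [ "$n" -eq 1 ]
    then
        start=$(date +%s)
        pisoFoam > log.serial 2>&1          # serial reference run
        end=$(date +%s)
    else
        foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
        decomposePar -force > log.decomposePar.$n 2>&1
        start=$(date +%s)
        mpirun -np $n pisoFoam -parallel > log.$n 2>&1
        end=$(date +%s)
    fi
    echo "$n process(es): $((end - start)) s"
done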

Regards,
Tom

February 3, 2017, 10:33 | #5 | tdof (Member)
Thanks for your reply, it helped a lot. I had read the HT part, but I didn't see any setup info there. I thought you were supposed to use the number of threads, since my CPU is "only" under about 50% load when using 6 processes, and figured you'd have to use them all. But you're right: I now get the fastest result using 6 processes.



Still, the speedup isn't quite as good as I hoped. Only about 3 times faster with 6 times as many processes seems bad; that is a parallel efficiency of roughly 50%. I'm going to investigate this on our cluster and tinker a bit with GAMG/PCG, solvers and cell count.

January 13, 2022, 10:42 | #6 | SonnyD (New Member)
Hi guys,

I guess using hyperthreading generally does not work well for simulations. In fact, I disabled hyperthreading on my desktop PC, which is a 6-core i7 as well.

GAMG also adds computational expense in parallel, because agglomeration can extend across the decomposed mesh interfaces; you can read about this in the OF user manual. As I understand it, there are currently special agglomeration algorithms available for GAMG that reduce the additional inter-processor communication, but for me those didn't show any benefit (maybe I applied them in the wrong way); see the sketch below. However, my benchmark was very small and only consisted of a few simulations (conducted on an HPC cluster).
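For what it's worth, the entries I tried looked roughly like this; processorAgglomerator only exists in newer OpenFOAM versions and keyword support varies by release, so treat this as a starting point rather than a recommendation:

Code:
// system/fvSolution -- GAMG with processor agglomeration (newer versions;
// values are illustrative, not tuned)
p
{
    solver                 GAMG;
    smoother               GaussSeidel;
    nCellsInCoarsestLevel  100;            // default is 10
    processorAgglomerator  masterCoarsest; // gather coarse levels onto fewer ranks
    tolerance              1e-06;
    relTol                 0.01;
}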

Preconditioners are mostly inconsistent in parallel; only the diagonal preconditioner seems to behave the same regardless of the decomposition.
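A minimal sketch of that setup (the tolerances are illustrative placeholders):

Code:
// system/fvSolution -- PCG with the diagonal preconditioner
p
{
    solver          PCG;
    preconditioner  diagonal;
    tolerance       1e-06;
    relTol          0.01;
}

As far as I understand, the diagonal preconditioner only uses each cell's own matrix coefficient, so it needs no processor-boundary information and gives the same preconditioning for any processor count, whereas DIC/DILU are built per subdomain and therefore change with the decomposition.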



Maybe using more cores on your desktop slows it down because a normal desktop CPU architecture is not specifically designed for parallel simulations.

Also, background processes such as the OS and other applications need some computational capacity, which is then not available for the simulation.


Hope that helps, at least a little.

January 13, 2022, 11:09 | #7 | tdof (Member)
Yes, I wasn't really aware of the HT issue at the time, but after almost 5 years I've updated my knowledge a little. In retrospect it seems obvious to only start as many parallel processes as there are physical CPU cores.

January 13, 2022, 11:24 | #8 | SonnyD (New Member)
Yes, that's true, it is an old thread. To be honest, I only saw the date after my post.

But I just thought I might say something that could be helpful in case anyone else has the same problems, so I left it there.


Actually, after 5 years I guess you have much more experience with this than I have.

January 13, 2022, 11:39 | #9 | tdof (Member)
It surely is helpful for other people who might stumble on this thread.

As for the experience: maybe, maybe not; you never know who you're dealing with. At the moment I don't care much about scaling and just guesstimate how many cores to use. If it's not the optimal number, so be it. Since I don't deal with large cases too often and don't even use OF anymore, it's not that important.
