
Scaling of parallel computation? Solver/thread count combinations?



February 2, 2017, 11:01 | #1 | tdof (Member)
Hi,

I'm currently looking into the parallel scaling of OpenFOAM 4.0 and OpenFOAM Extend 3.1 on my local machine (i7 6800k, 6C/12T @ 4 GHz; 32 GB DDR4-2666 quad-channel; Windows 7 Ultimate 64-bit; OpenFOAM running in Linux VMs under VirtualBox), using 4, 8 and 12 threads. I've read a bit about parallel scaling, but I've noticed behaviour that is strange, at least to me. After reading this

https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje

and this PDF

http://www.dtic.mil/get-tr-doc/pdf?AD=ADA612337

I was quite confident that I'd get a nice, approximately linear speedup on my little processor, but that wasn't the case at all. I started with a laminar Hagen-Poiseuille pipe flow with about 144k cells and pisoFoam. Using 12 threads gave the slowest simulation, 8 threads were a little faster, and 4 threads were somewhere in between. I figured the case was too small to profit from 12 subdomains, so I tested a lid-driven cavity flow at Re = 1000, again with pisoFoam, on 1.0E6 cells, i.e. roughly 83.3E3 cells per thread. Interestingly, 12 threads were again the slowest, 8 threads were the fastest, and 4 threads were in between. In OF Extend, 4 threads were actually the fastest. I've read the following here in the forum:

Quote:
The multigrid solvers (GAMG) are quick (in terms of walltime) but do not scale well in parallel at all. They require around 100k cells/process for the parallel efficiency to be acceptable.
The conjugate gradient solvers (PCG) are slow in terms of walltime but scale extremely well in parallel. As low as 10k cells/process can be effective.
I've tried GAMG as well as PCG/PBiCG for pressure and velocity, and mixtures of both. The diagrams from the Vilje cluster even show superlinear speedup with up to 100 parallel processes, so why am I not getting at least an approximately linear speedup with only 12 threads? I've tested the simple and scotch decomposition methods as well as renumberMesh, but saw no difference. Could the virtual machines be the reason, or am I missing something? Scalability apparently depends on the solver as well, but I can't imagine the overhead being that bad at such a small level of parallelization.
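For context, the dictionary entries I'm varying look roughly like this (a sketch with illustrative, untuned values; my actual tolerances may differ):

Code:
// system/decomposeParDict (illustrative)
numberOfSubdomains 12;
method             scotch;  // alternative: simple, with simpleCoeffs { n (3 2 2); }

// system/fvSolution -- the two pressure-solver variants tested
p
{
    solver          GAMG;        // multigrid variant
    smoother        GaussSeidel;
    tolerance       1e-06;
    relTol          0.01;
}
// ...or, alternatively...
p
{
    solver          PCG;         // conjugate-gradient variant
    preconditioner  DIC;
    tolerance       1e-06;
    relTol          0.01;
}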



Cavity, 1M cells, GAMG/GAMG solving for p/U (OpenFOAM 4.0):
12 threads: 726 s walltime
8 threads: 576 s
4 threads: 691 s

Cavity, 1M cells, GAMG/GAMG solving for p/U (OF Extend 3.1):
12 threads: 1044 s walltime
8 threads: 613 s
4 threads: 592 s

The laminar pipe-flow case scales about as badly. What is the cause? I'd appreciate any help. Oh, I forgot: I use OpenMPI and start the cases with "mpirun -np <num_of_processes> foamJob pisoFoam -parallel", which should be correct.
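For reference, the two conventional invocations are to either call mpirun directly or let the foamJob wrapper start MPI itself; as far as I know, foamJob already invokes mpirun when passed -parallel, so combining the two as above should not be necessary (a sketch, with the process count as a placeholder):

Code:
# Option 1: call mpirun directly on the solver
mpirun -np 6 pisoFoam -parallel > log.pisoFoam 2>&1

# Option 2: let foamJob start MPI; it reads numberOfSubdomains
# from system/decomposeParDict
foamJob -parallel pisoFoam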

February 2, 2017, 12:03 | #2 | khedar (Senior Member)
1. Can you share the walltime for 1 thread?
2. Maybe it's because of the virtual machines?
3. Maybe your processor's cache is smaller than that of the one quoted in the study.

February 3, 2017, 03:48 | #3 | tdof (Member)
1. 1681 s
2. Probably; I'll try to run some benchmarks on a native Linux machine.
3. The cache size per core is actually the same: 20 MB for 8 cores on the Xeon E5 2687W and 15 MB for 6 cores on the i7 6800k, i.e. 2.5 MB per core in both cases.

February 3, 2017, 09:03 | #4 | tomf (Tom Fahner, Senior Member, Breda, Netherlands)
Hi,

Since you only have 6 physical cores, you cannot expect any improvement from using more than 6 processes (read the section on hyperthreading in the PDF). The virtualisation may also hurt a bit. I would advise running on 1, 2, 4 and 6 cores; a sketch of such a sweep follows below. For large enough cases (100k+ cells) I would expect 6 cores to be fastest. However, you have 4 memory channels, so scaling may already drop below linear beyond 4 processes, because 6 cores are then trying to reach memory over only 4 channels.
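A minimal sketch of the sweep (pisoFoam and the log names are placeholders for your own case; foamDictionary ships with OpenFOAM 4.x, for older versions edit decomposeParDict by hand):

Code:
#!/bin/sh
# Run the same case on 1, 2, 4 and 6 cores and print each wall time.
# Clean old time/processor directories between runs if results pile up.
for n in 1 2 4 6
do
    if [ "$n" -eq 1 ]
    then
        start=$(date +%s)
        pisoFoam > log.serial 2>&1          # serial reference run
        end=$(date +%s)
    else
        foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
        decomposePar -force > log.decomposePar.$n 2>&1
        start=$(date +%s)
        mpirun -np $n pisoFoam -parallel > log.$n 2>&1
        end=$(date +%s)
    fi
    echo "$n process(es): $((end - start)) s"
done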

Regards,
Tom

February 3, 2017, 10:33 | #5 | tdof (Member)
Thanks for your reply, it helped a lot. I had read the HT part, but I didn't see any setup info there. I thought you were supposed to use the number of threads, since my CPU is "only" under about 50% load when using 6 processes, and figured you'd have to use them all. But you're right: I now get the fastest result using 6 processes.



Still, the speedup isn't quite as good as I hoped. Only about 3 times faster with 6 times as many processes seems bad; that is a parallel efficiency of roughly 50%. I'm going to investigate this on our cluster and tinker a bit with GAMG/PCG, solvers and cell count.

January 13, 2022, 10:42 | #6 | SonnyD (New Member)
Hi guys,

I guess using hyperthreading generally does not work well for simulations. In fact, I disabled hyperthreading on my desktop PC, which is a 6-core i7 as well.

GAMG also adds computational expense in parallel, because agglomeration can extend across the decomposed mesh interfaces; you can read about this in the OF user manual. As I understand it, there are currently special agglomeration algorithms available for GAMG that reduce the additional inter-processor communication, but for me those didn't show any benefit (maybe I applied them in the wrong way); see the sketch below. However, my benchmark was very small and only consisted of a few simulations (conducted on an HPC cluster).
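For what it's worth, the entries I tried looked roughly like this; processorAgglomerator only exists in newer OpenFOAM versions and keyword support varies by release, so treat this as a starting point rather than a recommendation:

Code:
// system/fvSolution -- GAMG with processor agglomeration (newer versions;
// values are illustrative, not tuned)
p
{
    solver                 GAMG;
    smoother               GaussSeidel;
    nCellsInCoarsestLevel  100;            // default is 10
    processorAgglomerator  masterCoarsest; // gather coarse levels onto fewer ranks
    tolerance              1e-06;
    relTol                 0.01;
}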

Preconditioners are mostly inconsistent in parallel; only the diagonal preconditioner seems to behave the same regardless of the decomposition.
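A minimal sketch of that setup (the tolerances are illustrative placeholders):

Code:
// system/fvSolution -- PCG with the diagonal preconditioner
p
{
    solver          PCG;
    preconditioner  diagonal;
    tolerance       1e-06;
    relTol          0.01;
}

As far as I understand, the diagonal preconditioner only uses each cell's own matrix coefficient, so it needs no processor-boundary information and gives the same preconditioning for any processor count, whereas DIC/DILU are built per subdomain and therefore change with the decomposition.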



Maybe using more cores on your desktop slows it down because a normal desktop CPU architecture is not specifically designed for parallel simulations.

Also, background processes such as the OS and other applications need some computational capacity, which is then not available for the simulation.


Hope that helps, at least a little.

January 13, 2022, 11:09 | #7 | tdof (Member)
Yes, I wasn't really aware of the HT issue at the time, but after almost 5 years I've updated my knowledge a little. In retrospect it seems obvious to only start as many parallel processes as there are physical CPU cores.

January 13, 2022, 11:24 | #8 | SonnyD (New Member)
Yes, that's true, it is an old thread. To be honest, I only saw the date after my post.

But I just thought I might say something that could be helpful in case anyone else has the same problems, so I left it there.


Actually, after 5 years I guess you have much more experience with this than I have.

January 13, 2022, 11:39 | #9 | tdof (Member)
It surely is helpful for other people who might stumble on this thread.

As for the experience: maybe, maybe not; you never know who you're dealing with. At the moment I don't care much about scaling and just guesstimate how many cores to use. If it's not the optimal number, so be it. Since I don't deal with large cases too often and don't even use OF anymore, it's not that important.
