Scaling of parallel computation? Solver/thread count combinations?
#1
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 11
Hi,

I'm currently looking into the scaling of OpenFOAM 4.0 and foam-extend 3.1 while running cases in parallel on my local machine (i7-6800K, 6C/12T @ 4 GHz; 32 GB DDR4-2666 quad-channel; Windows 7 64-bit Ultimate; OpenFOAM running in Linux VMs under VirtualBox), using 4, 8 and 12 threads respectively. I've searched a bit about parallel scaling, but I've noticed behaviour that is strange, at least to me. After reading https://www.hpc.ntnu.no/display/hpc/...mance+on+Vilje and this PDF http://www.dtic.mil/get-tr-doc/pdf?AD=ADA612337 I was quite confident that I'd get a nice, approximately linear speedup on my little processor, but that wasn't the case at all.

I started with a Hagen-Poiseuille laminar pipe flow with about 144k elements and pisoFoam. Using 12 threads gave the slowest simulation speed, 8 threads were a little faster, and 4 threads were somewhere in the middle. I figured the case was too small to profit from 12 domains, so I tested a lid-driven cavity flow at Re = 1000, again with pisoFoam, with 1.0e6 cells, i.e. roughly 83.3e3 cells per thread. Interestingly, using 12 threads was again the slowest, 8 threads were fastest and 4 threads were somewhere in the middle. In foam-extend, 4 threads were actually the fastest. I've also read related discussion here in the forum.
Cavity, 1M cells, GAMG/GAMG solving for p/U (OpenFOAM 4.0):
12 threads: 726 s walltime
8 threads: 576 s
4 threads: 691 s

Cavity, 1M cells, GAMG/GAMG solving for p/U (foam-extend 3.1):
12 threads: 1044 s walltime
8 threads: 613 s
4 threads: 592 s

The scaling is roughly as bad for the laminar pipe flow case. What is the cause? I'd appreciate any help.

Oh, I forgot: I use Open MPI and start the cases using "mpirun -np num_of_threads foamJob pisoFoam -parallel", which should be correct.
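For reference, the relevant decomposition entries look roughly like this (a sketch, not my exact file; scotch is assumed as the method since it needs no manual split, and the FoamFile header is omitted):

Code:
// system/decomposeParDict (relevant entries only)
numberOfSubdomains  12;     // 12 for the 12-thread run; 8 or 4 likewise
method              scotch; // automatic decomposition, no coefficients needed
After that, the plain documented launch is:

Code:
decomposePar
mpirun -np 12 pisoFoam -parallel > log.pisoFoam 2>&1
(foamJob is essentially a wrapper around the same mpirun call.)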
#2
Senior Member
khedar
Join Date: Oct 2016
Posts: 111
Rep Power: 11
1. Can you share the walltime for 1 thread?
2. Maybe it's because of the virtual machines?
3. Maybe the cache size of your processor is not as large as that of the one quoted in the study.
#3
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 11
1. 1681 s.
2. Probably; I'll try to run some benchmarks on a native Linux machine.
3. The cache size per core is actually the same: 20 MB for 8 cores on the Xeon E5-2687W versus 15 MB for 6 cores on the i7-6800K, i.e. 2.5 MB per core in both.
#4
Senior Member
Hi,

Since you only have 6 cores, you cannot expect any improvement from using more than 6 processes (read the section on hyperthreading in the PDF). The virtualisation may also hurt a bit. I would advise running on 1, 2, 4 and 6 cores. For large enough cases (100k+ cells) I would expect 6 cores to be fastest. However, you have 4 memory channels, which may also mean less-than-linear scaling beyond 4 processes, since 6 cores would then be reaching memory over only 4 channels.

Regards,
Tom
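If it helps, a quick shell sketch of that sweep (the sed edit and log names are assumptions about the case layout; ClockTime is the walltime the solver prints every step):

Code:
#!/bin/sh
# Scaling sweep: serial baseline, then 2, 4 and 6 MPI ranks on the same case.
pisoFoam > log.n1 2>&1                      # 1-core reference run
for n in 2 4 6
do
    # set the subdomain count (assumes a one-line numberOfSubdomains entry)
    sed -i "s/^numberOfSubdomains.*/numberOfSubdomains $n;/" system/decomposeParDict
    decomposePar -force > /dev/null         # re-split the mesh
    mpirun -np $n pisoFoam -parallel > log.n$n 2>&1
done
# print the final walltime of each run
for f in log.n*
do
    printf '%s: ' "$f"
    grep ClockTime "$f" | tail -1
done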
#5
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 11
Thanks for your reply, it helped a lot.

I've read the HT part, but I didn't see any setup info there. I thought you were supposed to use the number of threads, since my CPU is "only" under 50% load when using 6 processes; I figured you'd have to use them all. But you're actually right, and I now get the fastest result using 6 processes.

Still, the speedup isn't quite as good as I hoped: only 3 times faster with 6 times more processes seems bad. I'm going to investigate this on our cluster and tinker a bit with GAMG/PCG, solvers and cell count; the pressure-solver variants I mean are sketched below.
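The two pressure-solver variants, as a sketch with tutorial-style tolerances rather than tuned values:

Code:
// in system/fvSolution, solvers { ... } -- variant 1: GAMG
p
{
    solver          GAMG;
    smoother        GaussSeidel;
    tolerance       1e-06;
    relTol          0.01;
}

// variant 2: PCG with diagonal incomplete-Cholesky preconditioning
p
{
    solver          PCG;
    preconditioner  DIC;
    tolerance       1e-06;
    relTol          0.01;
}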
#6
New Member
Join Date: May 2021
Posts: 9
Rep Power: 6
Hi guys,

I guess hyperthreading in general does not help for simulations; I've actually disabled it on my own desktop PC, which is a 6-core i7 as well.

GAMG also increases computational expense in parallel computing, because agglomeration can extend across the decomposed mesh interfaces; you can read about it in the OpenFOAM user guide. As I understand it, there are now special agglomeration algorithms for GAMG that reduce the additional inter-processor communication, but for me they didn't show any benefit (maybe I applied them the wrong way). My benchmark was very small, though, and consisted of only a few simulations (conducted on an HPC cluster). A sketch of such a setting is below.

Preconditioners are mostly inconsistent in parallel; only the diagonal preconditioner seems to work well. Also, using more cores on a desktop may slow things down, because normal CPU architecture is not specifically designed for parallel simulations, and background processes (the OS and other applications) take some computational capacity that is then unavailable for the simulation.

Hope that helps at least a little.
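For completeness, what I tried looked roughly like this; the processorAgglomerator entry is the knob I mean (a sketch, and whether masterCoarsest is available depends on your OpenFOAM version, so check your user guide):

Code:
// system/fvSolution -- GAMG with processor agglomeration
p
{
    solver                 GAMG;
    smoother               GaussSeidel;
    processorAgglomerator  masterCoarsest;  // gather coarse levels onto
                                            // fewer ranks to cut communication
    tolerance              1e-06;
    relTol                 0.01;
}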
#7
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 11
Yes, I was not really aware of the HT issue at the time, but after almost 5 years I've updated my knowledge a little bit. In retrospect it seems obvious to start only as many parallel processes as there are physical CPU cores.
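For anyone landing here later: on Linux you can count the physical cores (rather than hyperthreads) before choosing -np, for example (a sketch assuming lscpu from util-linux):

Code:
# physical cores = unique (core,socket) pairs reported by lscpu
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
# then launch with exactly that many ranks, e.g. 6 on the i7-6800K:
mpirun -np 6 pisoFoam -parallel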
#8
New Member
Join Date: May 2021
Posts: 9
Rep Power: 6
Yes, that's true, it was an old thread; to be honest, I only saw the date after my post. But I thought I might say something that could be helpful in case anyone else has the same problems, so I left it there. Actually, after 5 years I guess you have much more experience with this than I have.
#9
Member
Join Date: Jun 2016
Posts: 31
Rep Power: 11
It surely is helpful for other people who might stumble on this thread.

As for the experience: maybe, maybe not; you never know who you're dealing with. At the moment I don't care much about scaling and just guesstimate how many cores to use; if it's not the optimal number, so be it. Seeing that I don't deal with large cases too often and don't even use OF anymore, it's not so important.