CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Worth parallelizing ? (http://www.cfd-online.com/Forums/openfoam-solving/87448-worth-parallelizing.html)

alf12 April 20, 2011 09:43

Worth parallelizing ?
 
Hi,

I have tried running pisoFoam both in parallel and on a single processor, and the result is quite disappointing. I carried out the test with a dual-core processor on a 500,000-cell mesh.

I only save 13% of execution time when running in parallel.

So, I know there is overhead due to data transfer, but is there a way to assess whether running in parallel is worth doing (from experience or from a theoretical point of view)?
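One rough theoretical yardstick is Amdahl's law: from a measured speedup you can back out the effective fraction of the run that actually parallelized. A minimal sketch (the 13% time saving on 2 cores reported above is the only input; everything else is generic):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup on p processors when a fraction f
    of the serial runtime is perfectly parallelizable."""
    return 1.0 / ((1.0 - f) + f / p)

def parallel_fraction(speedup, p):
    """Invert Amdahl's law: the effective parallel fraction implied
    by a measured speedup on p processors."""
    return (1.0 / speedup - 1.0) / (1.0 / p - 1.0)

# A 13% saving in wall time on 2 cores means speedup = 1 / (1 - 0.13)
s = 1.0 / (1.0 - 0.13)
f = parallel_fraction(s, 2)
print(f"speedup = {s:.2f}, implied parallel fraction = {f:.2f}")
```

With these numbers the implied parallel fraction comes out around 0.26, i.e. the run behaved as if only a quarter of the work parallelized — which points at a hardware or decomposition bottleneck rather than at the solver itself.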

Thanks

l_r_mcglashan April 20, 2011 09:51

How is your mesh decomposed? You need to minimise the number of faces in the processor boundaries.

alf12 April 20, 2011 10:11

Quote:

How is your mesh decomposed?
I have used decomposePar with the scotch method.
It gives 5000 shared cells (so about 1% of the mesh size).
I have also tried the simple decomposition method, but it gave twice as many shared cells.

alberto April 21, 2011 03:30

I have had very good results on Gigabit clusters, less good on large SMP systems, but still acceptable.

How many cells do you have per processor?

alf12 April 21, 2011 04:13

I have done the test with a 500,000-cell mesh, on a dual-core processor, so I have 250,000 cells per core and 5000 shared cells between the two cores.

gschaider April 26, 2011 05:53

Quote:

Originally Posted by alf12 (Post 304519)
I have done the test with a 500,000-cell mesh, on a dual-core processor, so I have 250,000 cells per core and 5000 shared cells between the two cores.

Just a thought: on a dual-core processor the two cores have to share the memory bandwidth. If one process already saturates that bandwidth (old Intel dual-cores, for example), then you will see hardly any speedup on two cores.

alf12 April 26, 2011 18:49

Quote:

Originally Posted by gschaider (Post 305088)
Just a thought: on a dual-core processor the two cores have to share the memory bandwidth. If one process already saturates that bandwidth (old Intel dual-cores, for example), then you will see hardly any speedup on two cores.

So, I have checked: the computer I used for the test has a single memory channel, which is likely why running the program in parallel is not so efficient on my configuration.

Beyond my initial question, I assume there is an optimum number of processors for each case, depending on mesh size, number of cells shared between processors, processor frequency, memory bandwidth...

gschaider April 27, 2011 05:16

Quote:

Originally Posted by alf12 (Post 305182)
So, I have checked: the computer I used for the test has a single memory channel, which is likely why running the program in parallel is not so efficient on my configuration.

One way to know for sure is to let two serial cases run simultaneously. In an ideal world they would each take the same time as if they were running on their own. In real life they will take longer: that difference is the impact of sharing the memory bandwidth and other parts of the processor.

Bernhard

niklas April 28, 2011 02:53

I have conducted some tests on our new quad-core cluster with InfiniBand.
The case I ran had 1.0M cells and the solver was rhoPimpleFoam.
Code:

#procs   runtime (s)   speedup   efficiency   rel speedup   k cells/core   k cells/cpu
     1         38019         -         1               -            1000          1000
     2         24625       1.5         0.75          1.5             500          1000
     4         15951       2.4         0.6           1.5             265          1000
     8         12090       3.1         0.39          1.3             132           500
    16          5554       6.9         0.43          2.2              65           250
    32          2335      16.3         0.51          2.4              33           125
    64           877      43.4         0.68          2.7              17          62.5
   128           455      83.6         0.65          1.9              ~8          31.3

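The derived columns in the table follow directly from the raw runtimes, so they can be sanity-checked; a minimal sketch (runtimes copied from the table above):

```python
# Raw wall-clock runtimes (s) per processor count, from the table above
runtimes = {1: 38019, 2: 24625, 4: 15951, 8: 12090,
            16: 5554, 32: 2335, 64: 877, 128: 455}

t1 = runtimes[1]
prev_p = None
for p, t in sorted(runtimes.items()):
    speedup = t1 / t                 # vs the serial run
    efficiency = speedup / p         # fraction of ideal linear speedup
    # relative speedup vs the previous, lower core count
    rel = runtimes[prev_p] / t if prev_p is not None else float("nan")
    print(f"{p:4d}  {speedup:6.1f}  {efficiency:5.2f}  {rel:5.2f}")
    prev_p = p
```

Running this reproduces the speedup, efficiency, and relative-speedup columns to the precision shown in the table.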
As you can see, it scales very badly up to 8 CPUs, because you are just shovelling a lot of data over the same internal connectivity. This problem is basically too big to run on a low number of cores; if you choose your problem size optimally, you should be able to produce good scaling numbers even on 1-8 cores.
As you can also see from the relative speedup (how much speedup you gain compared to the previous, lower core count, i.e. from 2 to 4, 4 to 8, 8 to 16, etc.), you get superlinear effects once you reduce the per-CPU size of the problem.
Once the problem size gets down to ~50-70k cells/CPU it starts to scale very well. My rule of thumb really is to keep the problem size around 50k cells/CPU.

I will do a similar comparison later using only 1 to 2 cores per CPU and see what the numbers are. I'm guessing the speedup will be a lot better, so the numbers will look better, but I will be utilizing my hardware very badly.

Quad-cores are not the best hardware for CFD.

alf12 April 28, 2011 10:17

Quote:

Originally Posted by gschaider (Post 305249)
One way to know for sure is to let two serial cases run simultaneously.

So, I have done the test Bernhard suggested. Here are the results:
- Two cases simultaneously: 9926 s (averaged; 9965 s and 9887 s respectively)
- Same case alone: 4695 s

No comment!
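Those timings are worth spelling out: each copy slowed down by more than 2x, so the two concurrent runs actually finished later than running them back to back would have. A minimal sketch of the arithmetic (timings taken from the measurements in this post):

```python
t_alone = 4695.0        # one case run on its own (s)
t_concurrent = 9926.0   # average wall time when two copies run together (s)
t_sequential = 2 * t_alone  # running the two copies back to back instead

# Per-case slowdown from contention, and throughput of concurrent vs sequential
slowdown = t_concurrent / t_alone
gain = t_sequential / t_concurrent
print(f"each case slowed down by {slowdown:.2f}x; "
      f"concurrent vs sequential throughput: {gain:.2f}x")
```

A throughput ratio below 1 means the memory bus is fully saturated by a single process: there is literally nothing left for a second rank to use, which matches the 13% parallel saving seen at the start of the thread.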

Thanks, Niklas, for your post; it's quite interesting.

