CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Worth parallelizing ? (http://www.cfd-online.com/Forums/openfoam-solving/87448-worth-parallelizing.html)

alf12 April 20, 2011 09:43

Worth parallelizing ?
 
Hi,

I have tried running pisoFoam both in parallel and on a single processor, and the result is quite disappointing. I carried out the test with a dual-core processor on a 500,000-cell mesh.

I only save 13% of execution time when running in parallel.

So, I know there is overhead due to data transfer, but is there a way to assess whether running in parallel is worth doing (from experience or from a theoretical point of view)?
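One rough theoretical yardstick is Amdahl's law: from a measured speedup you can back out the effective fraction of the run that actually parallelized. A minimal sketch (the 13% time saving on 2 cores reported above is the only input; everything else is generic):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speedup on p processors when a fraction f
    of the serial runtime is perfectly parallelizable."""
    return 1.0 / ((1.0 - f) + f / p)

def parallel_fraction(speedup, p):
    """Invert Amdahl's law: the effective parallel fraction implied
    by a measured speedup on p processors."""
    return (1.0 / speedup - 1.0) / (1.0 / p - 1.0)

# A 13% saving in wall time on 2 cores means speedup = 1 / (1 - 0.13)
s = 1.0 / (1.0 - 0.13)
f = parallel_fraction(s, 2)
print(f"speedup = {s:.2f}, implied parallel fraction = {f:.2f}")
```

With these numbers the implied parallel fraction comes out around 0.26, i.e. the run behaved as if only a quarter of the work parallelized — which points at a hardware or decomposition bottleneck rather than at the solver itself.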

Thanks

l_r_mcglashan April 20, 2011 09:51

How is your mesh decomposed? You need to minimise the number of faces in the processor boundaries.

alf12 April 20, 2011 10:11

Quote:

How is your mesh decomposed?
I have used decomposePar with the scotch method.
It gives 5000 shared cells (so about 1% of the mesh size).
I have also tried the simple decomposition method, but it gave twice as many shared cells.

alberto April 21, 2011 03:30

I have had very good results on Gigabit clusters, less good on large SMP systems, but still acceptable.

How many cells do you have per processor?

alf12 April 21, 2011 04:13

I have done the test with a 500,000-cell mesh, on a dual-core processor, so I have 250,000 cells per core and 5000 shared cells between the two cores.

gschaider April 26, 2011 05:53

Quote:

Originally Posted by alf12 (Post 304519)
I have done the test with a 500,000-cell mesh, on a dual-core processor, so I have 250,000 cells per core and 5000 shared cells between the two cores.

Just a thought: on a dual-core processor the two cores have to share the memory bandwidth. If one process already saturates that bandwidth (old Intel dual-cores, for example), then you will see hardly any speedup on two cores.

alf12 April 26, 2011 18:49

Quote:

Originally Posted by gschaider (Post 305088)
Just a thought: on a dual-core processor the two cores have to share the memory bandwidth. If one process already saturates that bandwidth (old Intel dual-cores, for example), then you will see hardly any speedup on two cores.

So, I have checked: the computer I used for the test has a single memory channel, which is likely why running the program in parallel is not so efficient on my configuration.

Beyond my initial question, I assume there is an optimum number of processors for each case, depending on mesh size, number of cells shared between processors, processor frequency, memory bandwidth...

gschaider April 27, 2011 05:16

Quote:

Originally Posted by alf12 (Post 305182)
So, I have checked: the computer I used for the test has a single memory channel, which is likely why running the program in parallel is not so efficient on my configuration.

One way to know for sure is to let two serial cases run simultaneously. In an ideal world they would each take the same time as if they were running on their own. In real life they will take longer: that difference is the impact of sharing the memory bandwidth and other parts of the processor.

Bernhard

niklas April 28, 2011 02:53

I have conducted some tests on our new quad-core cluster with InfiniBand.
The case I ran had 1.0M cells and the solver was rhoPimpleFoam.
Code:

#procs   runtime (s)   speedup   efficiency   rel speedup   k cells/core   k cells/cpu
     1         38019         -         1               -            1000          1000
     2         24625       1.5         0.75          1.5             500          1000
     4         15951       2.4         0.6           1.5             265          1000
     8         12090       3.1         0.39          1.3             132           500
    16          5554       6.9         0.43          2.2              65           250
    32          2335      16.3         0.51          2.4              33           125
    64           877      43.4         0.68          2.7              17          62.5
   128           455      83.6         0.65          1.9              ~8          31.3

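The derived columns in the table follow directly from the raw runtimes, so they can be sanity-checked; a minimal sketch (runtimes copied from the table above):

```python
# Raw wall-clock runtimes (s) per processor count, from the table above
runtimes = {1: 38019, 2: 24625, 4: 15951, 8: 12090,
            16: 5554, 32: 2335, 64: 877, 128: 455}

t1 = runtimes[1]
prev_p = None
for p, t in sorted(runtimes.items()):
    speedup = t1 / t                 # vs the serial run
    efficiency = speedup / p         # fraction of ideal linear speedup
    # relative speedup vs the previous, lower core count
    rel = runtimes[prev_p] / t if prev_p is not None else float("nan")
    print(f"{p:4d}  {speedup:6.1f}  {efficiency:5.2f}  {rel:5.2f}")
    prev_p = p
```

Running this reproduces the speedup, efficiency, and relative-speedup columns to the precision shown in the table.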
As you can see, it scales very badly up to 8 CPUs, because you are just shovelling a lot of data over the same internal connectivity. This problem is basically too big to run on a low number of cores; if you choose your problem size optimally, you should be able to produce good scaling numbers even on 1-8 cores.
As you can also see from the relative speedup (how much speedup you gain compared to the previous, lower core count, i.e. from 2 to 4, 4 to 8, 8 to 16, etc.), you get superlinear effects once you reduce the per-CPU size of the problem.
Once the problem size gets down to ~50-70k cells/CPU it starts to scale very well. My rule of thumb really is to keep the problem size around 50k cells/CPU.

I will do a similar comparison later using only 1 to 2 cores per CPU and see what the numbers are. I'm guessing the speedup will be a lot better, so the numbers will look better, but I will be utilizing my hardware very badly.

Quad-cores are not the best hardware for CFD.

alf12 April 28, 2011 10:17

Quote:

Originally Posted by gschaider (Post 305249)
One way to know for sure is to let two serial cases run simultaneously.

So, I have done the test Bernhard suggested. Here are the results:
- Two cases simultaneously: 9926 s (averaged; 9965 s and 9887 s respectively)
- Same case alone: 4695 s

No comment!
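Those timings are worth spelling out: each copy slowed down by more than 2x, so the two concurrent runs actually finished later than running them back to back would have. A minimal sketch of the arithmetic (timings taken from the measurements in this post):

```python
t_alone = 4695.0        # one case run on its own (s)
t_concurrent = 9926.0   # average wall time when two copies run together (s)
t_sequential = 2 * t_alone  # running the two copies back to back instead

# Per-case slowdown from contention, and throughput of concurrent vs sequential
slowdown = t_concurrent / t_alone
gain = t_sequential / t_concurrent
print(f"each case slowed down by {slowdown:.2f}x; "
      f"concurrent vs sequential throughput: {gain:.2f}x")
```

A throughput ratio below 1 means the memory bus is fully saturated by a single process: there is literally nothing left for a second rank to use, which matches the 13% parallel saving seen at the start of the thread.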

Thanks, Niklas, for your post; it's quite interesting.

