In order to test OpenFOAM's LES performance we
used the channelOodles solver to compute a simple
turbulent channel flow similar to the one
described in the channel395 tutorial but with
Re_tau=180 (64x64x64 cells, 2 blocks and a simple
grading stretching close to the walls).
We tested all available time schemes and
different solvers for pressure (AMG
(NCells=26000) and ICCG with abs. convergence
The timestep was set to 0.005, yielding an average CFL number of 1.1 during the
Running OpenFOAM parallel on 4 cpus on a
linux-cluster, we found out that:
Using AMG is much slower than ICCG. Is that right, or are
there any options other than NCells that have
a significant influence on the performance?
Compared with our in-house Fortran code running the same test case on a single CPU, OpenFOAM still needs almost 10 times more computation time per time step.
We need to mention that our explicit LES code is limited to one-block domains and optimized with the Intel Fortran compiler. Furthermore, it uses an optimized pressure solver (FFT in the periodic directions).
The question arising from this is: Is the difference simply the price we have to pay for a very flexible code infrastructure, or are there built-in options one can play with to get better performance?
Is it worth using another compiler (e.g. Intel), or are the differences only small?
Thanks in advance for your help.
1) Yes the current implementation of AMG has poor performance running in parallel but this can be improved by choosing a largish number of cells at the coarsest level and could be improved further if the coarsest level were solved using a direct solver on the master-node. However, for channel LES cases we do not find that the AMG offers any performance improvement over ICCG.
2) I would expect an implicit p-U code to be significantly slower than an optimised explicit FFT code, but I am surprised it's a factor of 10; I would expect 4 or 5.
2.1) You are probably finding that the cost of the pressure solution is dominating the computation. How many PISO correctors are you doing? How many iterations is the pressure solver doing?
Is your code running compressible or incompressible? If the case is compressible you will find a significant gain in performance if you also run the implicit code compressible because the diagonal dominance of the pressure equation improves convergence enormously.
2.2) You might get speed improvements with other compilers, although I have not found the Intel compiler better than gcc. I have tried the KAI compiler in the past and got a reasonable improvement in performance; perhaps the PathScale or Portland compilers would be even better, but I haven't tried them yet.
On point 2): is the in-house Fortran code structured or unstructured? If it's limited to 1-block domains, that suggests that the addressing might be significantly simpler, which might give a further speedup.
Additional notes to:
2.1) 2 PISO correctors; the first corrector step needs 170 iterations and the second 150.
Our code is incompressible.
What is the interconnect on your cluster? If it is just fast ethernet or gigabit ethernet, I would expect this to massively decrease the parallel performance of channelOodles' implicit solver. Even if you had infiniband or a 4-way opteron, running only 260000 cells over 4 processors is not going to get you near 100% efficiency (I get ~85% for 500k cells on a quad opteron). Please run the case on a single processor and compare that to your 4-way timings.
Explicit codes have much less of a comms overhead. Combine a 50% efficiency with Henry's factor of 4-5 and you have your 10x slowdown.
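The arithmetic behind that estimate can be sketched as follows. This is only an illustration of the reasoning in the post: the function name `combined_slowdown` is made up here, and the model simply assumes that the algorithmic factor and the parallel efficiency multiply.

```python
def combined_slowdown(algorithmic_factor, parallel_efficiency):
    """Slowdown when an algorithmic cost factor is combined with a
    parallel efficiency given as a fraction (0.5 means 50%)."""
    return algorithmic_factor / parallel_efficiency


# Factor of 4-5 for implicit vs. explicit FFT code, at 50% parallel efficiency.
for factor in (4.0, 5.0):
    print(f"factor {factor:.0f} at 50% efficiency -> {combined_slowdown(factor, 0.5):.0f}x")
```

With these inputs the combined slowdown lands between 8x and 10x, which is in line with the factor reported in the original question.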
With 170 iterations in the first corrector and 150 in the second, it sounds like PISO is not converging very well, which is probably due to your Courant number being quite high. Are these numbers from early in the run or after the flow has developed?
I need to correct my previous posting: the
number of iterations in the first step was around
140 and in the second around 50. This is for the
Hi, I did a few parallel calculations to check the capabilities of OpenFOAM. For this I used two 3.2 GHz Pentiums connected via Gigabit Ethernet.
1) Amongst others I ran the OpenFOAM tutorial "channelOodles". I just decomposed it into 2 parts and didn't make any other changes (the mesh consists of 60´000 cells). Comparing the decomposed case (running on two processors) to the undecomposed case (one processor), I got a speedup of about 1.5.
Is that a realistic value?
2) When doing the same test with other cases and solvers I got very different results. Not surprisingly, when the number of cells is lower, the speedup is lower. Elsewhere on this forum I read about a speedup of 1.3 when running the "cavity" tutorial on a network with 2 processors. Is that really realistic, given that the cavity tutorial consists of just 400 cells?
3) Could anyone give me a rough estimate of the speedup I can expect, depending on the mesh size and the number of processors I use?
What values of the parallel running control parameters did you choose and what difference in performance did you get when you changed them?
Hello Henry, I'm not quite sure what you mean by running control parameters. I left all parameters unchanged compared to the original tutorial case.
The simulation ran at a Courant number of about 0.3.
I just did a simple decomposition of the "channelOodles" case into 2 equal parts, where decomposition in the wall-normal direction resulted in the highest speedup (1.5).
Take a look at this thread:
Also, are your machines connected via a cross-over cable or a switch? If a switch, is it a good one?
Thanks Henry, I'll try playing with the parameters and check their influence.
The machines are connected by a switch and I was told it is a good one.
Maybe you could tell me what you think of the speedup of 1.5 for "channelOodles" (60´000 cells) on 2 processors. Is it a good or a bad value?
I would expect better speed-up than that but because the case has two sets of cyclic planes you can end up with quite a large communications overhead if you split the domain between either pair. I suggest you decompose the case by splitting between the walls, i.e. in the y-direction if I remember correctly. Also I expect that you will get better speed-up by using floatTransfer.
I did some calculations. The results are not very satisfying. Maybe you could tell me what you think about them.
I ran "channelOodles" with different numbers of mesh cells. I did each calculation both on 1 processor (not decomposed) and on 2 processors (decomposed).
For the decomposed cases I split the mesh between the walls. The following table shows the speedup of going from the 1-processor case to the 2-processor case.
60´000 1.50 ("original" tutorial case)
I did all parallel calculations both with and without "floatTransfer". The results DID NOT change.
I did the mesh refinement by adding points in the spanwise direction. This means that the number of cells on the processor interfaces is also increased. But I did one calculation of the 120´000 case with a refinement in the wall-normal direction (the number of cells on the interface is not changed by the mesh refinement) and, to my surprise, I got the same simulation times as above for both the 1- and the 2-processor run.
I get a marked improvement when using floatTransfer and I am surprised you don't. It appears your case is limited by the global sums (i.e. latency) rather than bandwidth; otherwise you would have seen a difference when using floatTransfer and when changing the refinement direction. I think you should run some tests on your network performance to see where the bottleneck is. It might also be interesting if you could run the tests with a cross-over cable instead of the switch.
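The latency-versus-bandwidth argument can be illustrated with a very rough cost model. This is a sketch under stated assumptions, not a measurement: communication time is taken as (number of messages x latency) + (payload / bandwidth), floatTransfer is modelled simply as halving the payload, and all numbers below are hypothetical values chosen only to show the two regimes.

```python
def comm_time(n_messages, latency_s, payload_bytes, bandwidth_bytes_per_s):
    """Simple model: per-message latency plus payload transfer time."""
    return n_messages * latency_s + payload_bytes / bandwidth_bytes_per_s


def float_transfer_gain(n_messages, latency_s, payload_bytes, bandwidth_bytes_per_s):
    """Ratio of comm time with full payload to comm time with half the payload
    (assumption: sending floats instead of doubles halves the data volume)."""
    t_double = comm_time(n_messages, latency_s, payload_bytes, bandwidth_bytes_per_s)
    t_float = comm_time(n_messages, latency_s, payload_bytes / 2, bandwidth_bytes_per_s)
    return t_double / t_float


# Hypothetical latency-bound case: many small messages (e.g. global sums).
gain_latency_bound = float_transfer_gain(10_000, 100e-6, 1e6, 100e6)

# Hypothetical bandwidth-bound case: few messages, large payload.
gain_bandwidth_bound = float_transfer_gain(100, 100e-6, 1e8, 100e6)

print(f"latency-bound gain:   {gain_latency_bound:.3f}")
print(f"bandwidth-bound gain: {gain_bandwidth_bound:.3f}")
```

In the latency-bound regime halving the payload changes the total time by well under 1%, which would be consistent with seeing no difference from floatTransfer; in the bandwidth-bound regime the gain approaches a factor of 2.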
Thanks Henry for all the information.
Next I will do some network checks and hope to find the bottleneck.
Could anyone tell me what a good speedup for my problem would be, so that I have some orientation? Or does anyone have experience with how speedup depends on problem size?
With only two processors, you should be getting very near 100% speedup, no less than 95%.
I would agree if the two processors were on a shared-memory or NUMA machine, but they communicate across a Gigabit switch, in which case I would estimate the speedup will be less than 90%.
Thanks Eugene, thanks Henry
I assume that by 100% speedup you mean half the calculation time of a single-processor run?
I made up some quick numbers for two LES channels. In all cases, float transfer is on, the rest is stock.
1. P4 3.0GHz Gigabit cluster (P4)
2. Opteron 2x246 (O2)
For the 60k stock channel run I get the following timings:
P4 single: 137s
P4 two: 87s
O2 single: 118s
O2 two: 64s
P4 parallel x2 cpu efficiency: 79.3%
O2 parallel x2 cpu efficiency: 92.2%
These numbers are misleading though. A 60k mesh with 1200 communicating faces is quite heavy on the comms. I therefore made a 480k mesh and re-ran the test on the P4s. This time the picture is a lot different:
P4 parallel x2 cpu efficiency: 96.7%
That's very close to 100% speedup. As you can see, the question of parallel efficiency is not straightforward, and any code that claims it can consistently provide this performance is doing something ... well, let's just say "special" and leave it at that.
A quick additional stat, the cell->comm face ratio for the 60k case is 50:1, while the same stat for the 480k case is 100:1. Additionally, there might be issues unrelated to comms performance (like cache size) that can also influence the calculation times, skewing the scaling results.
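The two ratios quoted above are consistent with simple surface-to-volume scaling. The sketch below assumes, purely for illustration, that the 480k mesh is the 60k mesh refined uniformly by a factor of 2 in every direction (the post does not say how the refinement was done); the helper `refine_uniform` is a made-up name.

```python
def refine_uniform(cells, interface_faces, factor):
    """Uniform refinement by `factor` in all three directions:
    cell count scales with factor**3, faces on a planar processor
    interface scale with factor**2."""
    return cells * factor**3, interface_faces * factor**2


cells, faces = 60_000, 1_200           # 60k case with 1200 communicating faces
coarse_ratio = cells / faces           # cells per communicating face

fine_cells, fine_faces = refine_uniform(cells, faces, 2)
fine_ratio = fine_cells / fine_faces

print(f"coarse: {cells} cells, {faces} faces, ratio {coarse_ratio:.0f}:1")
print(f"fine:   {fine_cells} cells, {fine_faces} faces, ratio {fine_ratio:.0f}:1")
```

Under that assumption the cell count grows 8x while the interface grows only 4x, so the computation-to-communication ratio doubles from 50:1 to 100:1.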
All-in-all a less than trivial question.
Note: cpu efficiency calculated as (0.5*1cpu time)/(2cpu time)*100
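As a quick check, the efficiency formula in the note can be applied to the 60k timings listed above. This is a minimal sketch; the function names are illustrative only, and the formula is generalised from 2 processors to `n_procs` as an assumption.

```python
def speedup(t_serial, t_parallel):
    """Speedup of the parallel run relative to the single-processor run."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Efficiency in percent: (t_serial / n_procs) / t_parallel * 100.
    For n_procs = 2 this is (0.5 * 1cpu time) / (2cpu time) * 100."""
    return (t_serial / n_procs) / t_parallel * 100.0


# Timings in seconds for the 60k stock channel run: (single, two processors).
timings = {"P4": (137.0, 87.0), "O2": (118.0, 64.0)}

for name, (t1, t2) in timings.items():
    print(f"{name}: speedup {speedup(t1, t2):.2f}, "
          f"efficiency {parallel_efficiency(t1, t2, 2):.1f}%")
```

For the Opteron timings this reproduces the quoted 92.2%. For the P4 timings as listed it gives about 78.7%, slightly below the quoted 79.3%, presumably because the listed times are rounded. By the same formula, a speedup of 1.5 on 2 processors corresponds to 75% efficiency.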