CFD Online Discussion Forums

CFD Online Discussion Forums (
-   OpenFOAM Running, Solving & CFD (
-   -   Concurrent runs slowing each others (

johndeas October 21, 2008 12:40

Hi, I wanted to launch seve

I wanted to launch several runs of one case on a multicore machine at the same time. I hoped, since each process would be completely independent, that computation time of each run won't affect too much the others. To the contrary, I witnessed an increase in the calculation time spent on each timestep according to the following chart:

(Results are not very precise, I used a timer to get them quickly, but the trends are represented). There is a very big increase in computation time, and it seems to occur by levels (e.g. with 3 or 4 cores results are quite the same).

To detect whether the problem is coming from OpenFOAM or not, I tried to see if another will lay the same results. I chose to launch several runs of Scilab, computing a basic matrix inversion. The command was: for i=1:10000, inv(rand(1000,1000));

The following results were obtained :

As one can see, the performance of the runs are relatively unaffected by the others until almost all the cores are occupied by Scilab processes.

I am using a dell station with 2 Intel E5450 processors each consisting of 4 cores cadenced at 3 gHz. The operating system is RHEL 5.

Is my problem coming from OF, should I adjust some parameters ? I hinted that OpenFOAM is making much more access to the RAM than Scilab, and maybe can slow the other processes accesses. Has somebody witnessed the same behaviour with OF ?

Thank you,


johndeas October 21, 2008 12:42

Sorry for the image sizes, did
Sorry for the image sizes, didn't check that !

olesen October 21, 2008 13:16

We have dual-core dual-cpu ser
We have dual-core dual-cpu servers.
With both OpenFOAM and STAR-CD we get better performance if we use a single core from each cpu and split across more machines rather than use all available cpus and cores.
Apparently the bottleneck to memory is much more significant than the additional network traffic by splitting across more separate machines.

With quad-cores, the memory bottleneck will look even worse. I think your best chance is to run each case in parallel and run the cases sequentially.
In this case, you should see how the parallel speed up looks for a single case.

juergen October 21, 2008 13:44

John and Mark, just out of
John and Mark,

just out of curiosity: Which processor architecture do you use? AMD or Intel? Xeon? Opteron? Is NUMA active?

What is your 'uname -a'?

I'm about to build a multiprocessor machine for work with OpenFOAM and I'd like to collect some experience from multi-core users.

Thanks a lot.

Ciao, Juergen

msrinath80 October 21, 2008 14:28

If you search the forum, you'l
If you search the forum, you'll find that many of us have experimented with dual/quad core offerings from AMD and Intel and posted the results here.

markusrehm October 22, 2008 02:35

Hi John, matrix calculation
Hi John,

matrix calculation is computationally very dense which means that memory access is rather small and much of it can be done in CPU cache.

By the way: which solver and problem size did you benchmark?

It depends a lot on the case and size. You should benchmark your cases and decide if it is better to fill up your whole cluster or leave half of the processors empty as Mark mentioned.

Regards Markus.

johndeas October 29, 2008 09:43

First, thank you for your answ
First, thank you for your answers, and sorry for not replying myself as promptly.

@Juergen : uname -a gave me: Linux lambda30 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
Regarding NUMA being enabled or not, I have no idea.

@Mark : "I think your best chance is to run each case in parallel and run the cases sequentially." Doing this, am I not adding communication between process slow down to the existing memory bandwith slodown ?

@Florian : What processors are your using ?

I also had results from another calculation on this machine, performed using Fluent on an 8 million case. The case was run in parallel, and the results are reported below. As well as the various speedups obtained running concurrent OpenFOAM cases at the same time.

The openFoam case is a plane channel with 300000 calculation points. I am running a modified version of icoFoam which allows me to sustain a constant pressure gradient, and compute some statistics on the fly (like what channeloodles did before using controlDict functions). The Fluent case is an 8 million cells rectangular domain with velocity inlet/ pressure outlet, and symmetries on the sides.

One of the results which is astonishing is that Fluent run in parallel scales better that individual small cases in OpenFOAM run simultaneously. As I said, the Scilab results are obtained inversing 1000*1000 matrices. Since cat /proc/cpuinfo gave me a cache size of 6000 kb, I suspect the inversion to be made in cache, explaining the good speedup.

Based on what I have read on this forum, right now, as the datasets used in CFD are often large, memory bandwith is the most limiting factor. Having multiple cores sharing this bandwith through a single Therefore, multiple cores CPU are not suited for those tasks, because they share a front side bus to access the memory. The Opteron architecture, however is not subject to those limitations, as every core has a direct. The processor to memory controller interface is on the processor die.
As my laboratory is willing to invest in a cluster, I guess a good base could be a bunch of multicore Opterons connected using Infiniband.

I don't understand then why so many clusters seem to use Xeon machines ? Have they some kind of improvement which makes their shared bandwith efficient anyway ?

johndeas November 28, 2008 14:54

I posted this one on CFD onlin
I posted this one on CFD online. Not everybody might track posts there and well, as I am mainly doign calculation with OpenFOAM, I value more comments from this forum.

CPU have become much more performant than memory access, which is putting a lot of pressure on the FSB paradigm, created times ago, when memory access were comparatively faster. The CPU cache has been developped to limit the access to the memory, but, due to the large amount of data needed to solve Navier-Stokes equations on a large domain, cache performance become less important as it needs to be frequently filled with data from the RAM. Hypertransport from AMD is a solution, has it removes the FSB and allow the Opteron CPU to be connected directly to memory. However, recent Xeon provide several FSB (one per core) which also cirvumvent the problem of saturation of a single FSB, and might explain why Xeon equip a large percentage of clusters, despite its use of "old" FSB technology.

What is your say on this ?

floooo January 15, 2009 04:27

John Deas -> I use dell workst
John Deas -> I use dell workstation with 2 intel Xenon quad processor and 64GB of memory; the machines are connected togever with a Gbit network:
My speedup is nearly linear.
I can't make a nice plot because I never alone on the machines.

Here are the results obtained by ENEA (national agency for new technology, energy and envionment - Italy) ESCO.pdf

With an infiniband their result are linear.
The study show that the bandwith of the connection becomes a major criterion when the number of cores increase.
For a Gb bandwith the effect of this limitation appear after 20 cores.

Your machines are maybe connected with only a 100M ethernet bandwith.

vijayakumar January 15, 2009 04:44

i need to run a combustion pro
i need to run a combustion problem.... i used XI-foam as solver, while running some errors are coming... i need to know which solver is good for combustion problem.. and few guidelines for solving combustion problem.

All times are GMT -4. The time now is 21:56.