CFD Online Discussion Forums

johnmck April 20, 2009 11:27

star 4.06 memory on linux cluster
I'm trying to run Star 4.06 on a Linux cluster with PBS, modelling incompressible transient flow on 900,000 cells. Each node of the cluster has two quad-core processors and 8GB of shared memory. The model is partitioned using METIS.

Each processor is an Intel(R) Xeon(R) CPU E5430 @ 2.66GHz

uname -a gives: Linux 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:14:47 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

The compiler is Absoft 9.0 EP 64 bit.

My question is this:

If I use 8 processes on each of 2 nodes, i.e. 16 processes in total, each process takes 860MB of virtual memory (mostly data and stack).
If I use 8 processes on each of 4 nodes, i.e. 32 processes in total, each process takes 820MB.

Why doesn't each process use proportionally less memory when I use more processes? I would have expected the total memory used to stay almost constant. As it stands I can't use many more cells before the nodes run out of memory and start to swap.
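
To make the figures above concrete, here is a minimal arithmetic sketch. It assumes that 8GB per node means 8 * 1024 MB and that the reported virtual size is a fair stand-in for what each process really occupies (resident memory may be lower). The names are illustrative only.

```python
# Figures quoted in the post; 8GB per node is taken as 8 * 1024 MB (assumption).
NODE_MEMORY_MB = 8 * 1024
PROCS_PER_NODE = 8

# total processes -> MB of virtual memory per process, as reported above
runs = {16: 860, 32: 820}


def node_usage_mb(mb_per_process, procs_per_node=PROCS_PER_NODE):
    """Memory claimed on one node when every core runs one process."""
    return mb_per_process * procs_per_node


for total_procs, mb in runs.items():
    per_node = node_usage_mb(mb)
    total = mb * total_procs
    share = 100 * per_node / NODE_MEMORY_MB
    print(f"{total_procs} processes: {per_node} MB per node "
          f"({share:.0f}% of the node), {total} MB over the whole job")
```

Doubling the process count nearly doubles the total memory of the job, and each fully loaded node stays close to its 8GB limit in both cases.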

Any advice appreciated. My apologies if I've missed something obvious like an option.
John Mck.

vishyaroon April 20, 2009 11:49

The memory used is not only related to the problem size. As you use more processors, the communication overhead between the processors increases, so you will not see a linear decrease in memory usage. At some point, using more processors may even result in slower performance due to the communication overhead.

johnmck April 20, 2009 15:17

yes a tradeoff - but not yet?
Yes, I'd agree that eventually there is a trade-off, when using more nodes adds a greater communications overhead than the computational benefit they bring.

But I didn't think I'd reached that point yet. I'm finding that I can't run 1,000,000 cells on two nodes, each with 8GB and 8 processors. Other threads indicate that I should be able to do this on a single processor with 2GB of memory.

Any ideas?


vishyaroon April 20, 2009 16:25

That was my initial thought too. I use similar machines (my Linux machine shows the same capabilities as yours, except for a different Linux version), and I frequently run meshes of about 1 million cells on 1 processor.

f-w April 20, 2009 18:15


Just out of curiosity, have you benchmarked your quad-cores? I was advised to go with dual-cores instead of quad-cores because of the inherent performance loss when using all 4 cores (which I confirmed on my head-node with Star-CCM+). What is your "speedup" going from 7 to 8 cores on one of your nodes?


olesen April 21, 2009 02:48


Originally Posted by f-w (Post 213563)
I was advised to go with dual-cores instead of quad-cores because of the inherent performance loss when using all 4 cores

I don't think the issue is dual vs. quad core per se, but rather the bottleneck in accessing the memory. We have several dual-CPU/quad-core machines in our cluster and found that using a single process per CPU gave us about 30-35% better performance than using all of the cores (no swapping occurred). In _our_ testcase, the memory bottleneck was worse than the network overhead incurred by spreading the job over more machines. As always, do not trust anybody's benchmark, but benchmark with your own problems.

With the changes in memory access in the Nehalem CPUs, the impact of the memory bottleneck should become less significant in the future ... it might even be better in the current generation of AMD CPUs.

johnmck April 21, 2009 07:58

More Results
I ran some more tests (mesh 96x99x96=912384 cells), and yes, we are reaching a trade-off:

Nodes x processes per node = total processes: memory per process (MB)
1x1=1 (i.e. serial): 1870
1x2=2: 1340
1x4=4: 1060
1x8=8: 940
2x4=8: 930
2x8=16: 860
4x8=32: 820

For our work using 32 licences, the memory per process doesn't fall much below half the serial memory requirement. So for big jobs we'll have to use the nodes only partly, in order to have enough memory on each of them.

The memory overhead due to parallel working seems surprisingly high, to me at least.
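
One rough way to read the measurements above is to assume each process needs a fixed amount of memory plus an equal share of a part that divides across processes, i.e. mem(p) = fixed + divisible / p. This model is an assumption made for illustration, not something Star reports; the sketch below fits it to the measured figures by ordinary least squares in plain Python.

```python
# (total processes, MB per process) as measured in the runs above
measurements = [(1, 1870), (2, 1340), (4, 1060), (8, 940),
                (8, 930), (16, 860), (32, 820)]


def fit_fixed_plus_share(data):
    """Least-squares fit of mem = fixed + divisible / p (linear in x = 1/p)."""
    xs = [1.0 / p for p, _ in data]
    ys = [float(m) for _, m in data]
    n = len(data)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    divisible = sxy / sxx
    fixed = mean_y - divisible * mean_x
    return fixed, divisible


fixed, divisible = fit_fixed_plus_share(measurements)
print(f"fixed per process: {fixed:.0f} MB, divisible part: {divisible:.0f} MB")

for p, measured in measurements:
    predicted = fixed + divisible / p
    print(f"p={p:2d}  measured={measured:4d} MB  model={predicted:6.0f} MB")
```

Under this assumed model the fit puts roughly 800MB in the part that does not shrink with the number of processes, which is consistent with the per-process memory levelling off near 820MB at 32 processes.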

Many Thanks
John mck

TMG April 22, 2009 13:02

Your model is too small to make your conclusion valid. At 32 cores you have only about 28,500 cells on each core (that's a very small number). At that size the overhead of all the "halo" cells (the cells that exist at the boundaries between two domains) is just not going to decrease any further. If you run a much larger model (say, an order of magnitude larger), you will see the memory effect you are looking for.
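
The halo argument can be illustrated with a very idealized sketch: treat each partition as a cube of cells wrapped in a single layer of halo cells. Both the cubic shape and the single layer are assumptions for illustration only; real METIS partitions are irregular, and the number of halo layers the solver keeps is not stated in this thread.

```python
def halo_fraction(cells_per_partition, halo_layers=1):
    """Halo cells per owned cell for an idealized cubic partition."""
    side = cells_per_partition ** (1.0 / 3.0)
    with_halo = (side + 2 * halo_layers) ** 3
    return (with_halo - cells_per_partition) / cells_per_partition


TOTAL_CELLS = 96 * 99 * 96  # mesh size quoted earlier in the thread

for total in (TOTAL_CELLS, 10 * TOTAL_CELLS):
    for procs in (8, 32):
        per_part = total / procs
        print(f"{total:>8d} cells on {procs:2d} processes: "
              f"{per_part:9.0f} cells each, "
              f"halo/owned = {halo_fraction(per_part):.2f}")
```

In this toy model the halo fraction grows as partitions get smaller and shrinks again when the mesh is made an order of magnitude larger, which is the trend described above.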

All times are GMT -4.