star 4.06 memory on linux cluster
I'm trying to run Star 4.06 on a linux cluster with pbs, on 900,000 cells modelling incompressible transient flow. Each node of the cluster has two processors with 4 cores, and 8GB of shared memory. The model is partitioned using metis.
Each processor is an Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
Uname -a gives: Linux 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:14:47 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
The compiler is Absoft 9.0 EP 64 bit.
My question is this:
If I use 8 processes on each of 2 nodes, ie 16 processes in total, each process takes 860Mb virtual memory (mostly data and stack).
If I use 8 processes on each of 4 nodes, ie 32 processes in total, each process takes 820Mb.
How come if I use more processes each doesnt use proportionally less memory? I would have expected the total memory used to stay almost constant. As it stands I can't use many more cells before the nodes run out of memory, and start to swap.
Any advice appreciated. My apologies if I've missed something obvious like an option.
The memory used is not only related to the problem size. As you use more processors, the communication overhead between the processors increases. So you'll not have a linear decrease in the memory usage. At some point using more processors may result in slower performance due to the communication overhead
yes a tradeoff - but not yet?
Yes, I'd agree that eventually there is a trade off, when using more nodes adds a greater communications overhead then the computational benefit they bring.
But I didnt think I'd reached that point yet. I'm finding that I can't run 1,000,000 cells on two nodes each with 8Gb and 8 processors. Other threads indicate that I should be able to do this on a single processor with 2GB memory.
That was my initial thought too. I use similar machines (my Linux machine) shows the same capabilities as your except for a different linux version. And I frequently run about 1 million size meshes on 1 processor.
Just out of curiosity, have you benchmarked your quad-cores? I was advised to go with dual-cores instead of quad-cores because of the inherent performance loss when using all 4 cores (which I confirmed on my head-node with Star-CCM+). What is your "speedup" going from 7 to 8 cores on one of your nodes?
With the changes in memory access with the Nehalem cpus, the impact of the memory bottleneck should become less significant in the future ... it might even be better in the current generation of AMD cpus.
I ran some more tests (mesh 96x99x96=912384 cells), and yes we are reaching a tradeoff:
Nodes x processes per node
1x1=1 (ie serial) uses 1870Mb/process
1x2=2 uses 1340
1x4=4 uses 1060
1x8=8 uses 940
2x4=8 uses 930
2x8=16 uses 860
4x8=32 uses 820
For our work using 32 licences the memory per process doesnt fall much below half the serial memory requirement. So for big jobs we'll have to only partly use nodes, in order to get enough memory on them.
The memory overhead due to parallel working seems surprisingly high, to me at least.
Your model is too small to make your conclusion valid. By 32 cores you only have 28000 cells on each core (that's a very small number). At that size the overhead of all the "halo" cells (the cells that exist at boundaries between two domains) are just not going to decrease any further. If you run a much larger (like an order of magnitude) model, you will see the memory effect you are looking for.
|All times are GMT -4. The time now is 20:37.|