OpenFOAM 1.3 Intel quad-core parallel results

#1 - June 6, 2007, 15:59 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
Hello OF lovers,

I'm back with another parallel study, this time on an Intel machine with 8 cores (i.e. two quad-core CPUs). The machine has around 8 GB of memory, so I had to restrict myself to a vortex-shedding case of about 6.5-7 GB.

Here is the excerpt from cat /proc/cpuinfo:

madhavan@frzserver:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4659.18
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.05
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.07
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.16
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The actual clock speed rises to 2.33 GHz once there is some activity on any of the CPU cores.

Anyway, without further ado, here are the parallel scale-up results:

http://www.ualberta.ca/~madhavan/ope..._frzserver.jpg

As you can see, the scale-up is very good for 2 processors (in fact slightly better than ideal). However, as one moves to 4 and 8 cores, the situation deteriorates rapidly.

STREAM [1] memory-bandwidth benchmarks show that the quad-core has superior memory bandwidth (approx. 2500 MB/s) compared to, for instance, an AMD Opteron 248 (approx. 1600 MB/s). However, if that bandwidth is split among 4 cores, we end up with a per-core bandwidth of 625 MB/s. Even if the Opteron in question were a dual-core, the per-core memory bandwidth would still work out to at least 800 MB/s, which is a definite edge over 625 MB/s. Interestingly, the 4 MB shared L2 cache in the quad-core does not seem to help the scale-up in the 4- or 8-core cases.
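For anyone who wants to reproduce the bandwidth figures, here is a rough sketch of how STREAM is typically built and run (stream.c comes from [1]; the use of taskset and the core numbering are assumptions based on the cpuinfo dump above). Running one pinned copy per core shows how the shared bus gets split:

gcc -O3 stream.c -o stream -lm              # build the benchmark from [1]
./stream                                    # one copy: Triad bandwidth seen by a single core
for c in 0 1 2 3; do taskset -c $c ./stream & done; wait
                                            # four pinned copies: the per-copy Triad figures
                                            # drop as the front-side bus is shared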

I would also like to draw your attention to the red dot (triangle) in the above graph. A quad-core of this kind is basically an MCM (multi-chip module): it is made by packaging two dual-core dies together. That red triangle is the 2-core scale-up result obtained when I placed both processes on the same physical CPU (i.e. on one of the dual-core dies). If I let one process use one dual-core die and the other process use the other die on the same package, I got a slightly better result, but still well below the ideal curve.
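If you want to check the package layout on your own machine, the mapping is already in the cpuinfo dump quoted above; a quick way to see it at a glance (just a convenience command, nothing OpenFOAM-specific):

grep -E '^(processor|physical id|core id)' /proc/cpuinfo
# logical CPUs with the same "physical id" sit in the same package;
# in the dump above, CPUs 0,1,4,5 are on package 0 and CPUs 2,3,6,7 on package 1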

The bottom line, based on the above results, is that Intel quad-cores are not really good news for this kind of workload. Of course, this is based on just one kind of study. I would appreciate more feedback from others with access to quad-core systems (both Intel and AMD).

[1] http://www.cs.virginia.edu/stream/

#2 - June 6, 2007, 16:03 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
I forgot to mention that the good (slightly better than ideal) 2-core result you see in the above graph was obtained by scheduling the two icoFoam processes on different physical CPUs. In other words, each icoFoam instance enjoyed the full 2500 MB/s.
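For reference, this kind of placement can also be forced explicitly with taskset; the following is only an illustrative sketch (not necessarily how it was done here), pinning the two ranks after launch since, with LAM, they are typically spawned by the lamd daemons rather than as children of mpirun:

pids=$(pgrep icoFoam)                       # the two running ranks
taskset -pc 0 $(echo "$pids" | sed -n 1p)   # CPU 0 is on package 0 (see cpuinfo above)
taskset -pc 2 $(echo "$pids" | sed -n 2p)   # CPU 2 is on package 1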

#3 - June 7, 2007, 05:12 - Nicolas Coste [nikos]
Did you run a test with a smaller case, such that it could fit in the cache?
Is your kernel tuned?

#4 - June 7, 2007, 05:44 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
Nope, did not do that. The kernel is certainly tuned, to my knowledge. If you can be more specific, I can furnish more information.

The important thing I want to know is whether anyone has found a different trend on an Intel quad-core.

#5 - June 7, 2007, 06:14 - Nicolas Coste [nikos]
If I remember correctly, on an SMP computer the kernel.shmmax value (which can be tuned in /etc/sysctl.conf) can affect the performance of the MPI communicator; see the LAM/MPI user guide.

Too high a value can also hurt performance.

Correct me if I'm wrong, but with MPICH, for example, you can set something like P4_GLOBMEMSIZE.

ipcs -a can give you more info.
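For anyone who wants to try this, a sketch of the usual procedure; the sizes are arbitrary examples, not recommendations:

# append to /etc/sysctl.conf so the values survive a reboot:
#   kernel.shmmax = 268435456     (max bytes per shared-memory segment, 256 MB)
#   kernel.shmall = 2097152       (total shared memory, in pages)
sysctl -w kernel.shmmax=268435456   # apply immediately without rebooting
ipcs -lm                            # show the limits now in force
ipcs -a                             # list existing IPC resources, as suggested above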

#6 - June 7, 2007, 09:59 - Richard Morgans [rmorgans]
We are (hopefully, if it runs tonight!) benchmarking a quad-core (Q6600) with an interacting bluff-body vortex-shedding case.

We're interested in this discussion and will keep you posted.

Rick

#7 - June 7, 2007, 10:17 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
Thanks. Looking forward to your results.

@Nicolas: I am aware of the P4_GLOBMEMSIZE environment variable for MPICH. With LAM, however, I have never had any complaints about a shared-memory shortage. The default value of 33554432 (32 MB) is what comes up when I issue cat /proc/sys/kernel/shmmax.

Do you have any pointers?
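For anyone who does run MPICH's ch_p4 device, that variable is simply set in the environment before launching; the size and case name below are placeholders only:

export P4_GLOBMEMSIZE=67108864              # shared-memory pool in bytes; 64 MB is arbitrary
mpirun -np 2 icoFoam . myCase -parallel     # "myCase" is a placeholder case name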

#8 - June 7, 2007, 11:45 - Nicolas Coste [nikos]
Which communicator do you use? If LAM, you can use the rpi sysv module to use shared memory on the same node, something like mpirun -ssi rpi sysv; see the sketch below.
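A hedged sketch of the full command line (the hostfile and case names are placeholders; see the LAM/MPI user guide for the SSI options):

lamboot hostfile                                        # boot the LAM daemons
mpirun -ssi rpi sysv -np 4 icoFoam . myCase -parallel   # select the shared-memory (sysv) RPI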

Even if the application is different, on large Oracle-type clusters they increase shmmax to 2 GB.

Increasing it to 128 or 256 MB should do the trick; too high a value can decrease performance.

The kernel should be tuned for parallel computation on both sides: memory and network.

By default, the Linux kernel values are not appropriate for parallel computation (SMP and distributed) or NFS services. If you want the best performance, you must adjust your kernel for your computation.

In summary, small cases need practically no adjustment (except for the network), but as soon as you are out of the memory cache...

:->

#9 - June 7, 2007, 23:25 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
Thanks for your input. Just one observation: wall-clock times are better indicators for parallel speedup estimates. Unless you actually meant wall-clock times?

Based on what I see, you have a speedup close to 2.3 for 4 processes, so I guess our results agree.
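In case it helps anyone reproducing these numbers, speedup here means a ratio of wall-clock times, which can be measured along these lines (GNU time is assumed; the case name is a placeholder):

/usr/bin/time -f "wall %e s" icoFoam . myCase                         # serial reference, T1
/usr/bin/time -f "wall %e s" mpirun -np 4 icoFoam . myCase -parallel  # 4-way run, T4
# speedup(4) = T1 / T4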

#10 - June 7, 2007, 23:32 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
@Nicolas

Thanks for the pointers. However, I don't see the need to tweak any of those settings. Based on my Google search, the majority of those who change shmmax et al. are working with databases (e.g. Oracle). I searched the PETSc and LAM/MPI mailing lists to see if anyone has reported a significant improvement from tweaking those settings, and I could not find anything. If you have experienced benefits, can you state specifically the nature of the parallel case you tested (how many cells, etc.) and the difference in speedup between the default setting of 32 MB and 128 or 256 MB?

Once again thanks for your help!

#11 - June 8, 2007, 05:43 - Nicolas Coste [nikos]
OK, I'm too busy to do tests on our cluster and, unfortunately, too busy to come to the workshop...

An interesting benchmark would be to go through Ethernet... I mean using multiple Ethernet interfaces on the same host with OpenMPI, because OpenMPI can use multiple Ethernet devices on the same host.

LAM doesn't like this (because of lamd)...
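A hedged sketch of what that would look like with OpenMPI (the interface names are only examples):

mpirun --mca btl_tcp_if_include eth0,eth1 -np 4 icoFoam . myCase -parallel
# the TCP BTL then stripes messages across both interfaces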

When I have more time, I will try it.

Have a nice day.

PS: when you run benchmarks, please state the OS, compiler, and communicator.

#12 - June 12, 2007, 09:14 - Srinath Madhavan (a.k.a pUl|) [msrinath80]
Update: I changed SHMMAX and SHMALL in /etc/sysctl.conf and rebooted. /proc/sys/kernel/shm* shows the new values I set. However, the speedup was again the same 1.2 when using two cores on the same physical CPU. Apparently it has no effect on parallel speedup.

OS: RHEL 4.x (Scientific Linux 4.1)
Compiler: OpenFOAM stock supplied
Communicator: LAM

#13 - June 14, 2007, 09:10 - Nicolas Coste [nikos]
Hi,
Following this discussion I have done some tests on a small 2D case (100,000 cells) using OpenMPI.

First, compiling OpenFOAM 1.4 with -march=nocona on an EM64T processor reduces the overall CPU time by about 10% for a serial case and 5% for a parallel case.
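A sketch of where such a flag would typically go (the rules path is an example; the exact directory depends on the platform and compiler of the install):

grep c++OPT $WM_DIR/rules/linux64Gcc/c++Opt     # typically shows: c++OPT = -O3
# change it to, e.g.:  c++OPT = -O3 -march=nocona
# then rebuild the libraries and solvers (e.g. via the top-level Allwmake script)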

Second, I see the same trend in the speedup, i.e.
2 cores: 1.46
4 cores: 2.06
Warning: since I have only one quad-core CPU, the 4-core result may be polluted by OS activity (CentOS 4).

#14 - February 5, 2008, 06:26 - Eugene de Villiers [eugene]
You can't have dedicated memory channels for each core unless you are prepared to make the CPU package a lot bigger to incorporate the additional pins needed for the 64-bit interfaces. In addition, the FSB-based design does not allow for individual memory channels, which will only appear on Intel chips when CSI is released with Nehalem later this year.

The memory bottleneck is one of the main reasons AMD CPUs with HyperTransport can still be competitive in some applications.

Unfortunately, there isn't much you can do to make things faster other than buying lower-latency memory. You only have around 10 GB/s of memory bandwidth shared between 8 cores.

Bring back Rambus, I say.
