CFD Online Discussion Forums - OpenFOAM 13 Intel quadcore parallel results

https://www.cfd-online.com/Forums/openfoam-solving/59171-openfoam-13-intel-quadcore-parallel-results.html

Hello OF lovers,

I'm back with another parallel study, this time on an intel machine with 8 cores (i.e. two quad-core CPUs). The machine has around 8 Gigs of memory so I had to restrict myself to a 6.5-7 Gig vortex-shedding case.

Here is the excerpt from cat /proc/cpuinfo:

madhavan@frzserver:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4659.18
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.05
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.07
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.16
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The actual clock speed rises to 2.33 GHz once there is some activity on any of the CPU cores.

Anyways, without further ado, I present the parallel scale-up results below:

http://www.ualberta.ca/~madhavan/ope..._frzserver.jpg

As you can see, the scale-up is very good for 2 processors (in fact slightly better than the ideal case). However, as one moves to 4 and 8 cores, the situation deteriorates rapidly.

STREAM[1] memory bandwidth benchmarks show that the quad-core features superior memory bandwidth (approx 2500 MB/s) compared to an AMD Opteron 248 (approx 1600 MB/s), for instance. However, if that bandwidth were split among 4 cores, we would end up with a per-core bandwidth of 625 MB/s. Even if the Opteron in question were a dual-core, its per-core memory bandwidth would still work out to 800 MB/s, which is definitely an edge over 625 MB/s. Interestingly, the 4 MB shared L2 cache in the quad-core does not seem to influence the scale-up for the 4 or 8-core cases.
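The per-core figures above are simple division, and a small sketch makes the assumption behind them explicit: the bandwidth is taken to split evenly among the cores using it, and cache effects are ignored. The function name and constants here are only illustrative; the numbers are the approximate STREAM figures quoted in the post.

```python
# Approximate STREAM figures quoted in the post, in MB/s.
XEON_QUAD_MBPS = 2500.0
OPTERON_248_MBPS = 1600.0


def per_core_bandwidth(total_mbps, active_cores):
    """Split a bandwidth figure evenly among the cores that share it.

    Assumption: an even split with no cache effects, which is a
    simplification of how real memory controllers behave.
    """
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    return total_mbps / active_cores


if __name__ == "__main__":
    quad = per_core_bandwidth(XEON_QUAD_MBPS, 4)
    # The dual-core Opteron is the hypothetical case from the post.
    dual = per_core_bandwidth(OPTERON_248_MBPS, 2)
    print(f"Xeon quad-core: {quad:.0f} MB/s per core")
    print(f"Hypothetical dual-core Opteron: {dual:.0f} MB/s per core")
```

A real solver will not see exactly these numbers, but the ratio is a reasonable first guess at why a bandwidth-bound case stops scaling.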

I would also like to draw your attention to the red dot (triangle) in the above graph. As a quad-core is basically an MCM (multi-chip module), it is made by placing two dual-core dies in one processor package. That red triangle is the 2-core scale-up result when I placed both processes on the same dual-core die. If I instead placed one process on each of the two dual-core dies of the same processor, I got a slightly better result, but still well below the ideal curve.

The bottom line, based on the above results, is that Intel quad-cores are not really good news. Of course, this is just based on one kind of study. I would appreciate more feedback from others with access to quad-core systems (both Intel and AMD).

[1] http://www.cs.virginia.edu/stream/

I forgot to mention that the good (slightly better than ideal) 2-core result you see in the above graph was obtained while scheduling both icoFoam processes on different physical CPUs. In other words each icoFoam instance enjoyed the full 2500 MB/s.
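To decide which logical processors sit on which physical CPU, the "physical id" field of /proc/cpuinfo can be grouped by "processor" number. The following is only a sketch: the helper name is made up for illustration, and the sample pairs are copied from the listing earlier in this thread. How the processes were actually pinned is not stated in the post.

```python
from collections import defaultdict


def cores_by_package(cpuinfo_text):
    """Map each 'physical id' to the logical processors that report it."""
    packages = defaultdict(list)
    processor = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id" and processor is not None:
            packages[int(value)].append(processor)
    return {pkg: sorted(cpus) for pkg, cpus in packages.items()}


if __name__ == "__main__":
    # (processor, physical id) pairs taken from the cpuinfo listing above.
    pairs = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1)]
    sample = "\n\n".join(
        f"processor : {cpu}\nphysical id : {pkg}" for cpu, pkg in pairs
    )
    print(cores_by_package(sample))  # {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}
```

On the real machine the text would come from reading /proc/cpuinfo. Picking one processor from each list (for example 0 and 2) corresponds to the "different physical CPUs" placement described above.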

Did you run a test with a smaller case, small enough to fit in the cache?
Is your kernel tuned?

Nope... Did not do that. Kernel is certainly tuned to my knowledge. If you can be more specific I could furnish more information.

The important thing I want to know is whether anyone has found a different trend on an intel quad core.

If I remember correctly, on an SMP machine the kernel.shmmax value (which can be tuned in /etc/sysctl.conf) can affect the performance of the MPI communicator ... see the LAM/MPI user guide.

Too high a value can also degrade performance.

Correct me if I'm wrong, but with MPICH, for example, you can set something like P4_GLOBMEMSIZE.

ipcs -a can give you more info.

We are (hopefully if it runs tonight!) benchmarking a quad core (Q6600), with an interacting bluff body vortex shedding case.

We're interested in this discussion and will keep you posted.

Rick

Thanks. Looking forward to your results.

@Nicolas: I am aware of the P4_GLOBMEMSIZE env variable for MPICH. For lam however, I've never had any complaints of shared memory shortage. The default value of 33554432 (32 MB?) is what comes up when I issue cat /proc/sys/kernel/shmmax.

Do you have any pointers?
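As a quick check on the "(32 MB?)" above, shmmax is reported in bytes, so the conversion is a division by 1024 * 1024. This is a minimal sketch with illustrative helper names; the 128 and 256 figures are just the candidate values mentioned elsewhere in this thread, not a recommendation.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def bytes_to_mib(n_bytes):
    """Convert a byte count, as shown in /proc/sys/kernel/shmmax, to MiB."""
    return n_bytes / MIB


def mib_to_bytes(n_mib):
    """Convert MiB to the byte count that kernel.shmmax expects."""
    return n_mib * MIB


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Read the current shmmax value in bytes (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    print(bytes_to_mib(33554432))  # 32.0
    for candidate in (128, 256):
        print(candidate, "MiB =", mib_to_bytes(candidate), "bytes")
```

So 33554432 is exactly 32 MiB, and 128 or 256 MiB would be entered as 134217728 or 268435456.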

Which communicator do you use? If LAM, you can use the sysv RPI module to use shared memory on the same node ... something like mpirun -ssi rpi sysv

Even though the application is different, on large Oracle-type clusters they increase shmmax to 2 GB.

Increasing it to 128 or 256 MB will do the trick ...
Too high a value can decrease performance.

The kernel should be tuned for parallel computation on both sides: memory and network.

By default, Linux kernel values are not appropriate for parallel computation (SMP and distributed) or for NFS services. If you want the best performance, you must adjust your kernel for your computation.

In summary, small cases need practically no adjustment (except for the network), but as soon as you are out of the memory cache ...

:->

Thanks for your input. Just one observation. Wallclock times are better indicators for parallel speedup estimates. Unless you actually meant wallclock times?

But based on what I see you have a speedup close to 2.3 for 4 processes. So I guess our results have agreement.

@Nicolas

Thanks for the pointers. However, I don't see the need to tweak any of those settings. Based on my Google search, the majority who change shmmax et al. are those working with databases (e.g. Oracle). I searched the PETSc and LAM/MPI mailing lists to see if anyone has reported significant improvement by tweaking those settings. I could not find anything. If you have experienced benefits, can you specifically state the nature of the parallel case you tested (how many cells etc.) and what the difference in speedup was when using the default setting of 32 MB versus 128 or 256 MB?

Once again thanks for your help!

OK, I'm too busy to run tests on our cluster and, unfortunately, to come to the workshop ...

An interesting benchmark would be to go over Ethernet ... I mean using multiple Ethernet devices on the same host with Open MPI, because with Open MPI you can have multiple Ethernet devices on the same host.

LAM doesn't like this (because of lamd) ...

When I have more time I will try it.

Have a nice day.

PS: when you post a benchmark, please state the OS, compiler and communicator.

Update: I changed SHMMAX and SHMALL in /etc/sysctl.conf and rebooted. /proc/sys/kernel/shm* displays the new values I set. However, the speedup was again 1.2 when using both cores from the same physical CPU. Apparently the change has no effect on parallel speedup.

OS: RHEL 4.x (Scientific Linux 4.1)
Compiler: OpenFOAM stock supplied
Communicator: LAM

Hi,
Following this discussion, I've run some tests on a small 2D case (100000 cells) using Open MPI.

First, compiling OF 1.4 with -march=nocona on an EM64T processor reduces global CPU time by 10% on a serial case and 5% on a parallel case.

Second, I get the same results concerning speedup, i.e.
2 cores : 1.46
4 cores : 2.06
Warning: since I have only one quad-core CPU, the 4-core result may be polluted by OS activity (CentOS 4).
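For comparing speedup figures like these across machines, it can help to convert them to parallel efficiency (speedup divided by core count). The sketch below also computes the Karp-Flatt metric, which is not something used in this thread; it is brought in here only as one common way to express how far a measured speedup is from ideal. It lumps together everything that does not scale, memory bandwidth contention included, so it should not be read as a true serial fraction.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal speedup achieved: S / p."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


def karp_flatt(speedup, cores):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    if cores < 2:
        raise ValueError("the metric needs at least 2 cores")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    # Speedups reported in the post above (2D case, 100000 cells).
    reported = {2: 1.46, 4: 2.06}
    for cores, speedup in reported.items():
        eff = parallel_efficiency(speedup, cores)
        kf = karp_flatt(speedup, cores)
        print(f"{cores} cores: efficiency {eff:.3f}, Karp-Flatt {kf:.3f}")
```

With the numbers above, efficiency drops from roughly 0.73 on 2 cores to roughly 0.52 on 4.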

You can't have dedicated memory channels for each core unless you are prepared to make the CPU package a lot bigger to incorporate the additional pins needed for 64-bit interfaces. In addition, the FSB-based design does not allow for individual memory channels, which will only appear on Intel chips when CSI is released with Nehalem later this year.

The memory bottleneck is one of the main reasons AMD CPUs with HyperTransport can still be competitive in some applications.

Unfortunately, there isn't much you can do other than buying lower-latency memory to make things faster. You only have around 10 GB/s of memory bandwidth shared between 8 cores.

Bring back Rambus, I say.