CFD Online Discussion Forums

OpenFOAM 13 Intel quadcore parallel results (http://www.cfd-online.com/Forums/openfoam-solving/59171-openfoam-13-intel-quadcore-parallel-results.html)

msrinath80 June 6, 2007 14:59

Hello OF lovers,

I'm back with another parallel study, this time on an Intel machine with 8 cores (i.e. two quad-core CPUs). The machine has around 8 GB of memory, so I had to restrict myself to a 6.5-7 GB vortex-shedding case.

Here is the excerpt from cat /proc/cpuinfo:

madhavan@frzserver:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4659.18
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.05
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.07
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.16
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 7
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
bogomips : 4655.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

The actual clock speed rises to 2.33 GHz once there is some activity on any of the CPU cores.

Anyways, without further ado, I present the parallel scale-up results below:

http://www.ualberta.ca/~madhavan/ope..._frzserver.jpg

As you can see, the scale-up is very good for 2 processors (in fact slightly better than the ideal case). However, as one moves to 4 and 8 cores, the situation deteriorates rapidly.

STREAM[1] memory bandwidth benchmarks show that the quad-core features superior memory bandwidth (approx. 2500 MB/s) compared to, for instance, an AMD Opteron 248 (approx. 1600 MB/s). However, if that bandwidth is split among 4 cores, we end up with a per-core bandwidth of 625 MB/s. Even if the Opteron in question were a dual-core, the per-core memory bandwidth would still work out to 800 MB/s, which is definitely an edge over 625 MB/s. Interestingly, the 4 MB shared L2 cache in the quad-core does not seem to influence the scale-up for the 4 or 8-core cases.
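The per-core figures above are simple division. As a rough sketch (assuming, as the post does, that the sustained bandwidth is shared evenly by all active cores, which is a simplification; the function name is made up for illustration):

```python
# Simplified model: every active core gets an equal share of the
# sustained memory bandwidth. Real contention is messier than this.
def per_core_bandwidth(total_mb_s, active_cores):
    """Evenly split a CPU's sustained bandwidth across its active cores."""
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    return total_mb_s / active_cores

# Approximate STREAM figures quoted in the post (MB/s).
xeon_quad = per_core_bandwidth(2500, 4)     # Intel quad-core, all 4 cores busy
opteron_dual = per_core_bandwidth(1600, 2)  # hypothetical dual-core Opteron

print(f"quad-core Xeon:    {xeon_quad:.0f} MB/s per core")
print(f"dual-core Opteron: {opteron_dual:.0f} MB/s per core")
```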

I would also like to draw your attention to the red dot (triangle) in the above graph. As a quad-core is basically an MCM (multi-chip module), it is made by putting two dual-core dies into one processor package. That red triangle is the 2-core scale-up result when I placed both processes on the same physical CPU (i.e. on one of the dual-core units). If I instead placed one process on one dual-core unit and the other process on the other dual-core unit of the same processor, I would get a slightly better result, but still well below the ideal curve.

Bottom line, based on the above results: Intel quad-cores are not really good news. Of course, this is just based on one kind of study. I would appreciate more feedback from others with access to quad-core systems (both Intel and AMD).

[1] http://www.cs.virginia.edu/stream/

msrinath80 June 6, 2007 15:03

I forgot to mention that the good (slightly better than ideal) 2-core result you see in the above graph was obtained while scheduling both icoFoam processes on different physical CPUs. In other words each icoFoam instance enjoyed the full 2500 MB/s.
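To see which logical processors sit on which physical CPU, the "processor" and "physical id" fields of /proc/cpuinfo can be grouped. Below is a minimal sketch; the function name is made up for illustration, and the sample text only reproduces the processor / physical id pairs from the listing above.

```python
from collections import defaultdict

def cores_by_package(cpuinfo_text):
    """Map each 'physical id' to the logical 'processor' numbers on it."""
    packages = defaultdict(list)
    processor = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id" and processor is not None:
            packages[int(value)].append(processor)
    return dict(packages)

# Abbreviated sample: (processor, physical id) pairs taken from the listing above.
pairs = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1)]
sample = "\n".join(f"processor : {cpu}\nphysical id : {pkg}\n" for cpu, pkg in pairs)

layout = cores_by_package(sample)
print(layout)
```

On a live Linux machine the same function could be fed the contents of /proc/cpuinfo. Pinning a process to the chosen processors is a separate step (for example with taskset or Python's os.sched_setaffinity); the post does not say which mechanism was used.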

nikos June 7, 2007 04:12

Did you run a test with a smaller case, so that it could fit in the cache?
Is your kernel tuned?

msrinath80 June 7, 2007 04:44

Nope... Did not do that. To my knowledge the kernel is certainly tuned. If you can be more specific, I could furnish more information.

The important thing I want to know is whether anyone has found a different trend on an Intel quad-core.

nikos June 7, 2007 05:14

If I remember correctly, on an SMP computer the kernel.shmmax value (which can be tuned in /etc/sysctl.conf) can affect the performance of the MPI communicator ... see the LAM/MPI user guide.

Too high a value can hurt performance.

Correct me if I'm wrong, but with MPICH, for example, you can set something like P4_GLOBMEMSIZE.

ipcs -a can give you more info.

rmorgans June 7, 2007 08:59

We are (hopefully if it runs tonight!) benchmarking a quad core (Q6600), with an interacting bluff body vortex shedding case.

We're interested in this discussion and will keep you posted.

Rick

msrinath80 June 7, 2007 09:17

Thanks. Looking forward to your results. :)

@Nicolas: I am aware of the P4_GLOBMEMSIZE env variable for MPICH. For LAM, however, I've never had any complaints of shared memory shortage. The default value of 33554432 (32 MB?) is what comes up when I issue cat /proc/sys/kernel/shmmax.

Do you have any pointers?

nikos June 7, 2007 10:45

Which communicator do you use? If LAM, you can use the sysv rpi module to use shared memory on the same node ... something like mpirun -ssi rpi sysv

Even though the application is different, on large Oracle-type clusters they increase shmmax to 2 GB.

Increasing it to 128 or 256 MB will do the trick ...
Too high a value can decrease performance.

The kernel should be tuned for parallel computation on each side: memory and network.

By default, Linux kernel values are not appropriate for parallel computation (SMP and distributed) or for NFS services. If you want to obtain the best performance, you must adjust your kernel for your computation.

In summary, small cases need practically no adjustment (except for the network), but as soon as you are out of the memory cache ...

:->
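kernel.shmmax is given in bytes, so the sizes discussed above have to be converted before they go into /etc/sysctl.conf. A small sketch of that arithmetic (the helper names are made up for illustration, and "MB" is taken to mean 1024 * 1024 bytes, which matches the 33554432 default quoted earlier):

```python
MIB = 1024 * 1024  # bytes per MB, in the binary sense assumed here

def mib_to_bytes(mib):
    """Convert a size in MB (binary) to bytes."""
    return mib * MIB

def sysctl_line(mib):
    """Format an /etc/sysctl.conf entry for kernel.shmmax."""
    return f"kernel.shmmax = {mib_to_bytes(mib)}"

default_bytes = 33554432  # value reported by cat /proc/sys/kernel/shmmax
print(f"default: {default_bytes // MIB} MB")

# The two sizes suggested in the post above.
for size in (128, 256):
    print(sysctl_line(size))
```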

msrinath80 June 7, 2007 22:25

Thanks for your input. Just one observation. Wallclock times are better indicators for parallel speedup estimates. Unless you actually meant wallclock times?

But based on what I see, you have a speedup close to 2.3 for 4 processes. So I guess our results agree.

msrinath80 June 7, 2007 22:32

@Nicolas

Thanks for the pointers. However, I don't see the need to tweak any of those settings. Based on my Google search, the majority of those who change shmmax et al. are working with databases (e.g. Oracle). I searched the PETSc and LAM/MPI mailing lists to see if anyone has reported significant improvement from tweaking those settings, and could not find anything. If you have experienced benefits, can you state the nature of the parallel case you tested (how many cells etc.) and the difference in speedup between the default setting of 32 MB and 128 or 256 MB?

Once again thanks for your help!

nikos June 8, 2007 04:43

OK, I'm too busy to do some tests on our cluster and unfortunately also to come to the workshop ...

An interesting benchmark would be to go through Ethernet ... I mean using multiple Ethernet devices on the same host with Open MPI, because with Open MPI you can have multiple Ethernet devices on the same host.

LAM doesn't like this (because of lamd) ...

When I have more time I will try it.

Have a nice day.

PS: when you post a benchmark, please state the OS, compiler and communicator.

msrinath80 June 12, 2007 08:14

Update: I changed SHMMAX and SHMALL in /etc/sysctl.conf and rebooted. /proc/sys/kernel/shm* displays the new values I set. However, the speedup was again the same 1.2 when using both cores from the same physical CPU. Apparently the change has no effect on parallel speedup.

OS: RHEL 4.x (Scientific Linux 4.1)
Compiler: OpenFOAM stock supplied
Communicator: LAM

nikos June 14, 2007 08:10

Hi,
following this discussion I've done some tests
on a small 2D case (100000 cells) using Open MPI.

First, compiling OF 1.4 with -march=nocona on an EM64T processor reduces total CPU time by 10% for a serial case and 5% for a parallel case.

Second, I get the same results concerning speedup, i.e.
2 cores: 1.46
4 cores: 2.06
Warning: since I have only one quad-core CPU, the 4-core result may be polluted by OS activity (CentOS 4).
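For comparing numbers across posts, speedup is the serial time divided by the parallel time, and parallel efficiency is speedup divided by the number of cores (1.0 would be the ideal curve). A minimal sketch using the two speedups reported above; the function names are illustrative only:

```python
def speedup(serial_time, parallel_time):
    """Speedup of a parallel run relative to the serial run (same case)."""
    return serial_time / parallel_time

def efficiency(speedup_value, cores):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup_value / cores

# Speedups reported in the post above, keyed by core count.
reported = {2: 1.46, 4: 2.06}

for cores, s in reported.items():
    print(f"{cores} cores: speedup {s:.2f}, efficiency {efficiency(s, cores):.3f}")
```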

eugene February 5, 2008 06:26

You can't have dedicated memory channels for each core unless you are prepared to make the CPU package a lot bigger to incorporate the additional pins needed for 64-bit interfaces. In addition, the FSB-based design does not allow for individual memory channels, which will only appear on Intel chips when CSI is released with Nehalem later this year.

The memory bottleneck is one of the main reasons AMD CPUs with HyperTransport can still be competitive in some applications.

Unfortunately, there isn't much you can do other than buying lower latency memory to make things faster. You only have around 10 GB/s of memory bandwidth shared between 8 cores.

Bring back rambus I say.

