Home > Forums > OpenFOAM Running, Solving & CFD

Performance of interFoam running in parallel

September 9, 2006, 19:58   #1
Pei-Ying Hsieh (Senior Member)
Dear OpenFOAM experts:

I ran interFoam on an AMD X2 3800+ CPU (3 GB DDR; the mesh has about 390,000 elements) using LAM with simple decomposition. It actually took longer than running in serial. I noticed that it required more iterations to converge at each time step when running in parallel, and the parallel run also needs extra communication. Is there any way to improve the parallel performance?

I am setting up a cluster with 6 workstations (each with either dual CPUs or a dual-core CPU), connected by a Gigabit switch. Most of my work uses interFoam/lesInterFoam. Now I am wondering whether it is even worth setting up the cluster (most of my cases have fewer than 1 million elements).

I would appreciate it if someone could shed some light on this.

Pei
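For readers following along, the "simple decompose" Pei refers to is configured in system/decomposeParDict. A minimal sketch for a 4-way split (the FoamFile header is omitted and the split directions are illustrative, not taken from the post):

```
// system/decomposeParDict (FoamFile header omitted; values illustrative)

numberOfSubdomains 4;

method          simple;

simpleCoeffs
{
    n           (2 2 1);    // split the domain 2 x 2 x 1
    delta       0.001;
}
```

The case is then split with decomposePar before launching the solver in parallel.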

September 11, 2006, 04:09   #2
Bernhard Gschaider (Assistant Moderator)
It might be a problem with the dual cores: both processors have to share the same memory bandwidth.

I recently ran some benchmarks with a 600k-cell simpleFoam case. The cluster has dual-core, dual-CPU nodes connected with Gigabit Ethernet (2 interfaces, channel bonded).

The speedups with all processes on 1 node are:
N=1 1.00 (not too surprising)
N=2 2.06 (quite OK, only one core used per CPU)
N=3 2.09 (oops)
N=4 2.69 (we're rising again)

If I do the same benchmark with only one process per node on up to 4 nodes (all communication over the network), I get:
N=2 1.99
N=4 3.72
which is not spectacular, but OK.
So I guess it's NOT an OpenFOAM problem but an architecture problem (benchmarks with Fluent on the same machines point in the same direction).
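Speedup alone can hide how far from ideal these runs are; parallel efficiency (speedup divided by process count) makes it explicit. A quick sketch using only the numbers quoted above:

```python
# Parallel efficiency = speedup / number of processes (1.0 is ideal scaling).
def efficiency(speedup, n_procs):
    return speedup / n_procs

# All processes on one dual-core, dual-CPU node:
single_node = {1: 1.00, 2: 2.06, 3: 2.09, 4: 2.69}
# One process per node, all communication over Gigabit Ethernet:
one_per_node = {2: 1.99, 4: 3.72}

for n, s in single_node.items():
    print(f"1 node,  N={n}: efficiency {efficiency(s, n):.2f}")
for n, s in one_per_node.items():
    print(f"N nodes, N={n}: efficiency {efficiency(s, n):.2f}")
```

Filling the single node drops efficiency to about 0.67 at N=4, while one process per node keeps it at 0.93, which is what points the finger at memory bandwidth rather than the interconnect.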
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request

September 11, 2006, 04:51   #3
Mattijs Janssens (Super Moderator)
What processors are these? Do the latest models still have these memory-bandwidth problems?

Also, does the channel bonding have any effect? I think OpenMPI does it automatically.

September 11, 2006, 06:16   #4
Bernhard Gschaider (Assistant Moderator)
Hi Mattijs!

The processors I ran the benchmarks on were dual-core Opteron 275s. It is still possible that the motherboard doesn't handle things as well as it could (it's a Tyan S2892), or that the kernel doesn't allocate memory optimally (it's only a 2.6.9; as far as I remember there have been some changes in the way memory gets allocated on NUMA-like architectures, but I don't know in which version).
On the other hand, the official Fluent benchmarks show no such effects on the dual-core machines, which might indicate that something is wrong with my setup (but I suspect they distribute the processes optimally onto the nodes, as I did for the second set).

To be honest, I didn't measure the effect of the channel bonding (I did make sure it doesn't make things worse). The motherboard had two interfaces anyway, so the only cost of setting it up was the patch cables. The load distribution happens on a per-connection basis: a single connection can only send at 1 GBit, but a second connection can send at the same bandwidth at the same time (verified with iperf). Per-packet distribution (which could give 2 GBit per connection) should be just as easy to set up, but MIGHT have issues with out-of-order packets (and I figured that if I run more than one process per node, 2x1 GBit is as good as 1x2 GBit).

September 11, 2006, 14:12   #5
Pei-Ying Hsieh (Senior Member)
Hi,

Thanks for the feedback.

I have done some more testing, and this is what I found:

interFoam was used, with one Gigabit switch and no special setup (no channel bonding, etc.):

N=1 1.00
N=4 2.47 (two dual-core systems)
N=6 3.6 (two dual-core systems + 1 dual-CPU system)

Pei

September 13, 2006, 08:19   #6
Ola Widlund (Member, Sweden)
A note on Bernhard's post from Sept 11: I have similar experiences running FLUENT (sorry) on a large cluster of nodes, each node having 2 AMD64 CPUs, connected by a Gigabit switch. When we use only one CPU per node, the speedup is reasonable, but using both CPUs on each node it's very bad. The problem is that both CPUs want to use the same network interface at more or less the same time, so latency is much greater in that case. This is consistent with Bernhard's tests on dual core.

A note on the positive side: as long as our problems have at least 150,000-200,000 cells per CPU, the performance is quite OK even when we fill the nodes. You "just" need to give the CPUs a LOT of work between each data exchange...

/Ola
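Ola's rule of thumb is easy to check for a given decomposition; a small sketch (the 390,000-cell figure is Pei's from the opening post):

```python
# Check a decomposition against the ~150k-200k cells-per-CPU rule of thumb above.
def cells_per_proc(n_cells, n_procs):
    return n_cells // n_procs

# Pei's case from the opening post: ~390,000 elements on 4 processes
load = cells_per_proc(390_000, 4)
print(load)  # 97500 -- below the suggested 150,000 floor
```

By this guideline, Pei's case would saturate at 2-3 processes before communication starts to dominate.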

September 13, 2006, 08:44   #7
Sreekanth Pannala (New Member)
Did you have shmem enabled on the nodes? If so, communication within a node should be a memory-to-memory copy instead of going through the network interface. The only concern, depending on how the memory is managed, is that one might fill up the pipeline or get a lot of cache conflicts when running large problems; that should not happen with small (~10K cells/CPU) problems. I do not have much experience benchmarking OpenFOAM or Fluent, but I should be experimenting with OpenFOAM in the near future. I will definitely share the results when I have them.

Cheers!

Sreekanth

September 13, 2006, 15:27   #8
Bernhard Gschaider (Assistant Moderator)
Concerning the performance of memory-intensive applications on dual-core machines, I found this paper:

www.novell.com/collateral/4622016/4622016.pdf

I'm afraid the situation with OpenFOAM is quite similar to the STREAM benchmark shown in Figure 1: no good speedup with dual cores.

For the same number of cores, single-core SMP makes more Foam than dual-core. Much more. (At least for AMD dual cores; does anyone have access to Intel machines?)

September 14, 2006, 09:15   #9
Pei-Ying Hsieh (Senior Member)
Hi Bernhard,

I have two Dell Precision 380 systems, each with a Pentium D 3.2 GHz CPU. I will be happy to do some testing.

I made a mistake in my earlier post: the numbers there were based on ExecutionTime. For N=6, the real-time speedup is only about 2; that is, clockTime is much longer than executionTime. I guess the parallel run spent a lot of time communicating.

I am running another test: 4 CPUs, with only 1 CPU (or 1 core) per workstation. HyperThreading was disabled on all workstations as suggested.

Bernhard, I am wondering if you can share how your cluster is set up. My cluster is very primitive, and I am hoping I can follow your setup to improve its efficiency, since I am just getting into clusters.

Pei
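The ExecutionTime/ClockTime pair Pei mentions is printed at every time step in an OpenFOAM solver log, so the overhead can be estimated by comparing the two cumulative totals. A rough sketch (the log-line format is assumed from standard solver output):

```python
import re

# Parse ExecutionTime/ClockTime pairs from an OpenFOAM solver log; the ratio of
# the final (cumulative) pair is a rough indicator of time spent outside
# computation, e.g. in communication and I/O.
LINE = re.compile(r"ExecutionTime = ([\d.]+) s\s+ClockTime = ([\d.]+) s")

def comm_overhead(log_text):
    exec_t = clock_t = 0.0
    for m in LINE.finditer(log_text):
        exec_t, clock_t = float(m.group(1)), float(m.group(2))
    # The last pair in the log is the cumulative total; None if nothing matched.
    return clock_t / exec_t if exec_t else None

sample = "ExecutionTime = 100.5 s  ClockTime = 201 s\n"
print(comm_overhead(sample))  # 2.0 -> roughly half the wall time is not computation
```

A ratio near 1 means the run is compute-bound; Pei's N=6 result (clockTime about twice executionTime) corresponds to a ratio of about 2.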




