September 9, 2006, 20:58 |
#1 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Dear OpenFOAM experts:
I ran interFoam on an AMD X2 3800+ CPU (3 GB DDR, about 390,000 elements in total) using LAM (simple decomposition). The parallel run actually took longer than the serial run. I noticed that more iterations were needed to converge at each time step when running in parallel, and parallel runs also need extra communication. Is there any way to improve the parallel performance?
I am setting up a cluster of 6 workstations (each with either dual CPUs or a dual-core CPU), connected through a Gigabit switch. Most of my work uses interFoam/lesInterFoam. Now I am wondering whether it is even worth setting up the cluster, since most of my cases have fewer than 1 million elements. I would appreciate it if someone could shed some light on this. Pei |
|
September 11, 2006, 05:09 |
#2 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
It might be a problem with the dual-cores. Both processors have to share the same memory-bandwidth.
I recently did some benchmarks with a 600k-cell simpleFoam case. The cluster has dual-core, dual-CPU nodes connected with Gigabit Ethernet (2 interfaces, channel bonded). The speedups with all processes on 1 node are:
N=1: 1 (not too surprising)
N=2: 2.06 (quite OK, only one core used per CPU)
N=3: 2.09 (oops)
N=4: 2.69 (we're rising again)
If I do the same benchmark with only one process per node on up to 4 nodes (all communication over the network) I get:
N=2: 1.99
N=4: 3.72
which is not spectacular but OK. So I guess it's NOT an OpenFOAM problem but an architecture problem (benchmarks with Fluent on the same machines hint in the same direction).
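To make the two sets of numbers easier to compare, here is a minimal sketch that turns the speedups quoted above into parallel efficiency (speedup divided by number of processes). The function name and the dictionary layout are only illustrative and not part of OpenFOAM; the figures are the ones from this post.

```python
def parallel_efficiency(speedup, n_procs):
    """Return speedup divided by process count (1.0 means ideal scaling)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup / n_procs


# Speedups quoted in the post, keyed by number of processes.
single_node = {1: 1.0, 2: 2.06, 3: 2.09, 4: 2.69}
one_process_per_node = {2: 1.99, 4: 3.72}

for label, results in (("single node", single_node),
                       ("one process per node", one_process_per_node)):
    for n, s in sorted(results.items()):
        eff = parallel_efficiency(s, n)
        print(f"{label}: N={n} speedup={s:.2f} efficiency={eff:.2f}")
```

With four processes the efficiency is about 0.67 when everything is packed onto one node, against 0.93 when the processes are spread over four nodes.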
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request |
|
September 11, 2006, 05:51 |
#3 |
Senior Member
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,419
Rep Power: 26 |
What processors are these? Do the latest models still have these memory bandwidth problems?
Also does the channel bonding have any effect? I think OpenMPI does it automatically? |
|
September 11, 2006, 07:16 |
#4 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Hi Mattijs!
The processors I did the benchmarks with were dual-core Opteron 275s. There is still the possibility that the motherboard doesn't handle things as well as it could (it's a Tyan S2892) or that the kernel doesn't allocate memory optimally (it's only a 2.6.9; as far as I remember there have been some changes in the way memory gets allocated on NUMA-like architectures, but I don't know in which version). On the other hand, the official Fluent benchmarks show no such effects on dual-core machines, which might indicate that something is wrong with my setup (but I suspect they distribute the processes optimally onto the nodes, like I did for the second set).
To be honest, I didn't measure the effect of the channel bonding (it doesn't make things worse, that much I made sure of). The motherboard had two interfaces anyway, so the only cost of setting it up was the patch cables. The load distribution happens on a per-connection basis (this means that one connection can only send at 1 GBit, but a second connection can send at the same bandwidth at the same time; verified with iperf). Per-packet distribution (which could give 2 GBit per connection) should be just as easy to set up, but MIGHT have issues with out-of-order packets (and I figured that if I run more than one process per node, 2x1 GBit is as good as 1x2 GBit).
|
September 11, 2006, 15:12 |
#5 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Hi,
Thanks for the feedback. I have done some more testing, and this is what I found using interFoam with one Gigabit switch and nothing special (no channel bonding etc.):
N=1: 1
N=4: 2.47 (two dual-core systems)
N=6: 3.6 (two dual-core systems + 1 dual-CPU system)
Pei |
|
September 13, 2006, 09:19 |
#6 |
Member
Ola Widlund
Join Date: Mar 2009
Location: Sweden
Posts: 87
Rep Power: 17 |
A note on Bernhard's post from Sept 11: I have similar experiences running FLUENT (sorry) on a large cluster of nodes, each node having 2 AMD64 CPUs, with a Gigabit switch. When we use only one CPU per node, the speedup is reasonable, but when we use both CPUs on each node it is very bad. The problem is that both CPUs want to use the same network interface at more or less the same time, and latency is a lot greater in that case. This is consistent with Bernhard's tests on dual core.
A note on the positive side: as long as our problems have at least 150,000-200,000 cells per CPU, the statistics are quite OK even when we fill the nodes. You "just" need to give the CPUs a LOT of work between each data exchange... /Ola |
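As a rough illustration of that rule of thumb, the sketch below computes how many processes a case can be split into while keeping at least 150,000 cells on each. The threshold comes from the FLUENT experience described above; whether it carries over to OpenFOAM or to other hardware is an assumption, and the function name is made up for this example.

```python
def max_processes(total_cells, min_cells_per_process=150_000):
    """Largest process count that keeps every process at or above the threshold.

    Always returns at least 1, since a serial run is always possible.
    """
    if total_cells < 0 or min_cells_per_process <= 0:
        raise ValueError("cell counts must be positive")
    return max(1, total_cells // min_cells_per_process)


# Case sizes mentioned in this thread: 390k, 600k and about 1 million cells.
for cells in (390_000, 600_000, 1_000_000):
    print(f"{cells} cells -> at most {max_processes(cells)} processes")
```

For the case sizes mentioned in this thread this gives 2, 4 and 6 processes respectively.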
|
September 13, 2006, 09:44 |
#7 |
New Member
Sreekanth Pannala
Join Date: Mar 2009
Posts: 6
Rep Power: 17 |
Did you have shmem enabled on the nodes? If so, communication within a node should be a memory-to-memory copy instead of going through the network interface. The only concern, depending on how the memory is managed, is that one might fill up the pipeline or get a lot of cache conflicts when running large problems, but that should not happen with small (~10K/CPU) problems. I do not have much experience benchmarking OF or Fluent, but I should be experimenting with OF in the near future. I will definitely share the results when I have them.
Cheers! Sreekanth |
|
September 13, 2006, 16:27 |
#8 |
Assistant Moderator
Bernhard Gschaider
Join Date: Mar 2009
Posts: 4,225
Rep Power: 51 |
Concerning the performance of memory-intensive applications on dual-core machines, I found this paper:
www.novell.com/collateral/4622016/4622016.pdf
I'm afraid the situation with OpenFOAM is quite similar to the STREAM benchmark shown in Figure 1: no good speedup with dual cores. If the number of cores is the same, single-core SMP makes more Foam than dual-core. Much more. (At least for AMD dual cores; does anyone have access to Intels?)
|
September 14, 2006, 10:15 |
#9 |
Senior Member
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 18 |
Hi, Bernhard,
I have two Dell Precision 380 systems, each with a Pentium D 3.2 GHz CPU. I will be happy to do some testing.
I made a mistake in my earlier post: the numbers there were based on ExecutionTime. For N=6, the real-time speedup is only about 2. That is, clockTime is much longer than executionTime, so I guess the parallel run spent a lot of time communicating. I am running another test: 4 CPUs, with only 1 CPU (or 1 core) per workstation. HyperThreading was disabled on all workstations, as suggested.
Bernhard, I am wondering if you can share how your cluster is set up. I think that my cluster is very primitive, and since I am just getting into the cluster area, I am hoping that I can follow your setup to improve the efficiency. Pei |
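For what it is worth, the gap between the two speedups can be turned into a rough estimate of how much of the parallel wall-clock time is not CPU time. This is only a sketch: it assumes that clockTime and executionTime are about equal in the serial run, and the function name is made up for this example. The inputs are the 3.6 reported earlier for N=6 and the real-time speedup of about 2.

```python
def wall_clock_overhead(exec_speedup, clock_speedup):
    """Fraction of parallel wall-clock time not accounted for by CPU time.

    With serial time T (assumed equal for clockTime and executionTime),
    parallel executionTime is T / exec_speedup and parallel clockTime is
    T / clock_speedup, so the fraction is 1 - clock_speedup / exec_speedup.
    """
    if exec_speedup <= 0 or clock_speedup <= 0:
        raise ValueError("speedups must be positive")
    return 1.0 - clock_speedup / exec_speedup


# N=6: speedup 3.6 on executionTime, about 2 on clockTime.
fraction = wall_clock_overhead(3.6, 2.0)
print(f"about {fraction:.0%} of the wall-clock time is not CPU time")
```

Under that assumption, a bit over 40% of the N=6 wall-clock time goes to something other than computing, which would fit the guess about communication.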