Dear OpenFOAM experts:
I ran interFoam on an AMD X2 3800+ CPU (3 GB DDR; the mesh has about 390,000 elements) using LAM with simple decomposition. The parallel run actually took longer than the serial run. I noticed that it needed more iterations to converge at each time step when running in parallel, and a parallel run also needs extra communication. Is there any way to improve the parallel performance?
I am setting up a cluster of 6 workstations (each with either dual CPUs or a dual-core CPU), connected through a Gigabit switch. Most of my work uses interFoam/lesInterFoam. Now I am wondering whether it is even worth setting up the cluster (most of my cases have fewer than 1 million elements).
I would appreciate it if someone could shed some light on this.
It might be a problem with the dual-cores. Both cores have to share the same memory bandwidth.
I recently did some benchmarks with a 600k-cell simpleFoam case. The cluster has dual-core, dual-CPU nodes connected with Gigabit Ethernet (2 interfaces, channel bonded).
The speedups with all processes on 1 node are:
N=1 1.00 (not too surprising)
N=2 2.06 (quite OK, only one core used per CPU)
N=3 2.09 (Oops)
N=4 2.69 (We're rising again)
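A quick way to read these numbers is as parallel efficiency (speedup divided by the number of processes). The snippet below is only a minimal sketch added for illustration; the function name is made up, and the values are simply the four single-node speedups listed above.

```python
# Speedups listed above for 1-4 processes on a single node.
speedups = {1: 1.00, 2: 2.06, 3: 2.09, 4: 2.69}


def parallel_efficiency(speedup, n_procs):
    """Return speedup / n_procs; 1.0 would be ideal scaling."""
    return speedup / n_procs


efficiencies = {n: parallel_efficiency(s, n) for n, s in speedups.items()}

for n, eff in efficiencies.items():
    print(f"N={n}: speedup {speedups[n]:.2f}, efficiency {eff:.2f}")
```

The efficiency stays near 1 while each process has a CPU to itself and drops to roughly 0.7 once two processes share a dual-core CPU, which fits the memory-bandwidth explanation.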
If I do the same benchmark with only one process per node on up to 4 nodes (all communication over the network), I get
which is not spectacular but OK.
So I guess it's NOT an OpenFOAM problem but an architecture problem (benchmarks with Fluent on the same machines hint in the same direction).
What processors are these? Do the latest models still have these memory bandwidth problems?
Also, does the channel bonding have any effect? I thought OpenMPI does it automatically?
Hi Mattjis!
The processors I did the benchmarks with were dual-core Opteron 275s. It is still possible that the motherboard doesn't handle things as well as it could (it's a Tyan S2892) or that the kernel doesn't allocate memory optimally (it's only a 2.6.9; as far as I remember there have been some changes in the way memory gets allocated on NUMA-like architectures, but I don't know in which version).
On the other hand, the official Fluent benchmarks show no such effects on the dual-core machines, which might indicate that something is wrong with my setup (but I suspect they distribute the processes optimally onto the nodes, like I did for the second set).
To be honest, I didn't measure the effect of the channel bonding (it doesn't make things worse, that much I made sure of). The MoBo had two interfaces anyway, so the only cost of setting it up was the patch cables. The load distribution happens on a per-connection basis (this means that one connection can only send at 1 Gbit/s, but a second connection can send at the same bandwidth at the same time; verified with iperf). Per-packet distribution (which could give 2 Gbit/s per connection) should be as easy to set up, but MIGHT have issues with out-of-order packets (and I figured that if I run more than one process per node, 2x1 Gbit/s is as good as 1x2 Gbit/s).
Hi,
Thanks for the feedback.
I have done some more testing and this is what I found:
interFoam was used, with a single Gigabit switch and nothing special (no channel bonding, etc.).
N=4 2.47 (two dual-core systems)
N=6 3.6 (two dual-core systems + 1 dual-CPU system)
A note on Bernhard's post from Sept 11: I have similar experiences running FLUENT (sorry) on a large cluster of nodes, each with 2 AMD64 CPUs and a Gb switch. When we use only one CPU per node, the speedup is reasonable, but when we use both CPUs on each node it's very bad. The problem is that both CPUs want to use the same network interface at more or less the same time, and latency is a lot greater in that case. This is consistent with Bernhard's tests on dual core.
A note on the positive side: As long as our problem has at least 150,000-200,000 cells per CPU, the statistics are quite OK even when we fill the nodes. You "just" need to give the CPUs a LOT of work between each data exchange...
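To make that rule of thumb concrete, here is a small sketch that estimates how many processes a mesh can feed before it drops below a chosen cells-per-CPU floor. The function names are made up for illustration, and the 150,000 floor is just the lower end of the range quoted above for one FLUENT cluster; it is an assumption that it carries over to other solvers or hardware.

```python
# Lower end of the cells-per-CPU range quoted above (an experience value
# from one cluster, not a general limit).
MIN_CELLS_PER_CPU = 150_000


def cells_per_process(total_cells, n_procs):
    """Average number of cells each process gets for an even split."""
    return total_cells / n_procs


def max_procs_by_rule(total_cells, min_cells=MIN_CELLS_PER_CPU):
    """Largest process count that keeps every process at or above min_cells."""
    return max(1, total_cells // min_cells)


# Mesh sizes mentioned earlier in this thread.
for cells in (390_000, 600_000, 1_000_000):
    n = max_procs_by_rule(cells)
    print(f"{cells} cells: up to {n} processes "
          f"({cells_per_process(cells, n):.0f} cells each)")
```

By this estimate, cases below 1 million cells would only keep a handful of CPUs busy when the nodes are filled, so the floor matters for small clusters too.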
Did you have shmem enabled on the nodes? If so, communication within a node should be a memory-to-memory copy instead of going through the network interface. The only concern, depending on how the memory is managed, is that one might fill up the pipeline or get a lot of cache conflicts when running large problems, but that should not happen with small (~10K/CPU) problems. I do not have much experience benchmarking OF or Fluent, but I should be experimenting with OF in the near future. I will definitely share the results when I have them.
Concerning the performance of memory-intensive applications on dual-core machines, I found this paper:
I'm afraid the situation with OpenFOAM is quite similar to the STREAM benchmark shown in Figure 1: no good speedup with dual-cores.
If the number of cores is the same, single-core SMP makes more Foam than dual-core. Much more. (At least for AMD dual-cores; does anyone have access to Intels?)
Hi Bernhard,
I have two Dell Precision 380 systems, each with a Pentium D 3.2 GHz CPU. I will be happy to do some testing.
I made a mistake in my earlier post. The numbers there were based on ExecutionTime. For N=6, the real-time speedup is only about 2; that is, clockTime is much longer than executionTime. I guess the parallel run spent a lot of time communicating.
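For anyone who wants to compare both timings, below is a minimal sketch that pulls the last ExecutionTime/ClockTime pair out of a solver log and computes the speedup both ways. It assumes the log contains lines of the form "ExecutionTime = ... s  ClockTime = ... s"; the helper names and the sample numbers are made up for illustration and are not measurements from this thread.

```python
import re

# Assumed log line format: "ExecutionTime = 12.3 s  ClockTime = 15 s"
TIME_LINE = re.compile(
    r"ExecutionTime\s*=\s*([\d.eE+-]+)\s*s\s+ClockTime\s*=\s*([\d.eE+-]+)\s*s"
)


def last_times(log_text):
    """Return (execution_time, clock_time) in seconds from the last match."""
    matches = TIME_LINE.findall(log_text)
    if not matches:
        raise ValueError("no ExecutionTime/ClockTime line found")
    exec_s, clock_s = matches[-1]
    return float(exec_s), float(clock_s)


def speedup(serial_s, parallel_s):
    """Serial time divided by parallel time."""
    return serial_s / parallel_s


# Made-up sample logs, only to show the calculation.
serial_log = "Time = 0.1\nExecutionTime = 1200 s  ClockTime = 1210 s\n"
parallel_log = "Time = 0.1\nExecutionTime = 333 s  ClockTime = 600 s\n"

serial_exec, serial_clock = last_times(serial_log)
par_exec, par_clock = last_times(parallel_log)

print(f"speedup from ExecutionTime: {speedup(serial_exec, par_exec):.2f}")
print(f"speedup from ClockTime:     {speedup(serial_clock, par_clock):.2f}")
```

The ClockTime figure is the one that reflects what you actually wait for, since it includes time spent on communication.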
I am running another test: 4 CPUs, with only 1 CPU (or 1 core) per workstation. HyperThreading was disabled on all workstations, as suggested.
Bernhard, I am wondering if you can share how your cluster is set up. I think that my cluster is very primitive, and since I am just getting into clusters I am hoping that I can follow your setup to improve the efficiency.