We plan to buy new hardware for a small cluster in the next few months. Evaluating hardware by looking at published results is a bit difficult because benchmarks tend to fall into three categories:
- SPECmarks (which are OK but IMHO not 100% applicable for CFD-computations)
- Framerates in Quake 3/Doom 3 (which are interesting, but I don't think my boss would approve if I took this as the basis for a decision)
- other benchmarks, which tend to be integer-heavy and use little memory
So, to compare the hardware we get for testing, I wrote a Python script that runs various tutorial cases from the OpenFOAM distribution and compares the execution times with those of a reference machine. It then computes an average speedup relative to that machine.
The script can also be used to run the cases in parallel.
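The averaging step can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names, the case names, and the timings are all hypothetical, and the use of a plain arithmetic mean is an assumption.

```python
from statistics import mean


def speedups(reference_times, measured_times):
    """Per-case speedup: reference time divided by measured time."""
    return {
        case: reference_times[case] / measured_times[case]
        for case in reference_times
        if case in measured_times  # skip cases that did not run on the test machine
    }


def average_speedup(reference_times, measured_times):
    """Arithmetic mean of the per-case speedups (an assumed choice of average)."""
    return mean(speedups(reference_times, measured_times).values())


# Hypothetical wall-clock times in seconds, not measured data
reference = {"icoFoam/cavity": 10.0, "simpleFoam/pitzDaily": 120.0}
candidate = {"icoFoam/cavity": 5.0, "simpleFoam/pitzDaily": 80.0}

print(average_speedup(reference, candidate))  # 1.75
```

A geometric mean would be less sensitive to a single case with an unusually large speedup; which average suits best is a judgment call.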
I KNOW that trying to get a single number to gauge the performance of a computer system is the sign of extreme simple-mindedness but I'm trying it anyway (and of course it's not the only number I'm using)
The script is discussed in more detail at
and is part of
The script is quite stable (at least at my site). The problem is the benchmark suite, where half of the cases fail in parallel execution (due to problems during decomposePar or problems with the boundary conditions; it seems that some of these cases were never run in parallel). I'm planning to have a stable version by the end of the month.
Any comments on the approach/the script/the benchmark suite would be greatly appreciated (even "You got it all wrong").
(One thing that especially interests me is: what is faster, a) one DualCore-Opteron or b) two equivalent SingleCore-CPUs on one board? The last time I looked, the price for these two configurations was almost the same.)
Can't comment on the reliability of the benchmarks, but dual core vs single core depends entirely on the interconnect you plan to use.
A friend of mine did a bunch of benchmarks (using STAR, admittedly) with cheap gigabit ethernet and 3 AMD 3800 X2s. Using only 2 machines he got near 90% efficiency. Adding the third machine, however, dropped him down to around 60%. From this and other experiences I would say that unless you can afford a Myrinet or equivalent interconnect, stick with single-core CPUs. A gigabit backbone just doesn't have the capacity or low enough latency to carry two compute units per NIC. Even if you etherbond 2 or more NICs per box, you will still have latency issues.
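To put those percentages in perspective, here is a small sketch of the usual definition of parallel efficiency (speedup divided by the number of compute units). The function names are illustrative, and it is an assumption that the quoted efficiencies were counted per machine rather than per core.

```python
def parallel_efficiency(t_serial, t_parallel, n_units):
    """Efficiency = speedup / n_units, with speedup = t_serial / t_parallel."""
    return t_serial / (n_units * t_parallel)


def speedup_from_efficiency(efficiency, n_units):
    """Invert the definition: speedup = efficiency * n_units."""
    return efficiency * n_units


# Efficiencies quoted in the post, assumed to be counted per machine
print(speedup_from_efficiency(0.90, 2))  # two machines
print(speedup_from_efficiency(0.60, 3))  # three machines
```

Under that assumption both configurations come out at a speedup of about 1.8, which would mean the third machine bought essentially nothing.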
Can you report any problems with decomposePar?
It's not a problem with decomposePar per se: for instance in the dieselFoam/aachenBomb case there are two files (ft, fu) that don't have sufficient boundary conditions according to decomposePar:
--> FOAM FATAL IO ERROR : keyword walls is undefined in dictionary "/.automount/werner/Werner/bgschaid/bgschaid-foamStuff/Benchmark/dieselFoam_aachenBomb_standard.gcds07.cdratfd.unileoben.ac.at.case.runDir/0/ft::boundaryField"
(I'm fully aware that lagrangian particles usually do not parallelize very well, but that was the reason why I included that case)
Similar things happen with the other cases that fail (except for dnsFoam/boxTurb16: "FOAM FATAL ERROR : calculated number of cells is incorrect" when running dnsFoam).
I'll let you know if I find a real problem with decomposePar (and not a problem that has to do with model set-up)
Eugene wrote:
> Cant comment on the reliability of the benchmarks, but
> dual core vs single core depends entirely on the
> interconnect you plan to use.
Aren't you confusing a computer with a dual core processor with a cluster with two nodes here? Dual core is always better than single core, same processor frequency assumed.
I think what Eugene meant was: "if there are two CPUs on a board (in whatever form) as soon as you need a third CPU for your task you'll see that it would have been wiser to invest in good networking instead of fancy SMP-hardware"
@"dual core always better": if there's only one CPU you're right, but compared to a Dual-CPU-SingleCore-Board I'm not 100% sure, because, if I interpret the Processor diagrams I've seen correctly, on a DualCore the two cores have to share the same MemoryBus which could be a bottleneck. But nobody can tell me for sure whether this has an impact. That's why I want to benchmark.
The AMD HyperTransport memory bus is good enough that dual-core CPUs only take about a 10% hit in performance when running a 2-processor job.
The comment about the number of CPUs per NIC stands, though. It all depends on the number of foam processes that have to share the same communications interface. Basically, 2 cores/CPUs/processors per comms interface can potentially produce a bottleneck, because the NIC has to handle double the volume of interprocessor communication compared to a single process.
Channel bonding gigabit ethernet is a waste of time. Performance is not doubled and latency actually becomes worse. Since many (most?) dual Opteron motherboards include dual gigabit interfaces onboard, a useful approach is to buy a bigger network switch and connect both NICs on each node to it. This is key: you need to assign different IP addresses to each interface and basically make the single node look like two nodes by assigning it two host names.
So each nodeX would have two host names, nodeXa and nodeXb. When you launch your parallel runs on dual-CPU, dual-core Opteron nodes, you would list each hostname twice in the hosts file (nodeXa, nodeXa, nodeXb, nodeXb).
This will give each pair of processors one independent network interface to use as its own and avoids network contention issues. Latency for this setup is the same as for a dual-CPU single-core with one NIC. This same approach could be used for single-core dual-CPU nodes or for dual-core dual-CPU configurations with four NICs.
While this is a possible solution, in my mind, the cost of a dual-core, dual-CPU Opteron node is at the breaking point for investing in higher-end networking. Specifically, our new dual-dual cluster uses Infiniband. The cost of each node itself was on the order of $4k. The networking added roughly $1k per node over plain gigabit. I think that is a reasonable investment for significantly higher bandwidth and lower latency.
This is very interesting. So will traffic between 2 or more NICs on a single machine be balanced automatically, or is it managed by LAM/MPI?
I have two 8-way Opteron boxes here that I would really like to improve the interconnect for. If, as you say, I can just stick in more NICs and cables, that would be awesome. For some reason I had never considered this a possibility, and fiddling around with channel bonding got me nowhere.
There is no balancing to do. The IP addresses/hostnames are the identifier that MPI/PVM uses to identify processes. By doing as I outlined, you will be giving different pairs of parallel processes different IP addresses. Let's just talk about one PE per NIC for clarity for now. In MPI talk, it might be like this:
# Node 2
# Node 3
If the nodes have two CPUs each, then when we launch this job, each PE will have its own IP address and hence its own network interface to use. Traffic moving between PE0 and PE1 will not be sent to the switch. The IP stack will bounce it right back just like it would if the processes shared the same IP address, so no performance is lost. With dual cores, it is the same: two PEs will share an IP address and its corresponding network interface, and your hosts file will list the hostnames twice.
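A hosts file of the kind described above could be generated with a short sketch like this one. The nodeXa/nodeXb naming follows the scheme from the earlier post; the function name, its parameters, and the node count are hypothetical, and the exact hostfile syntax your MPI implementation expects may differ.

```python
def hostfile_lines(n_nodes, nics_per_node=2, pes_per_nic=1, prefix="node"):
    """Build hostfile lines with one hostname per NIC, repeated per PE.

    Each NIC on a node gets its own hostname (node1a, node1b, ...), and a
    hostname is listed once for every process that should use that NIC.
    """
    lines = []
    for node in range(1, n_nodes + 1):
        lines.append(f"# Node {node}")
        for nic in range(nics_per_node):
            hostname = f"{prefix}{node}{chr(ord('a') + nic)}"
            lines.extend([hostname] * pes_per_nic)
    return lines


# Dual-CPU, dual-core nodes with two NICs: two PEs share each NIC
print("\n".join(hostfile_lines(n_nodes=2, pes_per_nic=2)))
```

With `pes_per_nic=1` the same function gives the one-PE-per-NIC layout for dual-CPU single-core nodes.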
On your 8-way boxes, you have a few ways to go. You can buy a few cheap Intel e1000 cards and populate as many PCI slots as you can. Or you can spend a bit more and get the Intel dual- or quad-interface cards. I would also recommend that you at least check on Infiniband prices. You can "end-to-end" them so you would only need to buy two cards and a cable. That shouldn't be much more than $1200-$1500.
BTW, make sure that you add this line to the modules.conf file if you are running the e1000 cards under Linux:
options e1000 InterruptThrottleRate=80000,80000
Add an "80000" for each e1000 you have. The line above is for two interfaces. This greatly reduces network latency and gave about another 150 Mbps in bandwidth. With this tuning, I got latency numbers in the 25-µs range on our Xeon cluster. That is down from about 160 µs using the default settings.
Now why didn't I think of that? Thanks for the info.
I will see about getting a few PCI-X multi-channel cards as soon as I can get these monsters stable.
I know this is not really the forum for this but I have to ask since my patience is wearing thin: has anyone managed to get any of the opteron 8-way systems stable under load for protracted periods (week+)?
Hello Mattijs!
I didn't find any problems with decomposePar. The only two cases in the suite that I didn't get to run are
- dieselFoam/aachenBomb: the same problem as the one described by thomas in
- dnsFoam/boxTurb16: dnsFoam says
--> FOAM FATAL ERROR : calculated number of cells is incorrect
From function Kmesh::Kmesh(const fvMesh& mesh)
in file Kmesh/Kmesh.C at line 87.
no matter how I decompose the grid (simple/metis). My stupid question: does dnsFoam run in parallel?
Nope. For the forcing it uses fast Fourier transforms on a regular uniform mesh (Kmesh), and that does not parallelise. If you throw away the forcing, the solver will run in parallel.
decomposePar needs proper boundary conditions like any other code, but ft is not used anymore by dieselFoam, so that file can simply be removed.
Since decomposePar tries to decompose every file it finds in the directory it will obviously not work if the boundary conditions are wrong.
Has anyone tried correcting the boundary conditions, or simply removing ft, and then running decomposePar for the aachenBomb case?
@dnsFoam: I've marked it as non-parallel in the Benchmark-suite.
@dieselFoam: my script does that (removes ft and fu) and then the grid gets decomposed correctly. But as soon as dieselFoam runs in parallel I get the error described in the other thread.
Hi Bernhard,
The benchmark script is what I was looking for, since there are almost no CFD benchmarks available. I am also interested in a dual-core vs. two-CPU comparison, especially AMD Athlon 64 X2 4800+ vs. 2 x AMD Opteron 248 2.20GHz vs. AMD Opteron 265 2x 1.80GHz, which are all about the same in price.
The problem is that I don't get your PyFoam-0.2.2 script to run. If I do a: python setup.py install
I get: error: invalid Python installation: unable to open /usr/lib/python2.4/config/Makefile (No such file or directory). Any idea?
Python seems to be installed since /usr/lib/python2.4/ does exist but not the config directory.
Thanks for your help!
P.S. I never used python before!
Hi Duderino!
Which Linux-distribution are you using (I assume it's Linux)? (Python2.4 is only included in the most recent distributions)
Anyway: your python2.4 installation seems to be broken. To find out how badly broken it is, just type 'python' on the command line. You should then get an "interactive python shell". If you don't, the installation is very badly broken.
If you're lucky there is an older version of python still installed (call 'python2.3' or 'python23', 2.2 won't work with my scripts). Try that.
Feel free to contact me by EMail (if we find a solution we can post it to this forum but I don't think it is necessary to bother people with the intermediate steps)
Hello all!
First, concerning Jens' (Duderino) problem: it seems that Ubuntu Linux installs the files needed for a successful setup.py run only with the Python development package (try 'apt-get install python2.4-dev' or something similar).
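A quick way to check for the missing file before running setup.py is sketched below. It uses the `sysconfig` module of a modern Python rather than the Python 2.4 from the posts above, and the helper name `makefile_present` is made up for illustration.

```python
import os
import sysconfig


def makefile_present(path=None):
    """Return True if the interpreter's build Makefile exists on disk.

    If no path is given, ask sysconfig where this interpreter expects it.
    """
    if path is None:
        path = sysconfig.get_makefile_filename()
    return os.path.isfile(path)


# Show the expected location and whether the file is actually there
print(sysconfig.get_makefile_filename(), makefile_present())
```

If this reports `False`, installing the distribution's Python development package is the likely fix, as suggested above.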
The benchmark script and suite are now sufficiently stable to be thought of as 'beta quality'.
Some preliminary results can be found at
The parallel results are not too good (some would even say they're bad), but this was to be expected with cases in the suite that only use 11MB of memory (King Amdahl says hello). But I think some of the results are quite interesting (good speedup for Opteron-SMP compared to Xeon-SMP (with MultiThreading; that's not so good)).
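For readers who have not met King Amdahl: the law bounds the speedup by the fraction of the run that cannot be parallelised. A minimal sketch, where the parallel fraction of 0.8 is a purely hypothetical value and not something measured on the suite:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Hypothetical parallel fraction, chosen only to show how the curve flattens
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.8, n), 2))
```

Small cases tend to have a low effective parallel fraction, because fixed start-up and communication costs make up a larger share of the run.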
Feel free to add your results.
And of course: I'm still open to suggestions concerning the benchmark-suite.
Hello all,
I am looking for some volunteers to help me compare some machines. You just need to use Bernhard's Python script collection, which you can get from the links in the first message of this thread.
I would really like to see some benchmark results for Opteron 250 and faster systems. So if somebody happens to have such a system, please run the benchmark and publish the results on the wiki. This will definitely help me (and others) with choosing a new system.
Hi,
We are also planning on purchasing a new Linux cluster. It has basically already been decided to be an AMD Opteron Dual Node, Dual Core, 2.2GHz. I will start doing some benchmarking next week on a Dual Node Dual Core Opteron 280, 2.4GHz for up to 16 CPUs/cores. I will benchmark with both Gigabit network and Infiniband. Later on (in a week or so) I will have the opportunity to also try out a similar system but with InfiniPath and up to 32 CPUs/cores.
I will try to use your Python script, but I will also run a test with a 1M cell testcase in simpleFoam (A water turbine draft tube, anyone who would like the case can contact me to get it). As you have already mentioned, the testcases in the python-script are most likely way too small to say anything about real applications.
Does anyone have any suggestions on special settings I should use for the domain decomposition (I plan to use automatic metis), or on any specific OpenFOAM settings that could influence the benchmarking?