4 cpu motherboard for CFD
Has anyone used 4-CPU motherboards for CFD? With a 4-CPU motherboard and four Opteron 6174 processors you can build a compact 48-core machine. For example this motherboard:
Which setup would be faster?
- A cluster of 2 computers, each with a 2-CPU motherboard and two Opteron 6174s
- A single machine with a 4-CPU motherboard and four Opteron 6174s
Aren't the 2-processor-per-board and 4-processor-per-board chips different part numbers? For Opterons they are now the 2000 and 8000 series, I believe. Typically the 4-processor-per-board models are significantly more expensive.
I would think that, with a decent interconnect, the two-board solution is probably faster and cheaper. You will probably get more memory bandwidth per core with the two-board solution.
The 4-processor solution, together with the 12-core Opterons (6100 series, aka Magny Cours) and the soon-to-be-available 6200 series (aka Interlagos), which should fit into the same board, is really very popular at the moment in the HPC community because it offers a very good price/performance ratio. The memory bandwidth is also quite nice. For a benchmark you might read here:
If you can wait: the prices of Interlagos should be even more competitive, but the first benchmarks for the already available desktop FX series indicate that you need some compiler tuning to get full performance:
CFD performance with unstructured grids on AMD's multi-socket boards is extremely poor. This article from AnandTech tries to investigate why. I am assuming that Interlagos won't fix this entirely.
The best price/performance for CFD available now is far and away Intel's desktop chips. Four i5 2400 machines, which you can build for as little as $300 each, would blow your two choices out of the water. With just four machines you can get away with just a gig-e network.
Or, you could wait a week and get the new Intel Sandy Bridge E chips, which have six cores and an absolutely ridiculous amount of memory bandwidth. The machines would cost a little more than ones using the current Sandy Bridge chips, but the performance should be significantly higher as well. It definitely would be way cheaper, and way faster, than buying server-class hardware from AMD.
I doubt that Euler3d results are representative of general CFD performance. On this system
the speedup was almost linear.
Also, Gigabit Ethernet interconnects are not a good choice if you want top performance.
From my point of view you are better off with Intel chips at the moment if the licensing model of your CFD code is per core. If this doesn't matter, Opterons are often the better alternative. But as we saw before, this is not generally valid, so it is best to run benchmarks of your own code before buying.
Gigabit ethernet is good enough for very small clusters. I had a four node cluster with gigabit ethernet that scaled from one to four nodes at 90% efficiency. Infiniband would take that up to what, 93%? For the money I could just buy another node and get ~20% speedup instead of ~3%.
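The arithmetic behind that trade-off can be sketched roughly as follows. This is only an illustration: the 90% figure is the measured one from the post, the 93% figure is the poster's own guess for Infiniband, and the assumption that a fifth node would still run at 90% efficiency is mine.

```python
def speedup(nodes: int, efficiency: float) -> float:
    """Speedup over a single node at a given parallel efficiency."""
    return nodes * efficiency

gige_4 = speedup(4, 0.90)   # measured: four nodes on gigabit ethernet
ib_4 = speedup(4, 0.93)     # guessed: same four nodes on Infiniband
gige_5 = speedup(5, 0.90)   # ASSUMPTION: efficiency stays at 90% with a fifth node

gain_ib = ib_4 / gige_4 - 1.0      # relative gain from the faster interconnect
gain_node = gige_5 / gige_4 - 1.0  # relative gain from one more node

print(f"Infiniband instead of GigE: {gain_ib:+.1%}")
print(f"Fifth node on GigE:         {gain_node:+.1%}")
```

With constant efficiency the fifth node gives about 25%; if efficiency dropped to roughly 86% at five nodes, the gain would be about the ~20% quoted above. Either way it is far larger than the ~3% from the interconnect.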
AMD just is not competitive right now. With traditional CFD on unstructured grids, performance is dominated by memory bandwidth, memory latency and caching... all of which are areas where Intel has a significant advantage. Clock speed doesn't really matter: I overclocked my machines from 3.4 GHz to 4.0 GHz and only saw a tiny speedup.
Regardless of per-core licensing issues, if you have a fixed amount of money to spend then buying Intel systems will give you the fastest cluster.
All of this only holds true for traditional CFD on unstructured meshes. If you are using structured meshes or a Lattice Boltzmann code like Exa, then AMD likely DOES make sense.
Another point to consider is energy consumption. My privately owned AMD CPU is slower than the Xeon in my workstation and needs more energy. This is no issue as long as it doesn't run for a long time, but when it's up 24/7 and under full load, it makes a huge difference. In Germany, it makes a difference of 50 bucks on the electricity bill per node in just a year. On top of that, the AMD would need to run at least 20% longer to get the same results.
It's a shame, as I don't like the total market control and pricing policy of Intel - but at least at the moment, AMD can't compete with the power and efficiency of Intel CPUs.
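To get a feeling for what such a yearly difference means in watts, here is a small sketch. The 50 per year is the figure from the post; the electricity price of 0.25 per kWh is purely a hypothetical placeholder of mine, so plug in your own tariff.

```python
HOURS_PER_YEAR = 24 * 365  # node is up 24/7 under full load

def extra_power_watts(extra_cost_per_year: float, price_per_kwh: float) -> float:
    """Average extra power draw implied by an extra yearly electricity cost."""
    extra_kwh = extra_cost_per_year / price_per_kwh
    return extra_kwh / HOURS_PER_YEAR * 1000.0

# 50.0 is from the post; 0.25 per kWh is a HYPOTHETICAL price, not a quoted one.
watts = extra_power_watts(50.0, 0.25)
print(f"Implied extra draw per node: {watts:.1f} W")
```

Note that this only covers the difference in power draw; the extra run time for the same results comes on top.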
I recently had the chance to run a little benchmark comparing a two-socket Xeon X5675 (24 cores, 3.06 GHz) with the new AMD Opteron 6274 (32 cores, 2.1 GHz). I ran the DLR turbomachinery solver TRACE on a multi-block mesh of an axial compressor stage. The OS was openSUSE 12.1 in both cases, with Open MPI for parallelization.
The results at a glance
machine     jobs   cores per job   timesteps/minute (over all jobs)
XeonX5675   3      4               30.57
XeonX5675   3      8               33.93
XeonX5675   4      6               34.09
Opt6274     4      4               26.79
Opt6274     4      8               37.57
The main conclusions (from my perspective)
- Hyperthreading on the Xeon is only effective in the case of imperfect load balancing, at least for this number-crunching-intensive code.
- Sharing one FPU between two cores on the Opteron system is the better deal for CFD: the test with 4*8 cores is about 40% faster than the one with 4*4 cores (one FPU per process).
- The Opteron is the better deal, especially as a four-socket system with an InfiniBand interconnect, which results in much lower hardware costs.
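The ratios behind these conclusions can be recomputed directly from the numbers in the table. A minimal sketch (the dictionary layout and the helper name `gain` are just my choice for illustration):

```python
# timesteps/minute summed over all jobs, keyed by (machine, jobs, cores per job)
results = {
    ("XeonX5675", 3, 4): 30.57,
    ("XeonX5675", 3, 8): 33.93,
    ("XeonX5675", 4, 6): 34.09,
    ("Opt6274", 4, 4): 26.79,
    ("Opt6274", 4, 8): 37.57,
}

def gain(new, base):
    """Relative throughput gain of configuration `new` over `base`."""
    return results[new] / results[base] - 1.0

# Opteron: 4 jobs x 8 cores versus 4 jobs x 4 cores
opteron_32_vs_16 = gain(("Opt6274", 4, 8), ("Opt6274", 4, 4))
# Xeon: 3 jobs x 8 cores versus 3 jobs x 4 cores
xeon_24_vs_12 = gain(("XeonX5675", 3, 8), ("XeonX5675", 3, 4))
# Best Opteron run versus best Xeon run
opteron_vs_xeon = gain(("Opt6274", 4, 8), ("XeonX5675", 4, 6))

for name, value in [("Opteron 4*8 vs 4*4", opteron_32_vs_16),
                    ("Xeon 3*8 vs 3*4", xeon_24_vs_12),
                    ("Opteron best vs Xeon best", opteron_vs_xeon)]:
    print(f"{name}: {value:+.1%}")
```

The first ratio is the "about 40%" mentioned above; the last one shows the whole Opteron machine coming out roughly 10% ahead of the whole Xeon machine in this particular test.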
Dear Mr. Siller
Just to make sure I understand your benchmark correctly: you ran three or four distinct cases at once, together utilizing all cores available to the system.
Could it be that the results would look different if you used all cores for one job (and made sure that no processor switches happen, which empty the INT/CMD/FPU pipelines)? (And yes, I agree, HT is not relevant for CFD.)
I am asking because I have to make the decision between Opteron 62XX and E5-26YY, and there are different aspects to consider. From the benchmarks
ROMS and WRFv3 are interesting for CFD applications, while
it seems to me that the 6174 processor can only win in certain rather artificial situations. If anything, you need to consider the 6276 as the direct E5-26YY competitor.
Hi Mr. Skillas,
You are right: I started the same computation n times on the machine and measured the time to finish a specific number of timesteps. While for the Interlagos and the Xeon without HT all runs finished at nearly the same time, the HT-on case had very different running times (differing by up to 10%).
My little benchmark is far from answering even the most important questions in the matrix of factors relevant to parallel computing.
We had the following strategy to answer the question:
- We have no core-based licensing issue with our CFD solver - that simplifies a lot.
- Comparing the hardware costs of a Xeon-based 2-socket server and an Interlagos 4-socket server (both with an IB interconnect), we came up with approx. half the hardware costs per core for the AMD system - the lower clock speed of the AMD is already included.
Last week we received our HPC cluster from Delta Computer GmbH (Hamburg) and we are now looking forward to testing again in-house :).
Ulrich, it would be great if you could keep us informed about what you find. I am particularly interested in seeing how well your application scales on a node compared to how well it scales across nodes. There seems to be quite a lot of uncertainty about whether it is really better to run with many cores on a motherboard (call it pure shared memory), or if it is faster to have more nodes, but not so many cores per motherboard.