Fluent benchmarks on new Intel CPUs
I am currently evaluating hardware for running Fluent and came across this very impressive "World record" from Sun.
Unfortunately this is a rather impressive piece of marketing baloney - perhaps I am a bit harsh - but it does deserve a closer look (comment on the blog is closed or I would have taken it up with the author directly).
In my world we have to pay for each box and each CPU and each core. So when we run an application we want every core to work. I speculate that this benchmark was achieved by running 1 core (out of 4 in the box) on 4 different machines connected via Infiniband - leaving 12 cores useless/inactive. The 8 core result is achieved by using 2 (out of the 4) on each of the 4 boxes - leaving 8 inactive cores. This produces impressive scaling figures and high numbers - with no use for real world (my world) applications.
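To make the arithmetic behind that speculation explicit, here is a minimal sketch. The layout (4 boxes with 4 cores each) is my guess as stated above, not something Sun published, and the function name is made up for illustration.

```python
def idle_cores(boxes, cores_per_box, cores_used):
    """Cores that are paid for but sit idle during a run.

    Assumes the job is spread evenly over all boxes, which is the
    speculated layout and not a confirmed one.
    """
    total = boxes * cores_per_box
    if cores_used > total:
        raise ValueError("cannot use more cores than are installed")
    return total - cores_used


# Speculated setup: 4 boxes, 4 cores per box.
for used in (4, 8, 16):
    per_box = used // 4
    print(f"{used:2d}-core run: {per_box} core(s) per box, "
          f"{idle_cores(4, 4, used)} idle")
```

Under that assumption the 4-core run leaves 12 cores idle, the 8-core run leaves 8, and only the 16-core run loads every box fully.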
Perhaps if the author had just reported his benchmark in the standard way, where the number of boxes or chips is included (e.g. on the Fluent website or for spec.org CPU benchmarks), he would have been able to give us something useful.
More worrying to me are the results for 16 cores - a closer look shows a 54% parallel scaling efficiency. So when all boxes are fully loaded, the performance is miserable.
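For anyone who wants to check scaling figures themselves, this is a small sketch of how parallel efficiency is usually computed from benchmark ratings (jobs per day, higher is better). The ratings in the example are placeholders I picked so that the result comes out at 54%; they are not the numbers from the Sun post.

```python
def parallel_efficiency(rating_n, rating_base, cores_n, cores_base=1):
    """Achieved speedup divided by ideal (linear) speedup.

    rating_* are throughput-style ratings, so a higher rating means
    a faster run. 1.0 means perfect linear scaling.
    """
    speedup = rating_n / rating_base
    ideal = cores_n / cores_base
    return speedup / ideal


# Placeholder ratings, NOT the published ones:
# rating 100 on 1 core, rating 864 on 16 cores.
eff = parallel_efficiency(rating_n=864, rating_base=100, cores_n=16)
print(f"efficiency on 16 cores: {eff:.0%}")
```

Note that the result depends heavily on the baseline. If the baseline run has the box to itself (1 core active out of 4), the small core counts look better than they would on a fully loaded box.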
So far my search for the right solution has been time-consuming and frustrating. The unavailability of Barcelona benchmarks (spanning more than one box) and the unavailability of the CPUs in general point towards 2 options: (1) New Intel technology (Harpertown/Wolfdale) with its memory bandwidth bottleneck or (2) the older AMD dual cores. Pricing seems to indicate that option (2) is better because the high-end CPUs (3.0GHz and 3.2GHz) are ridiculously expensive.
In summary: For smaller parallel jobs (up to 16 cores) the Intel CPUs are best, but at larger sizes (40+ cores) it seems that AMD remains king - even with lower CPU clockspeeds. If anyone can add alternative ways to look at this, it would be greatly appreciated.
May convergence be with you.
DISCLAIMER: I am not affiliated with any hardware or software vendors.
Re: Fluent benchmarks on new Intel CPUs
I just built a 16 core rig (8x dual core Wolfdale 45 nm E8400s) running 64-bit Linux with a standard GigE interconnect.
Running wind tunnel DES sims gives excellent per core utilisation of 95-100%. It was absolutely CRUCIAL to use a suitable partition method (principal Z axis in my case) which cuts the domain like a sliced salami. Metis gave appalling scaling.
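To show what the salami slicing does to the communication pattern, here is a toy sketch in plain Python. It is not Fluent's partitioner; the function names and the little box mesh are invented for illustration. It cuts a mesh into equal-count slabs along Z and counts how many other partitions each slab has to talk to.

```python
from collections import defaultdict


def slab_partition(z_centroids, n_parts):
    """Assign cells to n_parts slabs of near-equal cell count along Z."""
    order = sorted(range(len(z_centroids)), key=lambda c: z_centroids[c])
    part = [0] * len(z_centroids)
    for rank, cell in enumerate(order):
        part[cell] = rank * n_parts // len(order)
    return part


def partition_neighbors(part, faces):
    """Map each partition to the set of partitions it shares faces with."""
    nbrs = defaultdict(set)
    for a, b in faces:
        if part[a] != part[b]:
            nbrs[part[a]].add(part[b])
            nbrs[part[b]].add(part[a])
    return nbrs


def box_mesh(nx, ny, nz):
    """Toy structured mesh: Z centroids and face-adjacent cell pairs."""
    def idx(i, j, k):
        return (k * ny + j) * nx + i

    z = [0.0] * (nx * ny * nz)
    faces = []
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                c = idx(i, j, k)
                z[c] = k + 0.5
                if i + 1 < nx:
                    faces.append((c, idx(i + 1, j, k)))
                if j + 1 < ny:
                    faces.append((c, idx(i, j + 1, k)))
                if k + 1 < nz:
                    faces.append((c, idx(i, j, k + 1)))
    return z, faces


# A long, thin "tunnel": 4 x 4 cells in cross-section, 16 cells along Z.
z, faces = box_mesh(4, 4, 16)
part = slab_partition(z, 8)
nbrs = partition_neighbors(part, faces)
max_nbrs = max(len(s) for s in nbrs.values())
print("most neighbours any partition has:", max_nbrs)
```

With slabs, no partition ever has more than two neighbours (one on each side), no matter how many slabs you cut. A graph partitioner is free to produce pieces that touch many others, which means more messages per iteration over a high-latency link.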
Some basics: Don't even think of using Intel quads ... scaling is appalling for 3+ cores. Look at the benchmarks on Fluent's site. Barcelona is a dud. Its performance at 2.2-2.4 GHz is piss poor even if it (hopefully) will scale well. My 3.8 GHz dual core Wolfdales (3 GHz stock OCed to 3.8 GHz) would run rings around any quad Barcelona. Intel dual cores scale very well over ethernet IF you are very thorough in evaluating the various partitioning methods. In my experience minimizing the number of partition neighbors is much more important than minimizing the interface cell ratios. If you are aiming for < 24 cores I wouldn't bother with a low latency interconnect unless you know your problems will be difficult to partition efficiently. For a dedicated cluster, stringing together Wolfdales is cheaper and faster than stringing together Harpertown dual socket systems.
CFX's parallel performance is much better than Fluent's. With Fluent you really have to configure the problem well to get good scaling. With CFX you have to really bungle things to get poor scaling. For the exact same mesh and case definition CFX gave me ~100% scaling efficiency over 16 cores, Fluent's was ~70% ... which corresponds to Fluent's published benchmarks.
On the horizon, Intel's upcoming Nehalem looks to be godlike in terms of raw speed and multicore scaling.