CFD Online Discussion Forums

soni7007 April 11, 2012 08:48

Fastest Processor for CFD
 
I am currently using an Intel Nehalem processor (W5580) for my CFD calculations. It was the fastest processor when I bought it a few years back. I need to upgrade now. Is there a faster processor available now? Which is the best and the fastest? Any benchmarks?

kyle April 11, 2012 11:53

The fastest Xeon E5 is probably the fastest single chip right now. It should be quite a bit faster than the old Nehalems because the memory bandwidth is so much higher.

CapSizer April 12, 2012 07:40

You can go and look at the SPECfp_rate figures on spec.org. These benchmarks are not necessarily fully representative of the performance you can expect in CFD, but they do give a reasonable indication, especially within a processor family. As Kyle has pointed out, if you can afford them, the Xeon E5 processors are almost certainly the quickest ones available at the moment. Do not underestimate the significance of this step up in memory bandwidth. In many CFD codes, it seems to be memory more than anything else that determines speed, and the higher-end E5 processors are monsters in this regard (roughly 50 GB/s vs. 20-30 GB/s for the older-generation chips).

evcelica April 12, 2012 08:51

I have found the memory controller in the Sandy Bridge-E chips, like the Xeon E5, to not be capable of pushing the memory anywhere close to 50 GB/s. On my 3930K with 2133 MHz RAM I'm seeing more like 18 GB/s in synthetic benchmarks, and less in ANSYS Mechanical applications. My i7-2600K actually gives me higher memory bandwidth, and it's only dual-channel 1600 MHz, which is very disappointing.

Regardless, the Sandy Bridge-E chips are still probably the fastest single chips, with the Xeon E5-2687W being the fastest of the line. Though for the price of that one CPU, you could build a 4-node cluster of 2500Ks or 2600Ks and mop the floor with the Xeon.

I went with the 3930K (its Xeon twin is the E5-1650) as it seemed it would be very close in performance to the 2687W but costs less than one-third the price.

There are a lot of "Euler3D" results out there (Euler3D is a CFD benchmark). They show a single 3960X beating the CFD performance of two Xeon W5580s, and a dual E5-2687W system just destroying everything:
http://www.tomshardware.com/reviews/...ew,3149-9.html

kyle April 12, 2012 09:42

Quote:

Originally Posted by evcelica (Post 354376)
I have found the memory controller in the Sandy Bridge-E chips, like the Xeon E5, to not be capable of pushing the memory anywhere close to 50 GB/s. On my 3930K with 2133 MHz RAM I'm seeing more like 18 GB/s in synthetic benchmarks, and less in ANSYS Mechanical applications. My i7-2600K actually gives me higher memory bandwidth, and it's only dual-channel 1600 MHz, which is very disappointing.

Are you sure you have the RAM installed correctly on your 3930K system? It sounds like you are only using two of the four memory channels. You need at least four sticks of RAM, and they have to be installed in the correct slots (assuming you have a motherboard with more than four RAM slots).

I haven't had a chance to play with an X79 system, but all the benchmarks I have seen show the memory bandwidth over 40 GB/s with just 1600 MHz RAM.
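Not the synthetic benchmark quoted above, but for anyone who wants a quick sanity check on their own box, here is a minimal copy-bandwidth sketch in Python (assuming NumPy is available; the array size and repetition count are arbitrary). It is single-threaded, so it will understate what a fully loaded multi-channel system can sustain; a proper multi-threaded STREAM run is the right tool for real numbers.

```python
import time
import numpy as np

# Rough copy-kernel bandwidth estimate. Single-threaded NumPy, so it
# understates the peak of a multi-channel system; treat the result as a
# floor, not a measurement of the memory controller's limit.
N = 100_000_000                      # ~800 MB per array, far larger than any cache
src = np.random.rand(N)
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)              # one read + one write per element
elapsed = time.perf_counter() - t0

bytes_moved = reps * 2 * N * 8       # 8 bytes per double, read + write
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s effective copy bandwidth")
```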

evcelica April 12, 2012 13:36

I've got 8 DIMMs installed and recognized as 32 GB. I've seen benchmarks confirming that Sandy Bridge is faster than Sandy Bridge-E; I think it depends on the benchmarking software. Check these benchmarks:
http://compare-processors.com/forum/...tion-t225.html
I've read that the four channels were meant to increase memory capacity rather than memory speed, but I don't know for sure.

In real-world tests using ANSYS 14.0 for a buckling analysis, the 2600K showed a 14 GB/s I/O rate, while the 3930K showed only 12 GB/s. As for solve time: with only 2 cores the 2600K was slightly faster, and with 3 cores the 3930K was slightly faster, by a couple of percent in each case.

In other, less I/O-bound problems, the 3930K is incredibly fast. Check out this ANSYS benchmark where I absolutely destroyed everything, even dual-Xeon systems using Tesla GPUs: http://www.padtinc.com/support/bench...07/default.asp

TMG April 13, 2012 13:25

Any conclusions drawn from running Ansys on structural problems in the context of a CFD forum are not terribly informative. We run Ansys (and Abaqus and Nastran) on structural problems and STAR-CCM+ for flow and the types of hardware we've found best for each are very different. If you have Ansys licenses, why don't you run Fluent or CFX benchmarks? They would be much more applicable to this particular group.

CapSizer April 13, 2012 14:51

TMG, could you elaborate a bit on the statement " ... the types of hardware we've found best for each are very different ..." ? It is what one would expect, but it would be very valuable if you were prepared to share some more of what you have learnt.

TMG April 13, 2012 15:18

We haven't done exhaustive comparisons, but typically we run structural codes on single shared-memory machines with maybe 4 CPUs/24 cores (the fastest Xeons du jour), as much memory as we can stuff in, and a RAID-based disk array to speed up I/O by striping. With the structural codes, if you don't have really fast I/O, it doesn't matter much how good the rest is. They aren't typically memory-bandwidth driven.

STAR-CCM+ runs on typical InfiniBand-based clusters with 2 CPUs/blade (12 cores with Westmere, soon to be 16 with Sandy Bridge), 24 GB/blade (32 with Sandy Bridge), local disk for swap and OS only (no local data), and a big 10GigE-based parallel file system. When we've tried to run structural codes on clusters, we've had to beef up the memory on each blade (double or triple it) and also install a RAID 0 pair of drives on each blade for scratch I/O.

CapSizer April 13, 2012 17:39

Thanks for that info, TMG. Did you guys find that it helped to use SSDs for high-speed I/O? Also, your blades have quite a lot of cores... did you ever compare performance between fully loading up each blade and spreading the load out? For example, if you were going to run, say, 48 cores, would it be quicker to use 4 fully loaded blades, or to spread it out over 24 blades, using only 1 core per CPU?

I'm trying to get a feel for the thin-node (fewer cores per box, with a good IB interconnect) vs. fat-node (a small number of nodes, but many cores per node, relying on on-motherboard HyperTransport links rather than fast, low-latency links between the nodes) question.

TMG April 13, 2012 20:27

I haven't ever tried using SSDs, but it certainly sounds like a logical thing to do. You probably don't have to write the solution to them, but if you can direct all the scratch files to SSDs, it makes sense. I don't know enough about the speeds and feeds of SSDs vs. multiple high-speed drives, but if it's faster, I would bet it would help.
As far as spreading out the load (I assume you mean for CFD?) - on a Westmere I find that 6 cores are between 5 and 6 times faster than 1 core. Running on 8 out of 12 cores on each blade is just about as good as running on 1 core on 8 separate blades. I've never seen a configuration with more than 2 CPUs in 1 blade or node work well (for flow), but I have no problem with 2 CPUs. There are so many ways to look at this, and part of it has to do with how your licenses work. The STAR-CCM+ power licenses don't charge per core, so you don't pay a license penalty for adding another core; even if it doesn't scale perfectly, it's worth it as long as it's faster than not adding it. If you pay a lot for each additional core, then you have to look for ways to get the most out of each one, and maybe that means leaving some cores empty.
The other thing to think about is that the more cores you have in each box, the more important it is to have high InfiniBand bandwidth. 16 cores all trying to talk to other blades at the same time must stress your interconnect bandwidth more than 6 or 8 (or 1). With lower core counts, you probably have to worry about latency more than bandwidth. I guess it's not much of a distinction, as everyone is selling QDR if not FDR anyway.
For what it's worth, the Sandy Bridge CPUs, with the 4th memory channel, really look good to us. We've seen some tests, and there is no way they could get good scaling out to all 8 cores if that channel didn't mean more bandwidth.

soni7007 April 20, 2012 09:06

Thanks to all for your inputs. Kyle, CapSizer & evcelica...Your suggestions were most valuable...Thank you again!!

evcelica April 20, 2012 19:03

OK, so I ran the CFX benchmark file. It's in the Program Files/ANSYS/v14/CFX/Examples folder. Instead of reporting wall time, which rounds to the second, I used average processor time. Results for 1-6 threads are as follows.

Avg. processor time and scaling:

Threads   Avg. CPU time (s)   Speedup
1         26.40               1.00
2         15.61               1.69
3         11.06               2.39
4          8.99               2.94
5          7.88               3.35
6          7.39               3.57

The processor was a 3930K overclocked to 4.4 GHz, with 2133 MHz RAM at 9-11-10-28-2 timings. It would be great if others could run this benchmark. Perhaps I could run different RAM timings to see their effect as well.
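For anyone tabulating their own runs, a minimal sketch in Python of how the speedup figures above are derived from the single-thread baseline (the timing values are just the ones posted above; the efficiency column is the extra piece of arithmetic):

```python
# Speedup and parallel efficiency relative to the 1-thread baseline.
# Times are the average processor times reported above (seconds).
times = {1: 26.40, 2: 15.61, 3: 11.06, 4: 8.99, 5: 7.88, 6: 7.39}

baseline = times[1]
print(f"{'Threads':>7} {'Time (s)':>9} {'Speedup':>8} {'Efficiency':>10}")
for n, t in sorted(times.items()):
    speedup = baseline / t
    efficiency = speedup / n           # 1.0 (100%) would be perfect scaling
    print(f"{n:>7} {t:>9.2f} {speedup:>8.2f} {efficiency:>10.0%}")
```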

TMG April 21, 2012 20:39

evcelica - I want to thank you for running the CFX benchmarks. It's appreciated. I think what you see is what I'd expect. You can't scale 6x on 6 processors because CFX, like every other unstructured finite-volume CFD code, is memory-bandwidth limited. I would bet that if you stopped the overclocking you wouldn't see much of a difference in real time (wall time) either. In fact, I would have reported wall time, because in the end the only thing that counts with multiple processors is how fast you get out the other side in real time. I'm not sure if average processor time measures the same thing. Anyway - thank you for reporting this. Was this on a Sandy Bridge or a Westmere?

evcelica April 24, 2012 00:38

This was with Sandy Bridge-E. The reason I reported average processor time was that the wall time appears to be rounded to the nearest second, or truncated. Both 5 and 6 processors gave me a wall time of 7.000 seconds, but I had to look at the average CPU time to see that one was 7.8 and the other was 7.3.
I'll try some different overclocks and memory speeds to see what the differences are.

cfpaters April 30, 2012 10:07

License costs drive efficiency needs
 
For all you ANSYS users, be sure to check your core efficiency! We installed a small cluster running MS HPC 2008 R2 at my company. When we were looking at benchmarks, we noticed that 6-core Westmeres were only 25% faster than 4-core Westmeres. You would expect a 50% increase going by core count alone. We went with E5620s and are extremely happy. According to Intel's chip site, this is the best $/MHz chip in the Westmere line. It isn't the fastest by far, but you don't need to pay a price premium for negligible gains. You also pay more per core for a 6-core architecture and up.

With ANSYS and other commercial codes, you pay by the core. Don't waste your money licensing inefficient cores. Yes, the 6-core version was a bit faster than the 4-core version. However, if you have a fast interconnect, you'd get better performance from three 4-core chips than from two 6-core chips, and you'd pay about the same in license costs. Yes, with ANSYS you'd be at 8, 32, 128… core runs with fully loaded HPC Packs.
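To make the per-licensed-core argument concrete, here is a back-of-the-envelope sketch in Python. The only input taken from the post above is the rough "25% faster" figure for the 6-core vs. 4-core Westmere; the baseline of 1.0 and the function itself are illustrative, not measured data:

```python
# Throughput per licensed core: if licenses are priced per core, this is
# roughly what each licensed core buys you on each chip.
def throughput_per_core(relative_speed, cores):
    return relative_speed / cores

four_core = throughput_per_core(1.00, 4)   # 4-core Westmere as the baseline
six_core  = throughput_per_core(1.25, 6)   # ~25% faster overall, per the post above

print(f"4-core: {four_core:.3f} relative throughput per licensed core")
print(f"6-core: {six_core:.3f} relative throughput per licensed core")
print(f"6-core does {1 - six_core / four_core:.0%} less work per licensed core")
```

On those numbers, each licensed core on the 6-core part does roughly 17% less work, which is the efficiency point being made.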

Recently we ran an ANSYS Mechanical benchmark against a Cray at a local university. The clusters are the same age. Our 8 cores solved in 113 minutes. On the Cray, 8 cores took 172 minutes and 24 cores took 132 minutes. Our Lizard tests show 95.8% efficiency per node.



Has anyone seen data for Sandy Bridge showing core efficiency in ANSYS? Back-to-back benchmarks of the 4-, 6-, and 8-core versions would show whether they've opened up the pipelines enough to maintain scaling efficiency.

evcelica April 30, 2012 18:58

Here's what I've found from the available benchmarks for Mechanical and CFD:

Mechanical Benchmark
Using PADT Benchmark #7 - Normalized to 120 iterations

3930K at 4.6 GHz (Sandy Bridge-E), 2133 MHz RAM

Cores   Wall time   Speedup            Core efficiency
2       2985        1.00 (baseline)    100% (baseline)
4       1972        1.51               75.7%
6       1763        1.69               56.4%

2600K at 4.8 GHz (Sandy Bridge), 1600 MHz RAM

Cores   Wall time   Speedup            Core efficiency
2       3609        1.00 (baseline)    100% (baseline)
4       3299        1.09               54.7%


CFD Benchmark
Using supplied CFX Benchmark

3930K at 4.4 GHz (Sandy Bridge-E), 2133 MHz RAM

Cores   Wall time (s)   Speedup            Core efficiency
2       15.61           1.00 (baseline)    100% (baseline)
4        8.99           1.74               86.8%
6        7.39           2.11               70.41%

I don't have the 4-core chip anymore, so I can't run the CFX benchmark on it.

