|
April 11, 2012, 09:48 |
Fastest Processor for CFD
|
#1 |
New Member
CFDLife
Join Date: Apr 2009
Posts: 7
Rep Power: 16 |
I am currently using an Intel Nehalem processor (W5580) for my CFD calculations. It was the fastest processor when I bought it a few years back. I need to upgrade now. Is there a faster processor available now? Which is the best and fastest? Any benchmarks?
|
|
April 11, 2012, 12:53 |
|
#2 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
The top-end Xeon E5 is probably the fastest single chip right now. It should be quite a bit faster than the old Nehalems because the memory bandwidth is so much higher.
|
|
April 12, 2012, 08:40 |
|
#3 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
You can go and look at the SPECfp_rate figures on spec.org. These benchmarks are not necessarily fully representative of the performance that you can expect in CFD, but they do give a reasonable indication, especially within a processor family. As Kyle has pointed out, if you can afford them, the Xeon E5 processors are almost certainly the quickest ones available at the moment. Do not underestimate the significance of this step up in memory bandwidth. In many CFD codes, it seems to be memory more than anything else that determines speed, and the higher-priced E5 processors are monsters in this regard (50 GB/s vs. 20 or 30 for the older generation chips).
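As an aside, the memory-bandwidth point is easy to sanity-check on your own machine. The snippet below is a rough NumPy sketch of a STREAM-style "triad" measurement, not the official STREAM benchmark; the array size (chosen to dwarf the CPU caches) and the two-reads-plus-one-write traffic model are assumptions.

```python
# Rough STREAM-style "triad" sketch in NumPy -- an illustration only,
# not the official STREAM benchmark. The array size and the traffic
# model (2 reads + 1 write per element) are assumptions.
import time
import numpy as np

n = 5_000_000                     # ~40 MB per float64 array, well past cache
a = np.zeros(n)
b = np.ones(n)
c = np.full(n, 2.0)

t0 = time.perf_counter()
a[:] = b + 3.0 * c                # triad kernel: a = b + scalar * c
dt = time.perf_counter() - t0

gb_per_s = 3 * n * 8 / dt / 1e9   # 2 reads + 1 write, 8 bytes each
print(f"approximate sustained bandwidth: {gb_per_s:.1f} GB/s")
```

A single pass is noisy; the real STREAM benchmark repeats the kernel and reports the best of several trials, so treat one run as a ballpark figure only.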
|
|
April 12, 2012, 09:51 |
|
#4 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Rep Power: 23 |
I have found the memory controller in the Sandy Bridge-E chips, like the Xeon E5, to not be capable of pushing the memory anywhere close to 50 GB/s. On my 3930K with 2133 MHz RAM I'm seeing more like 18 GB/s in synthetic benchmarks, and less in ANSYS Mechanical applications. My i7-2600K actually gives me higher memory bandwidth, and it's only dual-channel 1600 MHz, which is very disappointing.
Regardless, the Sandy Bridge-E chips are still probably the fastest single chips, with the Xeon E5-2687W being the fastest of the line. Though for the price of that one CPU, you could build a 4-node cluster of 2500Ks or 2600Ks and mop the floor with the Xeon. I went with the 3930K (its Xeon twin is the E5-1650) as it seemed it would be very close in performance to the 2687W but costs less than one third the price. There are a lot of Euler3D results out there (Euler3D is a CFD benchmark). They show a single 3960X beating the CFD performance of two Xeon W5580s, and a dual E5-2687W system just destroying everything: http://www.tomshardware.com/reviews/...ew,3149-9.html |
|
April 12, 2012, 10:42 |
|
#5 |
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
Rep Power: 18 |
Quote:
I haven't had a chance to play with an X79 system, but all the benchmarks I have seen show the memory bandwidth over 40 GB/s with just 1600 MHz RAM. |
|
April 12, 2012, 14:36 |
|
#6 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Rep Power: 23 |
I've got 8 DIMMs installed and recognized as 32 GB. I've seen benchmarks confirming Sandy Bridge is faster than Sandy Bridge-E; I think it depends on the benchmarking software. Check these benchmarks:
http://compare-processors.com/forum/...tion-t225.html

I've read that the four channels were meant not to increase memory speed but to increase memory capacity. I don't know, though. In real-world tests using ANSYS 14.0 for a buckling analysis, the 2600K showed a 14 GB/s I/O rate, and the 3930K showed only a 12 GB/s I/O rate. As far as solve time: with only 2 cores the 2600K was just slightly faster, and with 3 cores the 3930K was slightly faster, by a couple percent in each case.

In other, less I/O-bound problems, the 3930K is incredibly fast. Check out this ANSYS benchmark where I absolutely destroyed everything, even dual-Xeon systems using Tesla GPUs: http://www.padtinc.com/support/bench...07/default.asp

Last edited by evcelica; April 12, 2012 at 14:55. |
|
April 13, 2012, 14:25 |
|
#7 |
Member
Join Date: Mar 2009
Posts: 44
Rep Power: 17 |
Any conclusions drawn from running Ansys on structural problems in the context of a CFD forum are not terribly informative. We run Ansys (and Abaqus and Nastran) on structural problems and STAR-CCM+ for flow and the types of hardware we've found best for each are very different. If you have Ansys licenses, why don't you run Fluent or CFX benchmarks? They would be much more applicable to this particular group.
|
|
April 13, 2012, 15:51 |
|
#8 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
TMG, could you elaborate a bit on the statement " ... the types of hardware we've found best for each are very different ..." ? It is what one would expect, but it would be very valuable if you were prepared to share some more of what you have learnt.
|
|
April 13, 2012, 16:18 |
|
#9 |
Member
Join Date: Mar 2009
Posts: 44
Rep Power: 17 |
We haven't done exhaustive comparisons, but typically we run structural codes on single shared-memory machines with maybe 4 CPUs/24 cores (the fastest Xeons du jour), as much memory as we can stuff in, and a RAID-based disk array to speed I/O by striping. With the structural codes, if you don't have really fast I/O, it doesn't matter much how good the rest is. They aren't typically memory-bandwidth driven.
STAR-CCM+ runs on typical InfiniBand-based clusters with 2 CPUs/blade (12 cores with Westmere, soon to be 16 with Sandy Bridge), 24 GB/blade (32 with Sandy Bridge), local disk for swap and OS only (no local data), and a big 10 GigE based parallel file system. When we've tried to run structural codes on clusters, we've had to beef up the memory on each blade (double or triple it), and we also had to install a RAID 0 pair of drives on each blade for scratch I/O. |
|
April 13, 2012, 18:39 |
|
#10 |
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Rep Power: 18 |
Thanks for that info, TMG. Did you guys find that it helped to use SSDs for high-speed I/O? Also, your blades have quite a lot of cores ... did you ever compare performance between fully loading up each blade and spreading the load out? For example, if you were going to run, say, 48 cores, would it be quicker to use 4 fully loaded blades, or to spread the job over 24 blades, using only 1 core per CPU?
I'm trying to get a feel for the thin-node (fewer cores per box, with a good IB interconnect) vs. fat-node (low number of nodes, but many cores per node, relying on on-motherboard HyperTransport links rather than fast low-latency links between the nodes) question. |
|
April 13, 2012, 21:27 |
|
#11 |
Member
Join Date: Mar 2009
Posts: 44
Rep Power: 17 |
I haven't ever tried using SSDs, but it certainly sounds like a logical thing to do. You probably don't have to write the solution to them, but if you can direct all the scratch files to SSD, it makes sense. I don't know enough about the speeds and feeds of SSDs vs. multiple high-speed drives, but if it's faster, I would bet it would help.
As far as spreading out the load (I assume you mean for CFD?): on a Westmere I find that 6 cores are between 5 and 6 times faster than 1 core. Running on 8 out of 12 cores on each blade is just about as good as running on 1 core on 8 separate blades. I've never seen a configuration with more than 2 CPUs in 1 blade or node work well (for flow), but I have no problem with 2 CPUs.

There are so many ways to look at this, and part of it has to do with how your licenses work. The STAR-CCM+ power licenses don't charge per core, so you don't pay a license penalty for adding another core; even if it doesn't scale perfectly, it's worth it as long as it's faster than not adding it. If you pay a lot for each additional core, then you have to look for ways to get the most out of each one, and maybe that means leaving some empty.

The other thing to think about is that the more cores you have in each box, the more important it is to have higher InfiniBand bandwidth. 16 cores all trying to talk to other blades at the same time must stress your interconnect pipe bandwidth more than 6 or 8 (or 1). With lower core counts, you probably have to worry about latency more than bandwidth. I guess it's not much of a distinction, as everyone is selling QDR if not FDR anyway.

For what it's worth, the Sandy Bridge CPUs, with the 4th memory channel, really look good to us. We've seen some tests, and there is no way they could get good scaling to all 8 cores if that didn't mean more bandwidth. |
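The observation that 6 cores give roughly 5-6x can be framed with Amdahl's law: even a small serial (or bandwidth-bound) fraction caps the speedup. The sketch below is illustrative only; the candidate serial fractions are assumed values, not measurements from any of these systems.

```python
# Amdahl's-law sketch: which serial fraction s is consistent with a
# 5-6x speedup on 6 cores? The s values tried here are assumptions.
def speedup(s: float, p: int) -> float:
    """Predicted speedup on p cores with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / p)

for s in (0.00, 0.02, 0.05, 0.10):
    print(f"serial fraction {s:.2f}: 6-core speedup {speedup(s, 6):.2f}x")
```

A serial fraction of just 2% already pulls the 6-core speedup down to about 5.5x, which is in the range reported above.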
|
April 20, 2012, 10:06 |
|
#12 |
New Member
CFDLife
Join Date: Apr 2009
Posts: 7
Rep Power: 16 |
Thanks to all for your inputs. Kyle, CapSizer & evcelica...Your suggestions were most valuable...Thank you again!!
|
|
April 20, 2012, 20:03 |
|
#13 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Rep Power: 23 |
OK, so I ran the CFX benchmark file. It's in the Program Files/ANSYS/v14/CFX/Examples folder. Instead of showing wall time, which rounds to the second, I used average processor time. Results for 1-6 threads are as follows.
Cores | Avg processor time | Scaling
1 | 26.40 s | 1.00
2 | 15.61 s | 1.69
3 | 11.06 s | 2.39
4 | 8.99 s | 2.94
5 | 7.88 s | 3.35
6 | 7.39 s | 3.57

Processor was a 3930K overclocked to 4.4 GHz, with 2133 MHz RAM @ 9-11-10-28-2 timings. It would be great if others could run this benchmark. Perhaps I could run different RAM timings to see their effect as well. |
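The scaling figures above translate directly into per-core parallel efficiency (speedup divided by core count). A quick recomputation from the reported times; the input numbers are Erik's, and only the efficiency column is derived here:

```python
# Speedup and parallel efficiency from the average processor times
# reported above (3930K @ 4.4 GHz, CFX example benchmark).
times = {1: 26.40, 2: 15.61, 3: 11.06, 4: 8.99, 5: 7.88, 6: 7.39}
base = times[1]

for cores, t in times.items():
    s = base / t                  # speedup vs. 1 core
    eff = s / cores               # parallel efficiency
    print(f"{cores} cores: speedup {s:.2f}x, efficiency {eff:.0%}")
```

Efficiency falls from 100% on 1 core to roughly 60% on 6, which is the signature of a memory-bandwidth-limited solver rather than a compute-limited one.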
|
April 21, 2012, 21:39 |
|
#14 |
Member
Join Date: Mar 2009
Posts: 44
Rep Power: 17 |
Evcelica - I want to thank you for running CFX benchmarks. It's appreciated. I think what you see is what I would expect. You can't scale 6x on 6 processors because CFX, like every other unstructured finite-volume CFD code, is memory-bandwidth limited. I would bet that if you stopped the overclocking you wouldn't see much of a difference in real time (wall time) either. In fact, I would have reported wall time, because in the end the only thing that counts with multiple processors is how fast you get out the other side in real time. I'm not sure if average processor time measures the same thing. Anyway, thank you for reporting this. Was this on a Sandy Bridge or a Westmere?
|
|
April 24, 2012, 01:38 |
|
#15 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Rep Power: 23 |
This was with Sandy Bridge-E. The reason I reported average processor time was that the wall time looks like it is rounded to the nearest second, or truncated. Both 5 and 6 processors gave me a wall time of 7.000 seconds, but I had to look at the average CPU time to see that one was 7.8 and one was 7.3.
I'll try some different overclocks and memory speeds to see what the differences are. |
|
April 30, 2012, 11:07 |
License cost drive efficiency needs
|
#16 |
New Member
Clark Paterson
Join Date: Apr 2012
Location: USA
Posts: 3
Rep Power: 13 |
For all you ANSYS users, be sure to check your core efficiency! We installed a small cluster running MS HPC 2008 R2 at my company. When we were looking at benchmarks, we noticed that 6-core Westmeres were only 25% faster than 4-core Westmeres. You should have gotten a 50% increase if you were going by core count. We went with E5620s and are extremely happy. According to the Intel chip site, this is the best $/MHz chip in the Westmere line. It isn't the fastest by far, but you don't need to pay a price premium for negligible gains. You also pay more per core for a 6-core architecture and higher.
With ANSYS and other commercial codes, you pay by the core. Don't waste your money on licensing inefficient cores. Yes, the 6-core version was a bit faster than the 4-core version. However, if you have a fast interconnect, you'd get better performance from 3 4-core chips instead of 2 6-core chips, and you'd pay about the same in license costs. Yes, with ANSYS you'd be at 8, 32, 128… core runs with fully loaded HPC Packs.

Recently we ran an ANSYS Mechanical benchmark against a Cray at a local university. The clusters are the same age. Our 8 cores solved in 113 minutes. On the Cray, 8 cores took 172 minutes and 24 cores took 132 minutes. Our Lizard tests show 95.8% efficiency per node.

Has anyone seen data for Sandy Bridge showing core efficiency on ANSYS? Back-to-back benchmarks of the 4-, 6-, and 8-core versions would show whether they've opened the pipelines enough to maintain scaling efficiency. |
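The licensing argument above can be made concrete with a little arithmetic. The "25% faster" figure comes from the post itself; treating per-chip performance as additive across nodes (i.e., ignoring interconnect overhead) is a simplifying assumption.

```python
# Same licensed-core count, different chip mix: 2x 6-core vs 3x 4-core.
# Per-chip speeds follow the post's "6-core is only 25% faster" figure;
# perfect scaling across nodes is an assumption.
configs = {
    "2x 6-core": {"chips": 2, "cores": 6, "chip_speed": 1.25},
    "3x 4-core": {"chips": 3, "cores": 4, "chip_speed": 1.00},
}

for name, c in configs.items():
    licensed = c["chips"] * c["cores"]
    throughput = c["chips"] * c["chip_speed"]
    print(f"{name}: {licensed} licensed cores, "
          f"relative throughput {throughput:.2f}")
```

Both configurations license 12 cores, but under these assumptions the three 4-core chips deliver 3.00 vs. 2.50 relative throughput, i.e., about 20% more work for the same license spend.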
|
April 30, 2012, 19:58 |
|
#17 |
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Rep Power: 23 |
Here is what I've found from the available benchmarks for Mechanical and CFD:
Mechanical benchmark (PADT benchmark #7, normalized to 120 iterations):

3930K @ 4.6 GHz (Sandy Bridge-E), 2133 MHz RAM
Cores | Wall time | Speedup | Core efficiency
2 | 2985 | 1.00 (baseline) | 100% (baseline)
4 | 1972 | 1.51 | 75.7%
6 | 1763 | 1.69 | 56.4%

2600K @ 4.8 GHz (Sandy Bridge), 1600 MHz RAM
Cores | Wall time | Speedup | Core efficiency
2 | 3609 | 1.00 (baseline) | 100% (baseline)
4 | 3299 | 1.09 | 54.7%

CFD benchmark (supplied CFX benchmark):

3930K @ 4.4 GHz (Sandy Bridge-E), 2133 MHz RAM
Cores | Wall time | Speedup | Core efficiency
2 | 15.61 s | 1.00 (baseline) | 100% (baseline)
4 | 8.99 s | 1.74 | 86.8%
6 | 7.39 s | 2.11 | 70.4%

I don't have the 4-core anymore, so I can't run the CFX benchmark on it. |
|