CFD Online (www.cfd-online.com) Forums > Hardware

Fastest Processor for CFD

#1 - April 11, 2012, 08:48 - soni7007 (CFDLife)
I am currently using an Intel Nehalem processor (W5580) for my CFD calculations. It was the fastest processor when I bought it a few years back. I need to upgrade now. Is there a faster processor available? Which is the best and fastest? Any benchmarks?

#2 - April 11, 2012, 11:53 - kyle (Austin, TX)
The fastest Xeon E5 is probably the fastest single chip right now. It should be quite a bit faster than the old Nehalems because the memory bandwidth is so much higher.

#3 - April 12, 2012, 07:40 - CapSizer (Charles)
You can go and look at the SPECfp_rate figures on spec.org. These benchmarks are not necessarily fully representative of the performance you can expect in CFD, but they do give a reasonable indication, especially within a processor family. As Kyle has pointed out, if you can afford them, the Xeon E5 processors are almost certainly the quickest ones available at the moment. Do not underestimate the significance of this step up in memory bandwidth. In many CFD codes, it seems to be memory more than anything else that determines speed, and the higher-priced E5 processors are monsters in this regard (50 GB/s vs. 20 or 30 for the older-generation chips).
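To put those bandwidth numbers in perspective, here is a back-of-the-envelope sketch (the 50 GB/s and 25 GB/s figures and core counts are illustrative, roughly in line with the discussion above, not measured values):

```python
# Back-of-the-envelope sketch: memory bandwidth per core if shared evenly.
# Figures are illustrative, not measurements.
def per_core_bandwidth(total_gb_per_s, cores):
    """GB/s of memory bandwidth available to each core, assuming an even split."""
    return total_gb_per_s / cores

print(per_core_bandwidth(50.0, 8))   # 8-core Xeon E5-class chip -> 6.25
print(per_core_bandwidth(25.0, 4))   # older quad-core Nehalem-era chip -> 6.25
```

Note that both come out to about the same per-core figure: unless total bandwidth grows along with core count, a bandwidth-bound CFD code gains little from the extra cores.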

#4 - April 12, 2012, 08:51 - evcelica (Erik)
I have found the memory controller in the Sandy Bridge E chips, like the Xeon E5, to not be capable of pushing the memory anywhere close to 50 GB/s. On my 3930K with 2133 MHz RAM I'm showing more like 18 GB/s in synthetic benchmarks, and less in ANSYS Mechanical applications. My i7-2600K actually gives me higher memory bandwidth, and it's only dual-channel 1600 MHz, which is very disappointing.
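For anyone who wants a rough sanity check without a synthetic benchmark suite, here is a crude probe of copy bandwidth (my own sketch, not from the thread; interpreter overhead and single-threaded copying mean it reads well below hardware peak, so treat it only as a floor, not a STREAM replacement):

```python
import time

# Crude memory-bandwidth probe: time one large buffer copy.
# CPython overhead means this badly understates hardware peak.
N = 256 * 1024 * 1024          # 256 MB buffer
src = bytearray(N)
t0 = time.perf_counter()
dst = bytes(src)               # one pass read + one pass write: 2*N bytes moved
dt = time.perf_counter() - t0
print(f"~{2 * N / dt / 1e9:.1f} GB/s effective copy bandwidth")
```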

Regardless, the Sandy Bridge E chips are still probably the fastest single chips, with the Xeon E5-2687W being the fastest of the line. Though for the price of that one CPU, you could build a 4-node cluster of 2500Ks or 2600Ks and mop the floor with the Xeon.

I went with the 3930K (its Xeon twin is the E5-1650), as it seemed it would be very close in performance to the 2687W but costs less than one third the price.

There are a lot of "Euler3D" benchmarks out there (Euler3D is a CFD benchmark). They show a single 3960X beating the CFD performance of two Xeon W5580s, and a dual E5-2687W system just destroying everything:
http://www.tomshardware.com/reviews/...ew,3149-9.html

#5 - April 12, 2012, 09:42 - kyle (Austin, TX)
Quote (originally posted by evcelica):
I have found the memory controller in the Sandy Bridge E chips, like the Xeon E5, to not be capable of pushing the memory anywhere close to 50 GB/s. On my 3930K with 2133 MHz RAM I'm showing more like 18 GB/s in synthetic benchmarks, and less in ANSYS Mechanical applications. My i7-2600K actually gives me higher memory bandwidth, and it's only dual-channel 1600 MHz, which is very disappointing.
Are you sure you have the RAM installed correctly in your 3930K system? It sounds like you are only using two of the four memory channels. You need at least four sticks of RAM, and they have to be installed in the correct slots (assuming you have a motherboard with more than four RAM slots).

I haven't had a chance to play with an X79 system, but all the benchmarks I have seen show the memory bandwidth over 40GB/sec with just 1600MHz RAM.

#6 - April 12, 2012, 13:36 - evcelica (Erik)
I've got 8 DIMMs installed and recognized as 32GB. I've seen benchmarks confirming Sandy Bridge is faster than Sandy Bridge E; I think it depends on the benchmarking software. Check these benchmarks:
http://compare-processors.com/forum/...tion-t225.html
I've read that the four channels were meant not to increase memory speed but to increase memory capacity, but I don't know for sure.

In real-world tests using ANSYS 14.0 for a buckling analysis, the 2600K showed a 14 GB/s I/O rate, and the 3930K showed only 12 GB/s. As far as solve time: with only 2 cores the 2600K was just slightly faster, and with 3 cores the 3930K was slightly faster, by a couple percent in each case.

In other, less I/O-bound problems, the 3930K is incredibly fast. Check out this ANSYS benchmark where I absolutely destroyed everything, even dual-Xeon systems using Tesla GPUs: http://www.padtinc.com/support/bench...07/default.asp


#7 - April 13, 2012, 13:25 - TMG
Any conclusions drawn from running ANSYS on structural problems are not terribly informative in the context of a CFD forum. We run ANSYS (and Abaqus and Nastran) on structural problems and STAR-CCM+ for flow, and the types of hardware we've found best for each are very different. If you have ANSYS licenses, why don't you run Fluent or CFX benchmarks? They would be much more applicable to this particular group.

#8 - April 13, 2012, 14:51 - CapSizer (Charles)
TMG, could you elaborate a bit on the statement "... the types of hardware we've found best for each are very different ..."? It is what one would expect, but it would be very valuable if you were prepared to share more of what you have learnt.

#9 - April 13, 2012, 15:18 - TMG
We haven't done exhaustive comparisons, but typically we run structural codes on single shared-memory machines with maybe 4 CPUs/24 cores (the fastest Xeons du jour), as much memory as we can stuff in, and a RAID-based disk array to speed I/O by striping. With the structural codes, if you don't have really fast I/O, it doesn't matter much how good the rest is. They aren't typically memory-bandwidth driven.

STAR-CCM+ runs on typical InfiniBand-based clusters with 2 CPUs/blade (12 cores with Westmere, soon to be 16 with Sandy Bridge), 24 GB/blade (32 with Sandy Bridge), local disk for swap and OS only (no local data), and a big 10GigE-based parallel file system. When we've tried to run structural codes on clusters, we've had to beef up the memory on each blade (double or triple it), and we also had to install a RAID 0 pair of drives on each blade for scratch I/O.

#10 - April 13, 2012, 17:39 - CapSizer (Charles)
Thanks for that info, TMG. Did you find that it helped to use SSDs for high-speed I/O? Also, your blades have quite a lot of cores; did you ever compare performance between fully loading up each blade and spreading the load out? For example, if you were going to run say 48 cores, would it be quicker to use 4 fully loaded blades, or to spread it out over 24 blades, using only 1 core per CPU?

I'm trying to get a feel for the thin-node (fewer cores per box, with a good IB interconnect) vs. fat-node (fewer nodes, but many cores per node, relying on on-motherboard HyperTransport links rather than fast low-latency links between the nodes) question.

#11 - April 13, 2012, 20:27 - TMG
I haven't ever tried using SSDs, but it certainly sounds like a logical thing to do. You probably don't have to write the solution to them, but if you can direct all the scratch files to SSD, it makes sense. I don't know enough about the speeds and feeds of SSDs vs. multiple high-speed drives, but if it's faster, I would bet it would help.
As far as spreading out the load (I assume you mean for CFD?): on a Westmere I find that 6 cores are between 5 and 6 times faster than 1 core. Running on 8 out of 12 cores on each blade is just about as good as running on 1 core on 8 separate blades. I've never seen a configuration with more than 2 CPUs in 1 blade or node work well (for flow), but I have no problem with 2 CPUs. There are so many ways to look at this, and part of it has to do with how your licenses work. The STAR-CCM+ power licenses don't charge per core, so you don't pay a license penalty for adding another core; even if it doesn't scale perfectly, it's worth it as long as it's faster than not adding it. If you pay a lot for each additional core, then you have to look for ways to get the most out of each one, and maybe that means leaving some empty.
The other thing to think about is that the more cores you have in each box, the more important it is to have higher InfiniBand bandwidth. 16 cores all trying to talk to other blades at the same time must stress your interconnect bandwidth more than 6 or 8 (or 1). With lower core counts, you probably have to worry about latency more than bandwidth. I guess it's not much of a distinction, as everyone is selling QDR if not FDR anyway.
For what it's worth, the Sandy Bridge CPUs, with the 4th memory channel, really look good to us. We've seen some tests, and there is no way they could get good scaling to all 8 cores if the extra channel didn't mean more bandwidth.

#12 - April 20, 2012, 09:06 - soni7007 (CFDLife)
Thanks to all for your inputs. Kyle, CapSizer & evcelica, your suggestions were most valuable. Thank you again!

#13 - April 20, 2012, 19:03 - evcelica (Erik)
OK, so I ran the CFX benchmark file. It's in the Program Files/ANSYS/v14/CFX/Examples folder. Instead of showing wall time, which rounds to the second, I used average processor time. Results for 1-6 threads are as follows.

Cores   Avg processor time   Scaling
1       26.40 sec            1.00
2       15.61 sec            1.69
3       11.06 sec            2.39
4       8.99 sec             2.94
5       7.88 sec             3.35
6       7.39 sec             3.57

The processor was a 3930K overclocked to 4.4 GHz, with 2133 MHz RAM at 9-11-10-28-2 timings. It would be great if others could run this benchmark. Perhaps I could run different RAM timings to see their effect as well.
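The scaling column is just the 1-core time divided by the n-core time; a quick sketch to reproduce it, along with the per-core efficiency it implies:

```python
# Reproduce the scaling figures above from the average processor times (sec).
times = {1: 26.40, 2: 15.61, 3: 11.06, 4: 8.99, 5: 7.88, 6: 7.39}
base = times[1]
for n, t in sorted(times.items()):
    speedup = base / t
    efficiency = speedup / n          # fraction of ideal linear scaling
    print(f"{n} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```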

#14 - April 21, 2012, 20:39 - TMG
Evcelica, I want to thank you for running the CFX benchmarks; it's appreciated. I think what you see is what I'd expect. You can't scale 6x on 6 processors because CFX, like every other unstructured finite-volume CFD code, is memory-bandwidth limited. I would bet that if you stopped the overclocking you wouldn't see much of a difference in real (wall) time either. In fact, I would have reported wall time, because in the end the only thing that counts with multiple processors is how fast you get out the other side in real time; I'm not sure average processor time measures the same thing. Anyway, thank you for reporting this. Was this on a Sandy Bridge or a Westmere?
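The bandwidth-limited argument can be made quantitative with Amdahl's law (my framing, not TMG's): treat the non-scaling, bandwidth-bound portion of the run as an effective serial fraction and invert the formula against the observed speedup.

```python
# Amdahl's-law sketch: what effective serial (or bandwidth-bound) fraction s
# would explain an observed speedup S on n cores?
#   S = 1 / (s + (1 - s)/n)   =>   s = (n/S - 1) / (n - 1)
def fit_serial_fraction(observed_speedup, n):
    return (n / observed_speedup - 1.0) / (n - 1.0)

s = fit_serial_fraction(3.57, 6)   # 6-core CFX speedup reported in the thread
print(f"implied serial/bandwidth-bound fraction: {s:.1%}")   # -> 13.6%
```

Under this (simplistic) model, roughly an eighth of the work behaves as if it cannot be parallelized, which is consistent with a memory-bandwidth ceiling rather than a code defect.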

#15 - April 24, 2012, 00:38 - evcelica (Erik)
This was with Sandy Bridge E. The reason I reported average processor time was that wall time looks like it was rounded or truncated to the nearest second. Both 5 and 6 processors gave me a wall time of 7.000 seconds, but I had to look at the average CPU time to see that one was 7.8 and the other 7.3.
I'll try some different overclocks and memory speeds to see what the differences are.

#16 - April 30, 2012, 10:07 - cfpaters
License costs drive efficiency needs
For all you ANSYS users, be sure to check your core efficiency! We installed a small cluster running MS HPC 2008 R2 at my company. When we were looking at benchmarks, we noticed that 6-core Westmeres were only 25% faster than 4-core Westmeres; you would have expected a 50% increase going by core count alone. We went with E5620s and are extremely happy. According to Intel's chip site, this is the best $/MHz chip in the Westmere line. It isn't the fastest by far, but you don't need to pay a price premium for negligible gains. You also pay more per core for a 6-core architecture and higher.

With ANSYS and other commercial codes, you pay by the core, so don't waste your money licensing inefficient cores. Yes, the 6-core version was a bit faster than the 4-core version. However, if you have a fast interconnect, you'd get better performance from 3 4-core chips than from 2 6-core chips, and you'd pay about the same in license costs. And with ANSYS you'd still be at 8, 32, 128... core runs with fully loaded HPC Packs.
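The trade-off above can be sketched with toy numbers (mine, built from the ~25% figure; real scaling across extra sockets will not be perfectly linear, so treat this as an upper bound for the 3-chip case):

```python
# Toy license-efficiency comparison: 12 licensed cores either way.
quad_speed = 1.00                   # one 4-core Westmere chip, baseline
six_speed = 1.25                    # one 6-core chip, ~25% faster per the post
three_quads = 3 * quad_speed        # 12 cores as 3 x 4-core, assumes linear scaling
two_sixes = 2 * six_speed           # 12 cores as 2 x 6-core
print(three_quads, two_sixes)       # 3.0 2.5 -> more throughput per license core
```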

Recently we ran an ANSYS Mechanical benchmark against a Cray at a local university; the clusters are the same age. Our 8 cores solved in 113 minutes. On the Cray, 8 cores took 172 minutes and 24 cores took 132 minutes. Our Lizard tests show 95.8% efficiency per node.



Has anyone seen data for Sandy Bridge showing core efficiency in ANSYS? Back-to-back benchmarks of the 4-, 6-, and 8-core versions would show whether they've opened the pipelines enough to maintain scaling efficiency.

#17 - April 30, 2012, 18:58 - evcelica (Erik)
Here's what I've found from available benchmarks for Mechanical and CFD:

Mechanical benchmark (PADT Benchmark #7, normalized to 120 iterations)

3930K @ 4.6 GHz (Sandy Bridge E), 2133 MHz RAM
Cores   Wall time   Speedup           Core efficiency
2       2985        1.00 (baseline)   100% (baseline)
4       1972        1.51              75.7%
6       1763        1.69              56.4%

2600K @ 4.8 GHz (Sandy Bridge), 1600 MHz RAM
Cores   Wall time   Speedup           Core efficiency
2       3609        1.00 (baseline)   100% (baseline)
4       3299        1.09              54.7%

CFD benchmark (supplied CFX benchmark)

3930K @ 4.4 GHz (Sandy Bridge E), 2133 MHz RAM
Cores   Wall time   Speedup           Core efficiency
2       15.61       1.00 (baseline)   100% (baseline)
4       8.99        1.74              86.8%
6       7.39        2.11              70.4%

I don't have the 4-core anymore, so I can't run the CFX benchmark on it.
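For anyone checking these tables: the core-efficiency column takes the 2-core run as its baseline, so efficiency at n cores is (t2/tn)/(n/2). A quick sketch to verify two of the entries:

```python
# Core efficiency relative to the 2-core baseline used in the tables above.
def core_efficiency(t2, tn, n):
    """Speedup over the 2-core run, divided by the ideal speedup n/2."""
    return (t2 / tn) / (n / 2)

print(f"{core_efficiency(2985, 1763, 6):.1%}")   # Mechanical, 3930K, 6 cores -> 56.4%
print(f"{core_efficiency(15.61, 7.39, 6):.1%}")  # CFX, 3930K, 6 cores -> 70.4%
```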
