www.cfd-online.com
Home > Forums > General Forums > Hardware

Hardware check: 32 core setup for CFD

November 17, 2017, 05:45   #1
SLC
Member
Join Date: Jul 2011
Posts: 53
Hi,

I would very much appreciate a second set of eyes on my pick of hardware for a 32 core compute setup for Fluent/CFX.

My company is exclusively supplied by Dell, and I cannot "build my own," so to speak.

My potential build consists of 2 nodes (Dell Precision 7920), each with:

2 x Intel Xeon Gold 6134 - 3.2 GHz - 8C
12 x 8 GB DDR4 2666 MHz (ECC) - 96 GB total

So that's 32 cores spread over 4 sockets across 2 separate nodes (with six DIMMs per socket, i.e. one DIMM per memory channel).

This appears to hit the sweet spot for a powerful yet affordable setup.

Will I be severely bottlenecked if I connect the two nodes using 10 GbE? Would going to a faster interconnect be advisable?

Thanks

November 17, 2017, 06:56   #2
flotus1
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 2,897
No objections on the build itself.
Just two comments:
1) make sure to get dual-rank DIMMs
2) consider a quad-socket setup. The Xeon Gold 6134 you picked are actually a good choice for quad-socket thanks to their 3 UPI links.
This way you avoid any hassle with node interconnects and maybe even save a few bucks. If you do not need 2 dedicated workstations for whatever reason, I highly recommend quad-socket instead.

If you need a node interconnect, you can always try 10Gbit Ethernet first and see whether it affects your scaling: run a 16-core job on one machine, then run the same job spread across both machines with 8 cores each, and compare the timings.
If you want a faster interconnect for only two nodes, all you need are two additional InfiniBand cards; no switch required. I have seen benchmarks where 10Gbit Ethernet did not impose a serious bottleneck for a small number of nodes, but results may vary.
Then again, UPI links (quad-socket) are faster than InfiniBand.
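To put a number on that scaling test, you can compute parallel efficiency from the two timings. A minimal sketch (the per-iteration times below are hypothetical placeholders, not real benchmark data):

```python
# Sketch: quantify the interconnect penalty from the two runs described
# above. The per-iteration wall-clock times are hypothetical placeholders;
# substitute your own Fluent/CFX timings.

def parallel_efficiency(t_ref, cores_ref, t_test, cores_test):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = t_ref / t_test
    ideal = cores_test / cores_ref
    return speedup / ideal

# Same 16-core job: once on one node, once split 8+8 over two nodes.
t_single_node = 100.0  # s/iteration, 16 cores on one machine (placeholder)
t_two_nodes = 110.0    # s/iteration, 8+8 cores over 10GbE (placeholder)

eff = parallel_efficiency(t_single_node, 16, t_two_nodes, 16)
print(f"two-node efficiency relative to one node: {eff:.2f}")
```

If the efficiency stays close to 1.0, the 10GbE link is not hurting you much; a value well below 1.0 points at the interconnect.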

November 17, 2017, 07:52   #3
SLC
Member
Join Date: Jul 2011
Posts: 53
Quote:
Originally Posted by flotus1 View Post
No objections on the build itself.
Just two comments:
1) make sure to get dual-rank DIMMs
2) consider a quad-socket setup. The Xeon Gold 6134 you picked are actually a good choice for quad-socket thanks to their 3 UPI links.
This way you avoid any hassle with node interconnects and maybe even save a few bucks. If you do not need 2 dedicated workstations for whatever reason, I highly recommend quad-socket instead.

If you need a node interconnect, you can always try 10Gbit Ethernet first and see whether it affects your scaling: run a 16-core job on one machine, then run the same job spread across both machines with 8 cores each, and compare the timings.
If you want a faster interconnect for only two nodes, all you need are two additional InfiniBand cards; no switch required. I have seen benchmarks where 10Gbit Ethernet did not impose a serious bottleneck for a small number of nodes, but results may vary.
Then again, UPI links (quad-socket) are faster than InfiniBand.
Thanks for the feedback, much appreciated.

I've looked into a quad-socket setup from Dell, but it actually works out to approximately twice the price (!) of two dual-socket machines. So I could get four dual-socket machines for the price of one quad-socket machine. I think the reason is that the dual-socket machine falls within Dell's Precision line, whereas the quad-socket machines are in the more expensive PowerEdge line.

As for the interconnect, our server room is already set up with 10 GbE, and we've got spare cards and switches available. So I will probably do as you say: start off with that and see how it performs.

When it comes to single- vs dual-rank memory, I am unsure! Dell's online configurator doesn't mention rank; it just states "96 GB (12 x 8 GB) 2666 MHz DDR4 RDIMM ECC." I will contact our Dell sales rep and ask.
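For future reference: DIMM rank can also be checked in software once the machines arrive. A sketch that parses the output of Linux's `dmidecode -t memory`; the sample text below is an inlined assumption of dmidecode's usual format, and on a real system you would capture the command's actual output:

```python
import re

# Excerpt in the format `dmidecode -t memory` typically prints (assumed
# sample; on a real Linux machine replace with, e.g.:
#   subprocess.check_output(["dmidecode", "-t", "memory"], text=True))
SAMPLE = """\
Memory Device
\tSize: 8 GB
\tSpeed: 2666 MT/s
\tRank: 2
Memory Device
\tSize: 8 GB
\tSpeed: 2666 MT/s
\tRank: 2
"""

# Collect the rank reported for each populated DIMM slot.
ranks = [int(r) for r in re.findall(r"^\s*Rank:\s*(\d+)", SAMPLE, re.M)]
print("all dual-rank:", all(r == 2 for r in ranks), f"({len(ranks)} modules)")
```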

Thanks again!

November 27, 2017, 12:13   #4
SLC
Member
Join Date: Jul 2011
Posts: 53
I have 2 x HPC Packs, so I can run on 32 simultaneous cores. We will not be purchasing additional HPC Packs in the foreseeable future.

Will I see a performance hit in network and I/O activity if I'm running CFD calculations on *all* my available physical cores? Does, for example, an InfiniBand/10GbE connection require dedicated CPU resources to function well while a simulation is running?

I'm considering going for dual 12 core CPUs (Intel Xeon Gold 6136 or 6146) instead of dual 8 core Xeon Golds - which will mean I have spare compute power even when fully utilizing my 32 core HPC license. So I would be running on 32 out of the 48 available cores, leaving cores free for network and I/O duties. Are there specific downsides to this?

Granted, the 6146 is rather expensive, but so be it. I could also consider the Xeon Gold 6144 (8C, 3.5 GHz), but then I wouldn't have any spare compute power again.


These are my options, I think:

2 x Intel Xeon Gold 6134 - 8C, 3.2 GHz (Turbo to 3.7 GHz when running on 8 cores), 8 MiB L2 cache
2 x Intel Xeon Gold 6136 - 12C, 3.0 GHz (Turbo to 3.6 GHz when running on 8 cores), 12 MiB L2 cache
2 x Intel Xeon Gold 6144 - 8C, 3.5 GHz (Turbo to 4.1 GHz when running on 8 cores), 8 MiB L2 cache
2 x Intel Xeon Gold 6146 - 12C, 3.2 GHz (Turbo to 4.0 GHz when running on 8 cores), 12 MiB L2 cache

Ignoring the price differences, and assuming my memory configuration is optimal and the same for all CPUs, how would I get the best performance out of my 32-core HPC license?

Last edited by SLC; November 27, 2017 at 15:26.

November 27, 2017, 15:53   #5
kyle
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 160
You might see a slight performance hit using every core, but not one large enough that spending an additional $4000 would make sense.

If you've got more money to spend, why not get 3 machines with the 6-core Xeon Gold 6128? It would be significantly faster than either of your suggestions because you'd have 50% more memory bandwidth and a little more cache. You'd still get the small bump from having unused cores.

I wouldn't even bother with Ethernet. Second-hand FDR InfiniBand equipment costs next to nothing and configuration is trivial.

Edit - Also, you shouldn't be worried about how much L2 cache you are getting. L2 cache is private to each core, and it is the same per core for every model. No matter what, if you're running on 32 cores, you're utilizing 32 MiB of L2 cache; the rest sits idle. Additional L3 cache, however, would be a benefit, since it is shared.
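The 50% figure follows from simple arithmetic: each Skylake-SP socket has six DDR4-2666 channels, and the aggregate theoretical bandwidth gets divided over the 32 licensed cores. A rough sketch (theoretical peak numbers, not measured bandwidth):

```python
# Back-of-the-envelope check: Skylake-SP has 6 DDR4-2666 channels per
# socket, 8 bytes per transfer -> theoretical peak per socket.
BW_PER_SOCKET_GBS = 6 * 2666e6 * 8 / 1e9  # ~128 GB/s

def bw_per_active_core(sockets, active_cores=32):
    """Aggregate theoretical bandwidth shared by the licensed cores."""
    return sockets * BW_PER_SOCKET_GBS / active_cores

two_nodes = bw_per_active_core(4)    # 2 x dual-socket (e.g. Gold 6134)
three_nodes = bw_per_active_core(6)  # 3 x dual-socket (e.g. Gold 6128)
print(f"{two_nodes:.0f} vs {three_nodes:.0f} GB/s per active core "
      f"(+{(three_nodes / two_nodes - 1) * 100:.0f}%)")
```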

November 27, 2017, 16:35   #6
SLC
Member
Join Date: Jul 2011
Posts: 53
Quote:
Originally Posted by kyle View Post
You might see a slight performance hit using every core, but not one large enough that spending an additional $4000 would make sense.

If you've got more money to spend, why not get 3 machines with the 6-core Xeon Gold 6128? It would be significantly faster than either of your suggestions because you'd have 50% more memory bandwidth and a little more cache. You'd still get the small bump from having unused cores.

I wouldn't even bother with Ethernet. Second-hand FDR InfiniBand equipment costs next to nothing and configuration is trivial.

Edit - Also, you shouldn't be worried about how much L2 cache you are getting. L2 cache is private to each core, and it is the same per core for every model. No matter what, if you're running on 32 cores, you're utilizing 32 MiB of L2 cache; the rest sits idle. Additional L3 cache, however, would be a benefit, since it is shared.
Thanks for your feedback.

Yeah, I'm going for FDR InfiniBand. I'll have to buy it new direct from Dell, but they gave me a decent quote on a pair of Mellanox ConnectX-3 VPI FDR InfiniBand cards.

I didn't realise that about L2 and L3 cache - thanks for the pointer.

The delta in terms of $$ for me in going from the Xeon Gold 6134 to:

4 x Intel Xeon Gold 6134: --------
4 x Intel Xeon Gold 6136: + $1000
4 x Intel Xeon Gold 6144: + $3200
4 x Intel Xeon Gold 6146: + $5000

Changing the setup into a three-node build with the Xeon Gold 6128 is prohibitively expensive, unfortunately. It would be an extra ~$15,000 USD overall, even after considering the lower price point of the actual CPUs. Having to get 12 sticks of RAM for each node makes it pricey! I'd also have to get an InfiniBand switch (I think?).

So perhaps the question is: should I go for the 6134 (8C) or the 6136 (12C)?

November 29, 2017, 04:42   #7
SLC
Member
Join Date: Jul 2011
Posts: 53
Quote:
Originally Posted by flotus1 View Post
No objections on the build itself.
Just two comments:
1) make sure to get dual-rank DIMMs
2) consider a quad-socket setup. The Xeon Gold 6134 you picked are actually a good choice for quad-socket thanks to their 3 UPI links.
This way you avoid any hassle with node interconnects and maybe even save a few bucks. If you do not need 2 dedicated workstations for whatever reason, I highly recommend quad-socket instead.

If you need a node interconnect, you can always try 10Gbit Ethernet first and see whether it affects your scaling: run a 16-core job on one machine, then run the same job spread across both machines with 8 cores each, and compare the timings.
If you want a faster interconnect for only two nodes, all you need are two additional InfiniBand cards; no switch required. I have seen benchmarks where 10Gbit Ethernet did not impose a serious bottleneck for a small number of nodes, but results may vary.
Then again, UPI links (quad-socket) are faster than InfiniBand.
Quote:
Originally Posted by SLC View Post
Thanks for your feedback.

The delta in terms of $$ for me in going from the Xeon Gold 6134 to:

4 x Intel Xeon Gold 6134: --------
4 x Intel Xeon Gold 6136: + $1000
4 x Intel Xeon Gold 6144: + $3200
4 x Intel Xeon Gold 6146: + $5000

So perhaps the question is: should I go for the 6134 (8C) or the 6136 (12C)?
flotus1, what are your thoughts on the choice of CPU?

Is there any point in going for a 12C CPU in order to ensure there is spare processing power when running a simulation on 8 of those cores, compared to an 8C CPU where all 8 cores will be running at 100% load? The base clock of the 8C CPU is 3.2 GHz, whereas for the 12C CPU it is 3.0 GHz.

November 29, 2017, 06:27   #8
flotus1
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 2,897
Yeah, base clocks... nobody except Intel knows what turbo frequencies these CPUs will run at in non-AVX or mild-AVX workloads. An educated guess would be that the 12-core CPU with only 8 cores loaded will run at similar or even higher clock speeds than the 8-core CPU, thanks to its higher TDP (150 W vs 130 W):
https://www.microway.com/knowledge-c...e-family-cpus/

I don't think having idle CPU cores to handle communication will be much of a benefit. But then again, the price difference is "only" $1000, which should be small compared to the total system cost. And you get the CPUs with potentially higher clock speeds.
Intel's Skylake-SP lineup seems to be missing a comparable 10-core CPU, which is kind of weird given that it consists of at least 58 different CPUs.

November 30, 2017, 12:02   #9
SLC
Member
Join Date: Jul 2011
Posts: 53
Quote:
Originally Posted by flotus1 View Post
Yeah, base clocks... nobody except Intel knows what turbo frequencies these CPUs will run at in non-AVX or mild-AVX workloads. An educated guess would be that the 12-core CPU with only 8 cores loaded will run at similar or even higher clock speeds than the 8-core CPU, thanks to its higher TDP (150 W vs 130 W):
https://www.microway.com/knowledge-c...e-family-cpus/

I don't think having idle CPU cores to handle communication will be much of a benefit. But then again, the price difference is "only" $1000, which should be small compared to the total system cost. And you get the CPUs with potentially higher clock speeds.
Intel's Skylake-SP lineup seems to be missing a comparable 10-core CPU, which is kind of weird given that it consists of at least 58 different CPUs.
Thanks for your help.

I think I've landed on the Xeon Gold 6144.

December 6, 2017, 18:05   #10
Micael
Senior Member
Join Date: Mar 2009
Location: Canada
Posts: 155
It would be great if you could post a few benchmarks once you have your system running.
