
Optimal 32-core system

July 3, 2013, 06:23   #1
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
My company has approved an upgrade of our CFD licenses to ANSYS CFX to enable 32-core parallel processing, so I am about to configure a hardware solution that can utilize up to 32 cores. As the cost of the HPC licenses is high relative to the hardware, my focus is on getting the highest performance out of the 32 parallel processes. I will be running cases of up to 10-40 million cells: oil spray combustion with radiation. Based on my initial investigations I came up with a system consisting of 2 PCs of 16 cores each, connected as a mini-cluster. The cost of a master node in a rack system is relatively high for such a small setup.

2 identical PCs, each with:
Dual Xeon E5-2670 CPUs (8 cores each)
8 x 4 GB RAM, 1600 MHz (2 sets of 4 memory channels) = 32 GB
HDD: 250 GB, 7200 rpm for Windows
One of the machines will also get a 1 TB, 10,000 rpm SATA-600 disk for the CFD files.
Interconnect for the mini-cluster: 1 Gigabit or faster
Interconnect from the mini-cluster to the user workstation (pre/post): 1 Gigabit

My questions regarding this configuration:

1) I expect the performance of the individual PCs to be limited by memory bandwidth, so it might be a waste of money to go for higher-speed CPUs. But which CPU fits the available memory bandwidth: E5-2690, 2687W, 2680, 2670, 2667, 2665 or 2660? (A rough bandwidth-per-core sketch is included further down in this post.)
I can see their SPECfp_rate results at http://www.spec.org/cpu2006/results/rfp2006.html
- but that might not tell me what their performance is in practice if memory bandwidth is the bottleneck.

2) My ANSYS reseller has run the attached performance benchmark on a similar system. The decrease in scaling from 16 to 32 cores indicates that the 1 Gigabit cluster interconnect used there is limiting performance. So I would like to know how fast an interconnect (transfer rate and latency) I need to get full scaling from 16 to 32 cores. I guess that once the speed of the interconnect approaches that of the RAM, increasing it further would not gain performance? Does anyone have experience with this, or can anyone point to results?

3) How fast disks would you recommend? Currently I am running an SSD, and it still takes a few minutes to load a large case in CFX-Pre, but I don't know whether that is limited by disk speed. My plan so far is 1 TB, 10,000 rpm disks in RAID 0. Pre/post will run on the local workstation (1 Gigabit connection) but use the files on the cluster.

4) A lot of people on this forum prefer the i7. However, I would then need 5 PCs to get 30 cores, which would further increase the demands on the cluster interconnect and most likely reduce scaling. Moreover, the i7 does not seem to compete with the fastest Xeon E5 on raw single-CPU performance:
http://www.cpubenchmark.net/high_end_cpus.html
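
As a rough illustration of question 1: all the 8-core E5-2600 SKUs hang the same four DDR3-1600 channels off each socket, so a back-of-envelope comparison of bandwidth per core and bytes per peak FLOP is easy to do. The core counts and base clocks below are quoted from memory of Intel's published specs, so treat them as assumptions:

Code:
# Back-of-envelope comparison of Sandy Bridge-EP Xeon E5-2600 options for a
# memory-bandwidth-bound CFD solver.  Core counts and base clocks are quoted
# from memory of Intel's published specs -- double-check before ordering.
CHANNEL_BW_GB_S = 12.8        # one DDR3-1600 channel: 1600 MT/s * 8 bytes
CHANNELS_PER_SOCKET = 4

cpus = {                      # name: (cores, base clock in GHz)
    "E5-2690":  (8, 2.9),
    "E5-2687W": (8, 3.1),
    "E5-2680":  (8, 2.7),
    "E5-2670":  (8, 2.6),
    "E5-2667":  (6, 2.9),
    "E5-2665":  (8, 2.4),
    "E5-2660":  (8, 2.2),
    "E5-2643":  (4, 3.3),
}

socket_bw = CHANNEL_BW_GB_S * CHANNELS_PER_SOCKET   # 51.2 GB/s for every SKU
for name, (cores, ghz) in cpus.items():
    peak_gflops = cores * ghz * 8     # 8 double-precision FLOPs/cycle with AVX
    print(f"{name}: {socket_bw / cores:.1f} GB/s per core, "
          f"{socket_bw / peak_gflops:.2f} bytes per peak FLOP")
If the solver really is bandwidth-bound, the faster clocks mostly just lower the bytes-per-FLOP ratio, which is why I suspect a mid-range SKU like the E5-2670 is the sweet spot.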

I would really appreciate your input on my questions above!

Best regards
Kim Bindesbøll
Attached Images
File Type: jpg Scaling Xeon E5-2670.jpg (46.4 KB, 92 views)

July 3, 2013, 07:20   #2
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
The important bit of information that they didn't give you about that benchmark test is how the <32-process jobs were distributed. Meaning, when they ran 8 processes, did they run them all on one machine, or all on one CPU, or did they distribute them 2-2-2-2 across the 4 CPUs? That would be a meaningful way of getting some information about how well the interconnect is performing. On the face of it the scaling does not look that good, even going from 4 cores to 8, let alone 32. I would imagine that things would improve if you could use InfiniBand between the two Xeon machines. It is possible to set that up point to point, without a switch, and if you use second-hand cards from eBay it would be inexpensive. But check the benchmark first; it could be as simple as the benchmark case being too small, although I would imagine that 10 million cells should be large enough to scale well over 32 cores.

July 3, 2013, 07:32   #3
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
I agree. I asked that question but they did not know. They guessed that the distribution of cores running was:
2 = 2-0-0-0
4 = 4-0-0-0
8 = 8-0-0-0
16 = 8-8-0-0
32 = 8-8-8-8

The case run was 10 million cells, 1,000 oil spray droplets, combustion (EDM, 2 reactions) and radiation. This is a typical heavy case for my work, and I think the size is adequate for a 32-core cluster.

If I can connect the cluster with InfiniBand directly, without a switch, I also consider that a high-performance solution for the money. Do you have any idea whether 10 Gbit/s is sufficient, or whether 40 Gbit/s would increase performance? And what Gbit/s would the 1600 MHz RAM correspond to?

July 3, 2013, 07:48   #4
Senior Member
Charles
Join Date: Apr 2009
Posts: 185
Quote:
Originally Posted by bindesboll
Do you have any idea if 10 Gbit/s is sufficient or a 40 Gbit/s would increase performance. What Gbit/s would the 1600 MHz RAM correspond to?
I don't think the inter-node memory bandwidth is that important; or if it is, you are in big trouble. You don't want to shift significant amounts of data between the various memory banks. This is where NUMA and affinity come in: a solver process should stay on one core, or at least on one physical CPU, and its data should stay on the associated memory bank. If you get this right (Open MPI can enforce it, for example), the amount of data exchanged between the nodes should be reasonably small, and I don't think actual network bandwidth vs. memory bandwidth is really relevant. What does count for a lot is latency, and there I think even 10 Gb IB is much, much better than Gigabit Ethernet, and not necessarily much worse than 40 Gb. In any event, it is probably fairly cheap to test with second-hand 10 Gb cards.
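
To put rough numbers on the bandwidth part of the question (a back-of-envelope sketch, assuming standard DDR3-1600 peak figures rather than anything measured):

Code:
# Rough comparison of per-socket memory bandwidth against common
# interconnects (peak figures, ignoring protocol overhead and latency).
ddr3_1600_channel = 1600e6 * 8              # bytes/s per channel = 12.8 GB/s
socket_bw = 4 * ddr3_1600_channel           # 4 channels -> 51.2 GB/s per CPU

links_gbit = {"Gigabit Ethernet": 1, "10 Gb IB/Ethernet": 10, "QDR InfiniBand (40 Gb)": 40}
for name, gbit in links_gbit.items():
    link_bytes = gbit * 1e9 / 8
    print(f"{name}: local memory is ~{socket_bw / link_bytes:.0f}x faster")
So no realistic interconnect comes anywhere near local memory bandwidth; the point is to keep each process and its data on its own socket, and then worry about latency rather than about matching the RAM speed.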

July 3, 2013, 10:50   #5
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
Quote:
What does count for a lot is latency, and there I think even 10 Gb IB is much, much better than Gb ethernet, and not necessarily much worse than 40 Gb.
Thanks a lot! I get the point regarding cluster-interconnect.

1) Regarding the choice of CPU, I just did an analysis of the total cost of HPC licenses plus hardware per unit of performance, comparing a system based on the E5-2670 with one based on the E5-2690. It showed that the E5-2690 system has a higher cost per performance.
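
For illustration, the arithmetic behind such a comparison looks roughly like the sketch below; the prices and the relative-performance figure are made-up placeholders, not our actual quotes or benchmark results:

Code:
# Illustrative cost-per-performance comparison.  Every number below is a
# placeholder, not a real quote or benchmark result.
def cost_per_perf(license_cost, hardware_cost, relative_performance):
    return (license_cost + hardware_cost) / relative_performance

licenses = 100_000.0                         # placeholder: same HPC licenses either way
e5_2670 = cost_per_perf(licenses,  9_000.0, 1.00)
e5_2690 = cost_per_perf(licenses, 13_000.0, 1.01)  # tiny gain if bandwidth-bound

print(f"E5-2670 system: {e5_2670:,.0f} per unit of performance")
print(f"E5-2690 system: {e5_2690:,.0f} per unit of performance")
Because the license cost dominates and the extra clock speed buys little in a memory-bandwidth-bound solver, the hardware premium of the E5-2690 translates almost directly into a higher cost per performance.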

Any input regarding questions 3) and 4)?

July 3, 2013, 17:52   #6
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
You may want to consider getting four machines with dual quad-core Xeon E5-2643 CPUs.

See this thread: http://www.cfd-online.com/Forums/cfx...dware-cfx.html

July 3, 2013, 22:06   #7
Senior Member
mjgraf
Join Date: Dec 2009
Posts: 131
Did your supplier run the benchmark with the 32 GB per machine as spec'd?

RAM may be your limiting factor: with 10 million nodes and all those sets of equations, the job may be swapping. Make sure the reseller checks for this with the performance monitor.

I recently ran a 15 million node job, single phase, and it wanted about 36 GB of RAM for the solver on a single 16-core E5-2690 node. When you distribute, the memory requirement per partition is reduced, so you have some play.
Also, and this is a common oversight, the master node that partitions the job must be able to load the entire model into memory to perform the partitioning. There are now options for parallel partitioning, but I have not tried them yet. If you plan on creating 40 million node jobs, I am pretty confident that 32 GB of RAM is not enough. ICEM Tetra will most likely not create the mesh in 32 GB of RAM; Hexa should be OK. Other experiences, anyone?

Since it is only two machines, Gigabit should be fine, provided the job is ensured to run in RAM with no swapping. The biggest concern with running distributed (MPI) is latency, and you should not have a large problem with only two nodes. For extra bandwidth you can always team the NICs.

Also, the choice of MPI is somewhat important. Which CFX version? Test with both Platform and Intel distributed parallel to check performance for the hardware in use. I only run on Linux, so I cannot comment on how Windows performs.

Hard disks are not that critical for CFX work; a few 10k or 15k drives in RAID 0 will be fine. In my experience the load times are not affected greatly by local vs. networked access. If you are writing a lot of large .bak or .trn files over a GigE connection, however, that would slowly add a lot of time to the wall clock time of the solution.

Just some thoughts. Make sure the ANSYS reseller does the job right so you are happy with the end result. You have a complex problem; good luck.

July 3, 2013, 23:03   #8
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Quote:
Originally Posted by mjgraf
a few 10k or 15k drives in RAID0 will be fine.
1995 called, they want their hard drives back!

July 3, 2013, 23:21   #9
Senior Member
mjgraf
Join Date: Dec 2009
Posts: 131
You spec for the application. Why over-spec and spend more than necessary?

July 4, 2013, 04:32   #10
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
My test case is 10 million cells (5 million nodes), CFX 14.0, in the future 15.0. Steady state, SST turbulence model, 1,000 oil droplets, evaporation, combustion with 2 reactions and 6 gas species, discrete transfer radiation model. This is a typical heavy case for my present work, currently running on a quad-core Xeon X5687 at 3.6 GHz with 24 GB of 1333 MHz RAM. In the future I expect the same type of problem, but with up to 4 times more nodes.

Quote:
Originally Posted by mjgraf
Also, and this is a common oversight, the master node that partitions the job must be able to load the entire model into memory to perform the partitioning.
The above-mentioned case takes up 5 GB when loaded in CFX-Pre and 18 GB when running the solver with 4 parallel processes. So does "load the entire model into memory" on the cluster head node mean 5 GB or 18 GB, or something in between? I expect the RAM requirement to increase proportionally to the number of mesh nodes (a rough scaling estimate is sketched below). If 32 GB on the cluster head node is not enough, I guess the solution could be 64 GB on the head node and 32 GB on the slave node?
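
My own rough extrapolation, assuming memory simply grows linearly with node count (just a sketch, not based on any CFX documentation):

Code:
# Linear extrapolation from the current 5M-node case to a future case with
# up to 4x the nodes.  Assumes memory scales with node count, which is only
# a rough approximation.
current_nodes     = 5e6
current_pre_gb    = 5.0     # CFX-Pre with the case loaded
current_solver_gb = 18.0    # solver total, 4 partitions on one machine

scale = 4.0                 # future case: up to 4x more nodes
future_pre_gb    = current_pre_gb * scale       # ~20 GB on the machine doing pre/partitioning
future_solver_gb = current_solver_gb * scale    # ~72 GB in total across all partitions
per_machine_gb   = future_solver_gb / 2         # split over the 2 cluster nodes -> ~36 GB each

print(future_pre_gb, future_solver_gb, per_machine_gb)
If that holds, 32 GB per box looks marginal for the solver on the biggest cases, which is why I am asking whether only the head node needs the extra RAM.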

Quote:
Originally Posted by mjgraf
ICEM tetra will most likely not create the mesh in 32GB of RAM, HEXA should be ok
I expect to run pre/post (meshing) on my present workstation (specifications above) and the solver on the 32-core cluster.

Quote:
Originally Posted by mjgraf
If you are writing a lot of large bak or trn files over a GigE connection, this would slowly add a lot of time to your wall clock time for the solution.
As I only have 2 cluster nodes, I do not need a switch, just 2 network adapters, so InfiniBand becomes affordable. Since the solver runs on the cluster and the CFD files are also stored there, I do not expect this to be a problem.

July 4, 2013, 10:13   #11
Senior Member
mjgraf
Join Date: Dec 2009
Posts: 131
I had a nice lengthy post composed and my browser lost it.

In a nutshell, your config is good for your CELL count (I previously missed this and assumed nodes), but watch the memory usage.

Use the out-file memory statistics for your jobs.
The partitioner should be fine for these small 40 million cell jobs with 32 GB of RAM.
The solver will be close.
Get a good estimate using the following section of the out file.

Code:
 +--------------------------------------------------------------------+
 |        Memory Allocated for Run  (Actual usage may be less)        |
 +--------------------------------------------------------------------+

 Allocated storage in:    Kwords
                          Words/Node
                          Words/Elem
                          Kbytes
                          Bytes/Node

 Partition | Real       | Integer    | Character| Logical  | Double
 ----------+------------+------------+----------+----------+----------
     Total |  4414191.3 |  1355571.6 |  56603.6 |   1040.0 |    128.0
           |     348.23 |     106.94 |     4.47 |     0.08 |     0.01
           |     138.90 |      42.65 |     1.78 |     0.03 |     0.00
           | 34485869.6 |  5295201.6 |  55277.0 |   1015.6 |   1000.0
           |    2785.83 |     427.76 |     4.47 |     0.08 |     0.08
 ----------+------------+------------+----------+----------+----------
Sum the bottom row (Bytes/Node), multiply by your projected NODE count, and divide by 1024^3 to get a projected TOTAL GB of RAM required.
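
For example, using the Bytes/Node row from the table above (a quick sketch; plug in the numbers from your own out file):

Code:
# Projected solver memory from the "Memory Allocated for Run" table:
# sum the Bytes/Node row, scale by the node count, convert to GB.
bytes_per_node = 2785.83 + 427.76 + 4.47 + 0.08 + 0.08  # Real + Integer + Character + Logical + Double

for nodes in (5e6, 10e6, 20e6):
    total_gb = bytes_per_node * nodes / 1024**3
    print(f"{nodes / 1e6:.0f}M nodes -> ~{total_gb:.0f} GB total across all partitions")
Divide the total by the number of machines to see roughly what each box needs; that is why I say 32 GB per node will be close for the larger jobs.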

We typically used a rule of thumb of 150k to 200k NODES per partition. Anything smaller and you risk slowing the job down with interconnect communication, and scaling plummets.

I really do not see Ethernet being a real limiting factor here either.
If you want to see the effect on overall system performance, look at the bottom of the out file for the "Job Information" and the difference between "Total CPU time" and "Total wall clock time". If everything is nicely balanced, these two should be within seconds, or possibly minutes, of each other for simulations that run under 24 hours.

I may get some flak for this one because not all DIMM slots would be populated, but consider a memory upgrade path. If the motherboard only has 8 slots, you may want to switch from an 8 x 4 GB to a 4 x 8 GB population so you can easily double your memory later while keeping it matched. If not, you will end up with unmatched memory if you only swap half the sticks, or you will need to replace them all. Some motherboards slow the memory down in unmatched-rank configurations.

One last suggestion, just to make sure: do not use the Hyper-Threading logical cores, use only the physical cores. I am also not sure whether these "PCs" you are getting are single- or dual-socket Xeon boards. If you leave Hyper-Threading enabled, ensure that the processes are pinned to the physical cores rather than relying on the Windows scheduler to do the job correctly.

July 5, 2013, 06:41   #12
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
Quote:
Originally Posted by mjgraf
I may get some flack for this one because all DIMMs are not populated, but consider a memory upgrade path. If the motherboard only has 8 slots, you may want to switch from the 8x4GB to a 4x8GB population to allow you to easily double your memory later to ensure matched memory. If not, you will have unmatched memory if you only swap half the sticks or you will need to replace them all. Some motherboards slow down the memory in unmatched RANK situations.
Only using 4 DIMMs would roughly halve the performance of the system, as only half of the memory channels would be active, so I don't see that as an option at all. According to this:
http://globalsp.ts.fujitsu.com/dmsp/...ance-ww-en.pdf
- memory bandwidth can be maintained with 2 DIMMs per memory channel, so it is possible to add more DIMMs later. Isn't it possible to get new DIMMs that are matched to the existing ones?

Quote:
Originally Posted by mjgraf
One last suggestion just to make sure, do not use the Hyperthreads, use only the physical cores. So not sure if these "PCs" you are getting are single or dual socket Xeon boards.
The system is dual socket, thus: 2 PCs x 2 CPUs x 8 cores = 32-core system.
I will switch Hyper-Threading off.

July 5, 2013, 13:12   #13
Senior Member
mjgraf
Join Date: Dec 2009
Posts: 131
I never rely on synthetic benchmarks, as they are not application specific.
My previous work has shown that using NUMA properly, with balanced and pinned threads, typically gives the best performance. This means that each thread and that thread's memory footprint stay on the same processor and memory controller and do not exceed the memory attached to that socket. Once you go interleaved across the interconnect or into remote memory, performance varies but memory bandwidth goes down.

This is a good white paper I was able to find.
ftp://ftp.dell.com/Manuals/all-produ...rs12_en-us.pdf

Last edited by mjgraf; July 5, 2013 at 21:09.

July 6, 2013, 00:01   #14
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Any reason you are not considering the 4 quad-core machines I recommended earlier? Going over 4 cores per CPU is not going to net you much gain in CFX, as memory bandwidth is usually the bottleneck. I have run simulations with CFX where 4, 5 and 6 cores all gave the same performance (and that was with 2133 MHz memory), so going to 8 cores per CPU could be quite a waste of cores, depending on your memory bandwidth needs.

July 8, 2013, 11:15   #15
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
Quote:
Originally Posted by evcelica
Any reason you are not considering the 4 quad core machines I recommended earlier? gong over 4 cores per CPU is not going to net you much gain in CFX as the memory bandwidth is usually the bottleneck. I have run simulations with CFX where 4,5 and 6 cores all give the same performance, (and that is with 2133MHz memory ) so going to 8 cores per CPU would be quite a waste of cores depending on your memory bandwidth needs.
What system did you test for 4, 5 and 6 core performance?

The performance curve I posted in the first post of this thread shows that performance does increase from 4 to 8 cores, though not as much as from 2 to 4 cores. The numbers behind the graph can be used to calculate the scaling efficiency (actual speed-up relative to the ideal from doubling the core count); a small sketch of this calculation follows the conclusions below:
From 2 to 4 cores: 91%
From 4 to 8 cores: 68%
From 8 to 16 cores: 92%
From 16 to 32 cores: 82%

Conclusions based on the numbers above:
From 2 to 4 cores: good scaling – likely because 4 memory channels are available in both cases.
From 4 to 8 cores: poor scaling – likely because 8 cores have to share 4 memory channels.
From 8 to 16 cores: good scaling – likely because the motherboard's CPU-to-CPU interface is fast enough and the number of memory channels per CPU does not decrease.
From 16 to 32 cores: medium scaling – likely because the node interconnect poses some limitation.
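
For reference, this is how I computed the efficiencies (a small sketch; the speed-up values are approximate readings from the attached graph, chosen only so that they reproduce the percentages above, not the exact benchmark numbers):

Code:
# Scaling efficiency when doubling the core count:
#   efficiency = speedup(2n) / speedup(n) / 2
# The speed-up values are approximate readings from the attached graph.
speedup = {2: 1.00, 4: 1.82, 8: 2.48, 16: 4.56, 32: 7.47}

cores = sorted(speedup)
for lo, hi in zip(cores, cores[1:]):
    eff = speedup[hi] / speedup[lo] / (hi / lo)
    print(f"{lo} -> {hi} cores: {eff:.0%}")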

The cost of an E5-2643-based system (4 nodes x 2 CPUs x 4 cores) would be a factor of 1.38 higher than an E5-2670 system (2 nodes x 2 CPUs x 8 cores), excluding the node interconnect. The nice thing about the 2-node system is that a network switch (which is costly) is not required. The wild card is to what extent the 4-node system would be slowed down by the larger number of nodes and the interconnect.

July 8, 2013, 21:41   #16
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
I was always under the impression that you could scale up to 4 nodes pretty well with just GigE rather than InfiniBand.
See slide number 11 here: http://www.hpcadvisorycouncil.com/pdf/CFX_Analysis.pdf
Of course this was with different machines, but I'm sure you could find some more benchmarks with more modern CPUs to compare node counts and interconnects.

I was using an i7-3930K clocked at 4.4 GHz with 2133 MHz RAM; the high CPU clock speed could be a reason I was memory-bandwidth limited (every problem scales differently, of course). I see the E5-2670 only runs at 2.6 GHz, so memory bandwidth may not be as big an issue there. The E5-2643 runs at 3.3 GHz (27% faster than the 2670), and for the same 32 cores you would have double the memory channels.
I would put my money on the 4-node system with GigE over the 2-node system with InfiniBand, although I have no personal experience benchmarking the two clusters. The thread I linked in my first post had a guy who seemed to have done a lot of real-world benchmarking and believed the E5-2643 was the best choice for CFX; you might want to contact him directly and get some more information.

July 9, 2013, 07:51   #17
Member
Kim Bindesbøll Andersen
Join Date: Oct 2010
Location: Aalborg, Denmark
Posts: 39
Quote:
Originally Posted by evcelica
I know I would put my money on the 4 node system with gigE over the 2 node system with infiniband, although I have no personal experience benchmarking the two clusters.
I don't know to what extent the conclusions in slide 11 of http://www.hpcadvisorycouncil.com/pdf/CFX_Analysis.pdf (AMD Opteron 2382, DDR2 800 MHz) can be applied to the present Xeon generation (E5-2600, DDR3 1600 MHz), as it has a significantly improved memory interface.
Comparing the two benchmarks below (for Fluent) indicates the difference in impact of the interconnect between the previous Xeon generation (X5570) and the present one (E5-2680):
Xeon X5570: Slide 9-10 http://www.hpcadvisorycouncil.com/pd...ysis_Intel.pdf
Xeon E5-2680: Slide 8 http://www.hpcadvisorycouncil.com/pd...el_E5_2680.pdf
Conclusion on the above: for the previous Xeon generation, 1 GbE is sufficient for 2-4 node systems. For the present Xeon generation, IB is the only interconnect that gives a significant parallel performance gain.

The slides below are also very interesting, even though they are for FLOW-3D and not CFX. I don't know to what extent the FLOW-3D conclusions are also valid for CFX.
Slide 8: http://www.hpcadvisorycouncil.com/pd..._2680_flow.pdf
Slide 8: http://www.hpcadvisorycouncil.com/pd...ling_Intel.pdf

It seems that 1 GbE is hopeless above 2 nodes, and that IB improves performance by 52% even at 2 nodes (in one of the cases).

July 9, 2013, 11:58   #18
Senior Member
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,166
Thanks for sharing those and your earlier benchmarks; they are of great value! Some of them really make GigE look bad! Those benchmarks are for 16 processes per node (dual 8-core), whereas we are talking about 8 per node if you went with the 4 x E5-2643 cluster, so it should not be as bottlenecked. Also, CFX seems to scale much better than Fluent over GigE; see below:
http://www.hpcadvisorycouncil.com/pd...e_Analysis.pdf
Slides 8 and 9 (Fluent vs. CFX scaling)
(Still not that well though)

You could consider getting the InfiniBand switch and the 4-node system; it will cost more, of course, but it will also be faster than the 2-node system. My simple analysis comparing the two is attached, using the same scaling efficiencies you showed earlier.

I assume the 27% CPU clock speed increase gives a 15% performance boost here, which works out to a ~25% overall improvement between the two clusters (this is very rudimentary, of course, and assumes a lot; a rough re-creation of the arithmetic is sketched below).
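
Something like the following, where the efficiency factors are illustrative guesses rather than the exact numbers in the attached image; the point is only to show the structure of the estimate:

Code:
# Rudimentary cluster comparison: 32 cores either way, different per-core
# speed and different efficiency losses.  All factors below are guesses.
def cluster_perf(cores, per_core_speed, intra_node_eff, inter_node_eff):
    return cores * per_core_speed * intra_node_eff * inter_node_eff

# 2 nodes x dual E5-2670 (2.6 GHz, 8 cores/CPU sharing 4 memory channels)
two_node  = cluster_perf(32, 1.00, 0.82, 0.95)
# 4 nodes x dual E5-2643 (3.3 GHz -> assume ~15% faster per core, only
# 4 cores/CPU so less channel sharing, but more nodes to keep in sync)
four_node = cluster_perf(32, 1.15, 0.95, 0.90)

print(f"4-node vs 2-node performance ratio: {four_node / two_node:.2f}")
Swap in your own measured efficiencies and the picture can easily shift, which is why I stress this is rudimentary.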

Using GigE you would probably lose at least that 25% to interconnect losses, so I agree that the two-node system would probably be better if you didn't want to spend the extra money on InfiniBand for the four-node cluster, though InfiniBand is pretty cheap compared to the licenses and computers, which I'm sure total $100K+. Then again, the extra computer cost may not be worth the relatively small performance increase.

Thanks again for sharing all your info; it's been informative.
Attached Images
File Type: png Untitled.png (8.2 KB, 31 views)
