Updating CFD server E5-2697AV4 to something faster (x2/x3 the speed)

#41 | andy_ (Senior Member) | October 11, 2024, 04:31
Quote:
Originally Posted by MFGT
The old server ran with a 2600 MHz maximum clock speed, which is 25% less. So I don't understand why it's the same performance. We use the old 8x16 GB 2133 MHz RAM and, according to our IT, it is placed in the correct sockets. A further carry-over is the data HDD.
A plot of time vs number of cores (1, 2, 4, 8, 16, 32) with a decent sized CFD problem (e.g. a 2M cell mesh like the pinned OpenFOAM benchmark) will let you see when memory access starts to become the bottleneck. This may provide a clue to what's wrong. For example, the single core result may not line up with the benchmarks of others, the efficiency may start plummeting at low core counts (memory in the wrong sockets), etc.

What tends to matter most for implicit CFD simulations is the number of memory channels. Your new system has either 8 or 16 depending on whether you have 1 or 2 processors. Your old dual Xeon system has 8 channels. So if your new system has only 1 processor I would expect implicit CFD performance to be in the same ballpark with the same memory. But information on parallel efficiency would help with being sure about where the bottleneck lies.

PS If you are only using 8 memory chips with 2 processors then that is very likely the problem, since there are 16 memory channels. 8 more memory chips are likely to double the performance for runs using higher numbers of cores.
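To illustrate the kind of check being suggested here, a minimal sketch that turns a set of measured wall times into speedup and parallel efficiency figures; the times below are placeholders, not measurements from either machine.

Code:
# Turn measured wall times (cores -> seconds) into speedup and parallel efficiency.
# The numbers below are placeholders; replace them with your own benchmark timings.
wall_times = {1: 4000.0, 2: 2050.0, 4: 1060.0, 8: 560.0, 16: 330.0, 32: 260.0}

t_serial = wall_times[1]
print(f"{'cores':>5} {'time [s]':>9} {'speedup':>8} {'efficiency':>11}")
for cores, t in sorted(wall_times.items()):
    speedup = t_serial / t
    efficiency = speedup / cores
    print(f"{cores:>5} {t:>9.1f} {speedup:>8.2f} {efficiency:>10.0%}")
# Efficiency collapsing well before the core count reaches the number of
# memory channels is the signature of a memory configuration problem.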

#42 | MFGT (Senior Member) | October 11, 2024, 06:44
Thanks for your input Andy. Memory size is no issue, it's probably the speed.

And new RAM wasn't in the budget, unfortunately.

So we are working on the DIMM population order at the moment.
According to the manual (https://www.hpe.com/psnow/doc/a00038346enw) we have placed the 8 DIMMs in all white slots of channels C, D, G and H of both processors, so they should run in quad-channel mode. There is just this note at 8 DIMMs ***:
*** Recommended only with processors that have 128 MB L3 cache or less.
Well, the 7F52 has 256 MB; is this an issue? Should we indeed use more than 8 DIMMs then?

Previously we had the 8 DIMMs in two channels per processor. Strangely enough, that was faster?

Edit: found the error, 1 DIMM wasn't clicked in correctly.
[Attached image: DIMM.png]
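As a side note, one quick way to confirm what the system actually detects (and which would have caught the unseated DIMM) is to list the memory modules the OS sees; a minimal sketch assuming a Windows host, as mentioned later in the thread, using the standard Win32_PhysicalMemory data via wmic.

Code:
# List the memory modules the OS detects, with slot, capacity and configured speed.
# Assumes Windows (wmic); on Linux, `dmidecode -t memory` provides the same information.
import subprocess

out = subprocess.run(
    ["wmic", "memorychip", "get", "DeviceLocator,Capacity,ConfiguredClockSpeed"],
    capture_output=True, text=True, check=True,
).stdout

lines = [line.rstrip() for line in out.splitlines() if line.strip()]
for line in lines:          # header row followed by one row per detected module
    print(line)
print(f"Detected DIMMs: {len(lines) - 1}")
# A module that is not seated correctly simply does not show up, so the count
# should equal the number of DIMMs physically installed.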

#43 | andy_ (Senior Member) | October 11, 2024, 07:27
As far as I can see your system is behaving as expected when populated with insufficient memory chips. When budget allows, buying the extra memory chips will likely double the performance when using higher numbers of cores with an implicit CFD problem of reasonable size, and hence bring the performance in line with what you had hoped.

I have no experience with configuring too few memory chips because it is not something I have ever considered doing, given that it substantially reduces the effective number of cores available for largish CFD runs. If you run a reasonably large implicit CFD benchmark on 1, 2, 4, 8, 16, 32 and 64 cores and plot the parallel efficiency, it will likely show there is currently little to be gained by using more than around 16 cores. If the machine is also used for other types of simulations, these may run with better parallel efficiency.

#44 | andy_ (Senior Member) | October 11, 2024, 07:33
Quote:
Originally Posted by MFGT
Edit: found the error, 1 DIMM wasn't clicked in correctly.
Has this given you the expected overall performance, or has it brought the results from populating different slots in different configurations in line with expectations?

#45 | MFGT (Senior Member) | October 11, 2024, 08:28
Compared to the old server we now have a speedup of +64%
(tested with a flowbench simulation).
Both configurations used 60 cores with HT on.

30 cores without HT gives a speedup of +36%, the wrong DIMM config with 60 cores gave only +10%, and the one with only 7 effective DIMMs even came in at -14%.

#46 | andy_ (Senior Member) | October 11, 2024, 10:24
Quote:
Originally Posted by MFGT
30 cores without HT gives a speedup of +36%, the wrong DIMM config with 60 cores gave only +10%, and the one with only 7 effective DIMMs even came in at -14%.
Thanks, but have you got any information on parallel efficiency versus cores using a single reasonably large implicit CFD test case? I am asking because it would be interesting to know how much performance is lost by not fully populating the memory slots. I had assumed that when memory was the bottleneck it would be pretty much linear, but if you can configure the CPU-to-memory connections perhaps this is not the case?

#47 | wkernkamp (Senior Member) | October 11, 2024, 17:02
Quote:
Originally Posted by andy_
Thanks, but have you got any information on parallel efficiency versus cores using a single reasonably large implicit CFD test case? I am asking because it would be interesting to know how much performance is lost by not fully populating the memory slots. I had assumed that when memory was the bottleneck it would be pretty much linear, but if you can configure the CPU-to-memory connections perhaps this is not the case?
You can run the benchmark with the 8 memory channels and then repeat later with 16 channels. You will see that single core performance is essentially unchanged. However, as more cores come into use, the memory bottleneck will make the 8 channel config fall further and further behind.
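To make that concrete, a toy model with made-up numbers (not measurements): assume each core of a memory-bound implicit solver demands a fixed bandwidth and the populated channels cap the total, so the achievable speedup is limited by whichever runs out first.

Code:
# Toy model of bandwidth-limited scaling; the two constants are illustrative guesses.
BW_PER_CHANNEL = 17.0  # GB/s per DDR4-2133 channel (theoretical peak)
BW_PER_CORE = 8.0      # GB/s demanded per core by a memory-bound solver (assumed)

def model_speedup(cores: int, channels: int) -> float:
    """Speedup relative to one core, capped by the available memory bandwidth."""
    bandwidth_cap = channels * BW_PER_CHANNEL / BW_PER_CORE
    return min(cores, bandwidth_cap)

for channels in (8, 16):
    curve = {n: round(model_speedup(n, channels), 1) for n in (1, 2, 4, 8, 16, 32, 64)}
    print(f"{channels} channels: {curve}")
# Single-core performance is identical in both configurations, while the
# 8-channel curve flattens at roughly half the level of the 16-channel one.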

#48 | andy_ (Senior Member) | October 11, 2024, 18:20
Quote:
Originally Posted by wkernkamp
You can run the benchmark with the 8 memory channels and then repeat later with 16 channels. You will see that single core performance is essentially unchanged. However, as more cores come into use, the memory bottleneck will make the 8 channel config fall further and further behind.
Yes, but my question was about how much. I had expected the performance to be essentially halved with 8 instead of 16 DIMMs once memory access becomes the bottleneck for a largish model and the number of cores exceeds the number of memory channels. Yet the OP is reporting a 64% increase. I am not familiar with his benchmark, which I think is an average of a range of different simulations (?), and so if some of them are explicit and/or small enough to parallelise more efficiently at higher core counts, that may be the reason. If so, he is unlikely to see an equivalent increase when running his own large implicit problems. If not, then I would like to know how, using the same DIMMs with the same effective number of memory channels (assuming they are the same?), a simulation that is strongly limited by memory access can run at significantly different speeds.

In this case we have a lot of unknowns, and it may not be possible to sort out quite what is going on without a widely used and understood CFD benchmark. The one pinned at the top of this forum has lots of results, although I had to fiddle a bit to get it to run, which I guess is going to put people off. The NAS parallel benchmarks were really useful for understanding this sort of thing, but they didn't seem to catch on, possibly because they produced a range of plots rather than a single number, and later versions became rather supercomputer orientated.

#49 | MFGT (Senior Member) | October 12, 2024, 10:37
Hi,

my test case is a relatively small flowbench simulation with up to 600,000 cells. I ran the exact same case several times with different configurations. When calculating the speedup, the result is very similar whether I consider a) the overall runtime or b) the average walltime per timestep.

By RAM limitation I don't mean the overall amount (the case needs less than 16 GB); it's rather the slow 2133 MHz RAM, where up to 3200 MHz is supported now.

And I am sorry, I won't do any benchmarks with different software, as I can't spend time on non-project-related work in my job.

But I will rerun a full cycle simulation and compare the results as well. Here we are talking about cases with up to 1.5 million cells, including detailed chemistry etc.

#50 | andy_ (Senior Member) | October 13, 2024, 05:24
Thanks for the clarification. It will be interesting to see how much the 64% changes with a larger, more representative simulation, but without more information it looks like we will have to speculate about what might be going on and what will or will not bring improvements.

#51 | MFGT (Senior Member) | October 15, 2024, 05:44
I have some numbers, but of course I didn't rerun full cycle simulations with various numbers of cores. My impression is that the average walltime per timestep is reduced by 34-37%, which corresponds to a speedup of nearly 52-58% for full cycle simulations.

We are happy with that, considering we only made the following changes:
  • more modern CPU
  • increase of base clock speed: 2.6 GHz -> 3.5 GHz
  • more memory channels, which we are not using yet (quad -> octa per CPU)
Constants:
  • same number of cores/threads
  • same RAM 8x16 GB = 128 GB (2133 MHz)
Since the PassMark benchmark suggested +92% performance, I expect to reach those values if we also upgrade the RAM (the old system allowed 2400 MHz, the new one can utilize 3200 MHz).
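For reference, the conversion from a per-timestep walltime reduction to the equivalent speedup used above is 1/(1 - reduction) - 1; a quick check:

Code:
# Convert a walltime reduction per timestep into the equivalent speedup.
for reduction in (0.34, 0.37):
    speedup = 1.0 / (1.0 - reduction) - 1.0
    print(f"{reduction:.0%} shorter timesteps -> +{speedup:.0%} speedup")
# 34% -> +52% and 37% -> +59%, roughly the 52-58% range quoted above.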

#52 | MFGT (Senior Member) | October 15, 2024, 05:55
I ran a small test, 20°CA of an engine simulation, and analyzed the walltimes. HT was activated all the time.

See the figure below.

We see an improvement until 32 cores; however, HT works very well with this machine, as 60 cores give another improvement of 35% compared to 32 cores.
[Attached image: Speedup.png]

#53 | andy_ (Senior Member) | October 15, 2024, 10:19
That does not seem to be scaling as I would expect for a typical distributed memory implicit CFD code with a reasonably large grid running on a shared memory machine (e.g. the pinned OpenFOAM benchmark). Do you know how the solver in your program works and how it is parallelised? Can you provide a link to it, because my googling of "flowbench simulation" hasn't turned up anything obvious.

(And we are back to wanting to run something like the NAS parallel benchmarks in order to understand the performance).

#54 | wkernkamp (Senior Member) | October 15, 2024, 16:01
Quote:
Originally Posted by MFGT
I ran a small test, 20°CA of an engine simulation, and analyzed the walltimes. HT was activated all the time.

See the figure below.

We see an improvement until 32 cores; however, HT works very well with this machine, as 60 cores give another improvement of 35% compared to 32 cores.

I agree with andy_. If you configure your memory right, with more DIMMs and the appropriate speed, I would think you will get linear performance up to 16 cores. Then it will fall off, and may even go down when you exceed 32 cores.

#55 | MFGT (Senior Member) | October 16, 2024, 03:15
Quote:
Originally Posted by andy_
That does not seem to be scaling as I would expect for a typical distributed memory implicit CFD code with a reasonably large grid running on a shared memory machine (e.g. the pinned OpenFOAM benchmark). Do you know how the solver in your program works and how it is parallelised? Can you provide a link to it, because my googling of "flowbench simulation" hasn't turned up anything obvious.

(And we are back to wanting to run something like the NAS parallel benchmarks in order to understand the performance).
I am using CONVERGE CFD, where I perform engine simulations (injection and combustion) or simple flowbench simulations to optimize port performance.

Since I am only the user, I cannot install OpenFOAM (I have never worked with it) on my own and do some benchmarking there, even if that would be very interesting. Furthermore, we are on Windows Server 2016, not Linux.

If you can guide me on how to run that benchmark I may be able to convince our IT to install and configure OpenFOAM, although I don't really care about performance against other CFD servers.

We see a notable improvement over our old server, which may be further improved with e.g. 16x8 GB 3200 MHz DIMMs.
I don't need more RAM, since I have never exceeded 100 GB of usage.
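For what it's worth, a minimal sketch of how a core-count sweep of a standard OpenFOAM case could be scripted once OpenFOAM is available (e.g. under WSL or on a spare Linux box); the case path, solver and core counts are placeholders, and the commands follow the ordinary decomposePar/mpirun workflow rather than the exact pinned benchmark scripts.

Code:
# Sweep core counts on an already-meshed OpenFOAM case and record wall times.
# CASE_DIR, SOLVER and CORE_COUNTS are placeholders; adapt them to the actual case.
import re
import subprocess
import time
from pathlib import Path

CASE_DIR = Path("~/benchmarks/motorBike").expanduser()
SOLVER = "simpleFoam"
CORE_COUNTS = [1, 2, 4, 8, 16, 32]

def set_subdomains(n: int) -> None:
    """Rewrite numberOfSubdomains in system/decomposeParDict."""
    dict_file = CASE_DIR / "system" / "decomposeParDict"
    text = re.sub(r"numberOfSubdomains\s+\d+;", f"numberOfSubdomains {n};",
                  dict_file.read_text())
    dict_file.write_text(text)

results = {}
for n in CORE_COUNTS:
    if n > 1:
        set_subdomains(n)
        subprocess.run(["decomposePar", "-force"], cwd=CASE_DIR, check=True)
    cmd = [SOLVER] if n == 1 else ["mpirun", "-np", str(n), SOLVER, "-parallel"]
    start = time.perf_counter()
    subprocess.run(cmd, cwd=CASE_DIR, check=True, stdout=subprocess.DEVNULL)
    results[n] = time.perf_counter() - start
    print(f"{n:>3} cores: {results[n]:7.1f} s, speedup {results[1] / results[n]:.2f}")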

#56 | andy_ (Senior Member) | October 16, 2024, 04:54
Quote:
Originally Posted by MFGT
I am using CONVERGE CFD, where I perform engine simulations (injection and combustion) or simple flowbench simulations to optimize port performance.

Since I am only the user, I cannot install OpenFOAM (I have never worked with it) on my own and do some benchmarking there, even if that would be very interesting. Furthermore, we are on Windows Server 2016, not Linux.

If you can guide me on how to run that benchmark I may be able to convince our IT to install and configure OpenFOAM, although I don't really care about performance against other CFD servers.

We see a notable improvement over our old server, which may be further improved with e.g. 16x8 GB 3200 MHz DIMMs.
I don't need more RAM, since I have never exceeded 100 GB of usage.
Does your test case involve an adaptive and/or moving grid?

If you are using a single commercial code then running it with representative models is likely to be the most relevant benchmark. If you use other codes to check how efficiently your current commercial code is implemented, then they will need to perform the same simulation, or at least the same type of simulation.

I am not familiar with the details of the CONVERGE code, but given the size of the efficiency improvements reported for version 4 it is likely still in the process of maturing. This is perhaps to be expected for a newish code (assuming it is newish) and should improve with time if the company is competently run and profitable.

#57 | MFGT (Senior Member) | October 16, 2024, 05:41
Quote:
Originally Posted by andy_
Does your test case involve an adaptive and/or moving grid?

If you are using a single commercial code then running it with representative models is likely to be the most relevant benchmark. If you use other codes to check how efficiently your current commercial code is implemented, then they will need to perform the same simulation, or at least the same type of simulation.

I am not familiar with the details of the CONVERGE code, but given the size of the efficiency improvements reported for version 4 it is likely still in the process of maturing. This is perhaps to be expected for a newish code (assuming it is newish) and should improve with time if the company is competently run and profitable.
Yes, the results above involve moving surfaces and an adaptive grid. It was a 20°CA section of an engine simulation. We won't switch to another code because we are really happy with it.

The CONVERGE code isn't that new (more than 15/20 years old) and should be more than profitable (80% of engine developers worldwide use it).

I had a look at the results again and have to make some corrections. I was using the reported runtime for the speedup calculations, but a closer look showed that at short runtimes of 4 to 80 min (which I had for the core variations) the impact of simulation setup and writing output becomes disproportionate. So when calculating the speedup from the reported time for solving the transport equations only, the profile is the same, but higher.

E.g.:
16 cores: 12.1x speedup
32 cores: 22.1x speedup
48 cores: 22.2x speedup
60 cores: 29.7x speedup
64 cores: 28.2x speedup
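Translating those figures into parallel efficiency (speedup divided by core count):

Code:
# Parallel efficiency = speedup / cores, using the figures quoted above.
# The 48/60/64-core runs rely on SMT threads beyond the 32 physical cores,
# so their per-"core" efficiency is not directly comparable to the lower counts.
speedups = {16: 12.1, 32: 22.1, 48: 22.2, 60: 29.7, 64: 28.2}
for cores, s in speedups.items():
    print(f"{cores:>2} cores: speedup {s:5.1f}, efficiency {s / cores:.0%}")
# 16 cores: 76%, 32 cores: 69%, dropping further once SMT threads are used.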
