CFD Online (www.cfd-online.com)
Forums > Software User Forums > ANSYS > FLUENT

2x E5-2697A performance issues

Old   January 20, 2023, 18:57
Default 2x E5-2697A performance issues
  #1
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
2x E5-2697A performance worse than Ryzen 5800x.

Following the suggestions I got here, I've upgraded my office Dell PT7810 with two E5-2697As, so I now have 32 cores and 64 threads. For comparison, I ran a 4e+6 element transient simulation on two computers, one with a 5800X and the other with the 2x E5-2697A. Both systems have 128 GB of DDR4 RAM, with the latter having ECC memory. For the 5800X I used only 14 threads, and for the Xeons 60. The task manager shows that the corresponding number of cores is being used, but the 5800X performs much better. Is it possible that we don't have the corresponding number of HPC packs? I'm writing this from home and cannot check right now, but it hadn't occurred to me before, since the task manager showed 100% load on 60 threads.

Last edited by otokemo; January 21, 2023 at 01:12.

Old   January 20, 2023, 19:04
Default
  #2
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loadings for either system, which makes it really hard to benchmark anything.

Fluent will tell you right away at launch if you don't have enough licenses. It will not calculate.

Old   January 21, 2023, 01:14
Default
  #3
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loadings for either system, which makes it really hard to benchmark anything.

Fluent will tell you right away at launch if you don't have enough licenses. It will not calculate.
My bad! It's 4e+6, not 4e+3.
What do you mean by non-ideal loading?

Old   January 21, 2023, 19:51
Default
  #4
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 307
Rep Power: 12
wkernkamp is on a distinguished road
For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because the speed is not compute-limited but memory-bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.

Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in slot 1 of each memory channel, as per:
Quote:
Originally Posted by Crowdion View Post
To get the best performance you should populate the RAM slots correctly. The IBM document in the link below explains the concept of balanced memory configurations

https://lenovopress.lenovo.com/lp050...-configuration

If the set-up of your 2x E5-2697A v4 system is fixed, you should see this machine be at least twice as fast as the 5800X.

Old   January 22, 2023, 02:17
Default
  #5
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by wkernkamp View Post
For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because the speed is not compute-limited but memory-bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.

Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in slot 1 of each memory channel.

If the set-up of your 2x E5-2697A v4 system is fixed, you should see this machine be at least twice as fast as the 5800X.
All 8 slots are filled with exactly the same 16 GB DIMMs.

As for the threads: whenever I used all the threads, I always got better performance, on any desktop configuration. Even with my 5800X, I'm using all threads.

Old   January 22, 2023, 02:38
Default
  #6
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can will also not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).

You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows the calculation down by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.

As mentioned, the E5-2697A v4 should run twice as fast because it supports quad-channel memory whereas a 5800X only supports dual channel. It has double the memory throughput.
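The channel-count argument can be put in rough numbers. Below is a back-of-the-envelope estimate of theoretical peak DDR4 bandwidth; the DIMM speeds (DDR4-2400 for the Xeons, DDR4-3200 for the Ryzen) are assumptions, so plug in your actual rated speeds, and keep in mind that achievable STREAM bandwidth sits well below these paper figures:

```python
def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s: channels x 8 bytes x transfer rate."""
    return channels * 8 * mt_per_s / 1000

# Dual E5-2697A v4: two sockets, quad-channel DDR4-2400 each (assumed speed).
xeon = 2 * peak_bw_gbs(channels=4, mt_per_s=2400)   # 153.6 GB/s total
# Ryzen 5800X: dual-channel DDR4-3200 (assumed speed).
ryzen = peak_bw_gbs(channels=2, mt_per_s=3200)      # 51.2 GB/s

print(f"Xeon box:  {xeon:.1f} GB/s")
print(f"Ryzen box: {ryzen:.1f} GB/s")
print(f"Ratio:     {xeon / ryzen:.1f}x")
```

On paper the dual-socket box has about 3x the bandwidth of the Ryzen; in practice contention, NUMA placement, and slower installed DIMMs shrink that ratio.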

Old   January 22, 2023, 03:00
Default
  #7
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can will also not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).

You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows the calculation down by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.

As mentioned, the E5-2697A v4 should run twice as fast because it supports quad-channel memory whereas a 5800X only supports dual channel. It has double the memory throughput.
Really interesting, thanks!
So either I use all 64 (which isn't advised) or I use 32. But it is also recommended to leave 1-2 cores free so they can handle read/write and other tasks? So if I select only 30 cores, is that still OK?

Old   January 22, 2023, 04:20
Default
  #8
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.

30 would do better than 60 because it would not have the synchronization issue. However, you should not be leaving cores free for other tasks, because you should not be doing other tasks! CFD is memory-bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it has to compete for memory bandwidth. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?

Old   January 22, 2023, 08:32
Default
  #9
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.

30 would do better than 60 because it would not have the synchronization issue. However, you should not be leaving cores free for other tasks, because you should not be doing other tasks! CFD is memory-bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it has to compete for memory bandwidth. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?
The recommendation for Ansys Mechanical is to leave 1-2 cores free so they can handle the read/write tasks. I've done that with 30 cores (leaving 2 free) and it reduced the calculation time. I thought the same applied to Fluent.

Old   January 23, 2023, 01:09
Default
  #10
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
Well, it would certainly have been important to mention that in your first post. If you are optimizing for reading/writing to a hard disk, then you have issues completely unrelated to how to do CFD: you have a problem of reading/writing to the hard disk.

In the vast majority of cases, very little writing to the hard disk is done in CFD. If the disk writing rate is significant, then you're likely writing TBs per hour, your hard disk is already full, and your CPUs are idle because they're not doing anything.

Also, leaving 1 or 2 cores free definitely doesn't apply to hyperthreaded cores.

Old   January 23, 2023, 05:49
Default
  #11
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,398
Rep Power: 46
flotus1 has a spectacular aura about
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining in as much detail as possible what your simulation is doing.
2) Equally important: how you measure run time, and what is included in that run time.
3) To keep things simple, I recommend disabling SMT/HT on both systems before doing further testing.
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So take your case and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system: run it on 1, 2, 4, 8, 16, and 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware". You can attach that as a text file to your next post. numactl may need to be installed first.
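If it helps, the scaling runs above can be scripted. This is a sketch, not a verified recipe: it assumes a batch journal file named run.jou and the common Fluent batch-launcher form fluent 3ddp -g -t<N> -i run.jou; adapt the solver mode, version, and file names to your setup.

```python
import subprocess
import time

def fluent_cmd(n_threads: int, journal: str = "run.jou") -> list[str]:
    """Build a Fluent batch command: 3D double precision, no GUI, N processes."""
    return ["fluent", "3ddp", "-g", f"-t{n_threads}", "-i", journal]

def run_scaling(thread_counts):
    """Run the same case at each thread count and record wall-clock seconds."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        subprocess.run(fluent_cmd(n), check=True)
        timings[n] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # Print the commands that would be issued for the Ryzen test series.
    for n in [1, 2, 4, 8]:
        print(" ".join(fluent_cmd(n)))
```

Call run_scaling([1, 2, 4, 8]) on the Ryzen and run_scaling([1, 2, 4, 8, 16, 32]) on the Xeon box, then post the resulting timings.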

Old   January 23, 2023, 07:53
Default
  #12
Senior Member
 
Lorenzo Galieti
Join Date: Mar 2018
Posts: 373
Rep Power: 12
LoGaL is on a distinguished road
TBH, for a 4-million-element mesh, I wouldn't be surprised if the much higher clock frequency of the 5800X balanced out the memory bandwidth deficit. Most probably the non-ECC memory of the Ryzen also runs at a higher clock speed (3600 vs 2933?). That should partially compensate for the lower channel count, shouldn't it?

I am not even sure there is a memory bandwidth problem with such a small mesh...

Old   January 23, 2023, 09:44
Default
  #13
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
Well, after learning that the mesh is not 40k cells, I am going with the assumption that the purpose of this exercise is to benchmark two systems, so I would expect the CFD simulation being used as a benchmark to actually be capable of stressing the system. If 4 million cells is not enough, then it is a simple matter to make a bigger mesh that is.

The scaling tests with 1, 2, 4 cores on both systems will tell you where the limitations arise: whether the CPU is the bottleneck, the memory is the bottleneck, or, in this case apparently, the hard disk is the bottleneck.

I also highly recommend turning off HT on every machine. Whatever is or is not done, please at least stop doing the things that don't make sense: don't launch calculations using 14 threads on a Ryzen or 60 threads on an E5-2697A. And I'm not even going to bother to ask whether we are running base clock speeds or whether any of the multipliers have been adjusted. Just run the test.
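Once the wall-clock times are logged, speedup and parallel efficiency make the bottleneck easy to read off: near-linear speedup means the run is compute-limited, while an early efficiency plateau points at memory bandwidth. A minimal helper (the timings below are made up purely for illustration):

```python
def scaling_table(timings):
    """Compute speedup and parallel efficiency from wall-clock timings.

    timings: {n_threads: wall_clock_seconds}; must include a 1-thread run.
    """
    t1 = timings[1]
    return {n: {"speedup": t1 / t, "efficiency": t1 / (n * t)}
            for n, t in sorted(timings.items())}

# Made-up illustrative numbers, not real benchmark data:
example = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0}
for n, row in scaling_table(example).items():
    print(f"{n:2d} threads: speedup {row['speedup']:.2f}, "
          f"efficiency {row['efficiency']:.0%}")
```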

Old   January 23, 2023, 10:05
Default
  #14
Senior Member
 
Lorenzo Galieti
Join Date: Mar 2018
Posts: 373
Rep Power: 12
LoGaL is on a distinguished road
OK, let's see then. Though it really feels like a very talented 100 m sprinter (the Ryzen) is being compared to a marathon runner in a 200 m race. Maybe it's not their forte, but the 100 m sprinter wins.

Old   January 23, 2023, 10:36
Default
  #15
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining in as much detail as possible what your simulation is doing.
2) Equally important: how you measure run time, and what is included in that run time.
3) To keep things simple, I recommend disabling SMT/HT on both systems before doing further testing.
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So take your case and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system: run it on 1, 2, 4, 8, 16, and 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware". You can attach that as a text file to your next post. numactl may need to be installed first.
1) I was doing a transient simulation with LES of the flow over a golf ball.
2) I measured it with the total wall-clock time.
3) Noted.

4) Here's the test I made, but with a steady flow over a wing: 2.5e+6 nodes and 6.68e+6 elements, the k-omega turbulence model and the Coupled solver.

Here everything looks fine and normal, as it should be. I think there was some problem with the golf ball mesh and the LES model.
Attached Images
File Type: png Screenshot 2023-01-23 193509.png (22.7 KB, 14 views)
File Type: png Screenshot 2023-01-23 193608.png (29.2 KB, 20 views)

Last edited by otokemo; January 24, 2023 at 01:36. Reason: The number of elements is 6.68e+6, nodes - 2.5e+6

Old   January 23, 2023, 11:11
Default
  #16
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
And that's pretty much a textbook result. Your machines are working fine. The Ryzen runs faster at the same thread count, almost exactly according to the ratio of the core clock speeds, so for this particular simulation you are not bandwidth limited. Once you get grid sizes that completely fill up the RAM, you will be, and in that scenario the Ryzen at maximum throughput should be slower according to the bandwidth ratio, roughly half, but I don't know your actual bandwidth, so I can't tell you what the actual ratio will be.

And the results for the Ryzen at 12 and 16 threads and the E5-2697A at 48 and 64 aren't a surprise either. Hyperthreading isn't much better, and sometimes it hurts, but it certainly cost you more money in extra licenses!

I guess now would be a great time to go back and debug your LES model to figure out why it is running so slowly. It's not the machine, but it could be some factor that compounds between the model and the machine.
