CFD Online (www.cfd-online.com)
Forums > Software User Forums > ANSYS > FLUENT

2x E5-2697A performance issues

Old   January 20, 2023, 18:57
Default 2x E5-2697A performance issues
  #1
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
2x E5-2697A performance worse than Ryzen 5800x.

Following the suggestions I got here, I've upgraded my office Dell PT7810 with two E5-2697As, so I now have 32 cores and 64 threads. For comparison, I ran a 4e+6 element transient simulation on two computers, one with a 5800X and the other with the 2x E5-2697A. Both systems have 128 GB of DDR4 RAM, with the latter having ECC memory. For the 5800X I used only 14 threads, and for the Xeons 60. The task manager shows that the corresponding number of cores is being used, but the 5800X performs much better. Is it possible that we don't have the corresponding number of HPC packs? I'm writing this from home and cannot check right now, but it hadn't occurred to me before, since the task manager showed 100% load on 60 threads.

Last edited by otokemo; January 21, 2023 at 01:12.

Old   January 20, 2023, 19:04
Default
  #2
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loadings for either system, which makes it really hard to benchmark anything.

Fluent will tell you right away at launch if you don't have enough licenses. It will not calculate.

Old   January 21, 2023, 01:14
Default
  #3
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loadings for either system, which makes it really hard to benchmark anything.

Fluent will tell you right away at launch if you don't have enough licenses. It will not calculate.
My bad! It's 4e+6, not 4e+3.
What do you mean by non-ideal loading?

Old   January 21, 2023, 19:51
Default
  #4
Senior Member
 
Will Kernkamp
Join Date: Jun 2014
Posts: 307
Rep Power: 12
wkernkamp is on a distinguished road
For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because the speed is not compute-limited but memory-bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.

Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in slot 1 of each memory channel, as per:
Quote:
Originally Posted by Crowdion View Post
To get the best performance you should populate the RAM slots correctly. The IBM document in the link below explains the concept of balanced memory configurations

https://lenovopress.lenovo.com/lp050...-configuration

If the set-up of your 2x E5-2697A v4 system is fixed, you should see this machine be at least twice as fast as the 5800X.

Old   January 22, 2023, 02:17
Default
  #5
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by wkernkamp View Post
For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because the speed is not compute-limited but memory-bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.

Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in slot 1 of each memory channel.

If the set-up of your 2x E5-2697A v4 system is fixed, you should see this machine be at least twice as fast as the 5800X.
All 8 slots are filled with exactly the same 16 GB DIMMs.

As for the threads: whenever I used all the threads, I always got better performance, on any desktop configuration. Even with my 5800X, I'm using all threads.

Old   January 22, 2023, 02:38
Default
  #6
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can will also not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).

You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows the calculation down by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.

As mentioned, the E5-2697A v4 should run twice as fast because it supports quad-channel memory whereas a 5800X only supports dual channel. It has double the memory throughput.
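The channel-count argument can be put in rough numbers. Below is a back-of-the-envelope estimate of theoretical peak DDR4 bandwidth; the DIMM speeds (DDR4-2400 for the Xeons, DDR4-3200 for the Ryzen) are assumptions, so plug in your actual rated speeds, and keep in mind that achievable STREAM bandwidth sits well below these paper figures:

```python
def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    """Theoretical peak bandwidth in GB/s: channels x 8 bytes x transfer rate."""
    return channels * 8 * mt_per_s / 1000

# Dual E5-2697A v4: two sockets, quad-channel DDR4-2400 each (assumed speed).
xeon = 2 * peak_bw_gbs(channels=4, mt_per_s=2400)   # 153.6 GB/s total
# Ryzen 5800X: dual-channel DDR4-3200 (assumed speed).
ryzen = peak_bw_gbs(channels=2, mt_per_s=3200)      # 51.2 GB/s

print(f"Xeon box:  {xeon:.1f} GB/s")
print(f"Ryzen box: {ryzen:.1f} GB/s")
print(f"Ratio:     {xeon / ryzen:.1f}x")
```

On paper the dual-socket box has about 3x the bandwidth of the Ryzen; in practice contention, NUMA placement, and slower installed DIMMs shrink that ratio.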

Old   January 22, 2023, 03:00
Default
  #7
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can will also not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).

You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows the calculation down by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.

As mentioned, the E5-2697A v4 should run twice as fast because it supports quad-channel memory whereas a 5800X only supports dual channel. It has double the memory throughput.
Really interesting, thanks!
So either I use all 64 (which isn't advised) or I use 32. But it is also recommended to leave 1-2 cores free so they can handle read/write and other tasks? So if I select only 30 cores, is that still OK?

Old   January 22, 2023, 04:20
Default
  #8
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.

30 would do better than 60 because it would not have the synchronization issue. However, you should not be leaving cores free for other tasks, because you should not be doing other tasks! CFD is memory-bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it has to compete for memory bandwidth. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?

Old   January 22, 2023, 08:32
Default
  #9
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by LuckyTran View Post
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.

30 would do better than 60 because it would not have the synchronization issue. However, you should not be leaving cores free for other tasks, because you should not be doing other tasks! CFD is memory-bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it has to compete for memory bandwidth. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?
The recommendation for Ansys Mechanical is to leave 1-2 cores free so they can handle the read/write tasks. I've done that with 30 cores (leaving 2 free) and it reduced the calculation time. I thought the same applied to Fluent.

Old   January 23, 2023, 01:09
Default
  #10
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
Well, it would certainly have been important to mention that in your first post. If you are optimizing for reading/writing to a hard disk, then you have issues completely unrelated to how to do CFD: you have a problem of reading/writing to the hard disk.

In the vast majority of cases, very little writing to the hard disk is done in CFD. If the disk writing rate is significant, then you're likely writing TBs per hour, your hard disk is already full, and your CPUs are idle because they're not doing anything.

Also, leaving 1 or 2 cores free definitely doesn't apply to hyperthreaded cores.

Old   January 23, 2023, 05:49
Default
  #11
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,398
Rep Power: 46
flotus1 has a spectacular aura about
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining in as much detail as possible what your simulation is doing.
2) Equally important: how you measure run time, and what is included in that run time.
3) To keep things simple, I recommend disabling SMT/HT on both systems before doing further testing.
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So take your case and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system: run it on 1, 2, 4, 8, 16, and 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware". You can attach that as a text file to your next post. numactl may need to be installed first.
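If it helps, the scaling runs above can be scripted. This is a sketch, not a verified recipe: it assumes a batch journal file named run.jou and the common Fluent batch-launcher form fluent 3ddp -g -t<N> -i run.jou; adapt the solver mode, version, and file names to your setup.

```python
import subprocess
import time

def fluent_cmd(n_threads: int, journal: str = "run.jou") -> list[str]:
    """Build a Fluent batch command: 3D double precision, no GUI, N processes."""
    return ["fluent", "3ddp", "-g", f"-t{n_threads}", "-i", journal]

def run_scaling(thread_counts):
    """Run the same case at each thread count and record wall-clock seconds."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        subprocess.run(fluent_cmd(n), check=True)
        timings[n] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # Print the commands that would be issued for the Ryzen test series.
    for n in [1, 2, 4, 8]:
        print(" ".join(fluent_cmd(n)))
```

Call run_scaling([1, 2, 4, 8]) on the Ryzen and run_scaling([1, 2, 4, 8, 16, 32]) on the Xeon box, then post the resulting timings.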

Old   January 23, 2023, 07:53
Default
  #12
Senior Member
 
Lorenzo Galieti
Join Date: Mar 2018
Posts: 373
Rep Power: 12
LoGaL is on a distinguished road
TBH, for a 4-million-element mesh, I wouldn't be surprised if the much higher clock frequency of the 5800X balanced out the memory bandwidth deficit. Most probably the non-ECC memory of the Ryzen also runs at a higher clock speed (3600 vs 2933?). That should partially compensate for the lower channel count, shouldn't it?

I am not even sure there is a memory bandwidth problem with such a small mesh...

Old   January 23, 2023, 09:44
Default
  #13
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
Well, after learning that the mesh is not 40k cells, I am going with the assumption that the purpose of this exercise is to benchmark two systems, so I would expect the CFD simulation being used as a benchmark to actually be capable of stressing the system. If 4 million cells is not enough, then it is a simple matter to make a bigger mesh that is.

The scaling tests with 1, 2, 4 cores on both systems will tell you where the limitations arise: whether the CPU is the bottleneck, the memory is the bottleneck, or, in this case apparently, the hard disk is the bottleneck.

I also highly recommend turning off HT on every machine. Whatever is or is not done, please at least stop doing the things that don't make sense: don't launch calculations using 14 threads on a Ryzen or 60 threads on an E5-2697A. And I'm not even going to bother to ask whether we are running base clock speeds or whether any of the multipliers have been adjusted. Just run the test.
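Once the wall-clock times are logged, speedup and parallel efficiency make the bottleneck easy to read off: near-linear speedup means the run is compute-limited, while an early efficiency plateau points at memory bandwidth. A minimal helper (the timings below are made up purely for illustration):

```python
def scaling_table(timings):
    """Compute speedup and parallel efficiency from wall-clock timings.

    timings: {n_threads: wall_clock_seconds}; must include a 1-thread run.
    """
    t1 = timings[1]
    return {n: {"speedup": t1 / t, "efficiency": t1 / (n * t)}
            for n, t in sorted(timings.items())}

# Made-up illustrative numbers, not real benchmark data:
example = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0}
for n, row in scaling_table(example).items():
    print(f"{n:2d} threads: speedup {row['speedup']:.2f}, "
          f"efficiency {row['efficiency']:.0%}")
```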

Old   January 23, 2023, 10:05
Default
  #14
Senior Member
 
Lorenzo Galieti
Join Date: Mar 2018
Posts: 373
Rep Power: 12
LoGaL is on a distinguished road
OK, let's see then. Though it really feels like a very talented 100 m sprinter (the Ryzen) is being compared to a marathon runner in a 200 m race. Maybe it's not their forte, but the 100 m sprinter wins.

Old   January 23, 2023, 10:36
Default
  #15
Member
 
Otari kemularia
Join Date: Mar 2018
Posts: 44
Rep Power: 8
otokemo is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining in as much detail as possible what your simulation is doing.
2) Equally important: how you measure run time, and what is included in that run time.
3) To keep things simple, I recommend disabling SMT/HT on both systems before doing further testing.
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So take your case and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system: run it on 1, 2, 4, 8, 16, and 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware". You can attach that as a text file to your next post. numactl may need to be installed first.
1) I was doing a transient simulation with LES of the flow over a golf ball.
2) I measured it with the total wall-clock time.
3) Noted.

4) Here's the test I made, but with a steady flow over a wing: 2.5e+6 nodes and 6.68e+6 elements, the k-omega turbulence model and the Coupled solver.

Here everything looks fine and normal, as it should be. I think there was some problem with the golf ball mesh and the LES model.
Attached Images
File Type: png Screenshot 2023-01-23 193509.png (22.7 KB, 14 views)
File Type: png Screenshot 2023-01-23 193608.png (29.2 KB, 20 views)

Last edited by otokemo; January 24, 2023 at 01:36. Reason: The number of elements is 6.68e+6, nodes - 2.5e+6

Old   January 23, 2023, 11:11
Default
  #16
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,665
Rep Power: 65
LuckyTran has a spectacular aura about
And that's pretty much a textbook result. Your machines are working fine. The Ryzen runs faster at the same thread count, almost exactly according to the ratio of the core clock speeds, so for this particular simulation you are not bandwidth limited. Once you get grid sizes that completely fill up the RAM, you will be, and in that scenario the Ryzen at maximum throughput should be slower according to the bandwidth ratio, roughly half, but I don't know your actual bandwidth, so I can't tell you what the actual ratio will be.

And the results for the Ryzen at 12 and 16 threads and the E5-2697A at 48 and 64 aren't a surprise either. Hyperthreading isn't much better, and sometimes it hurts, but it certainly cost you more money in extra licenses!

I guess now would be a great time to go back and debug your LES model to figure out why it is running so slowly. It's not the machine, but it could be some factor that compounds between the model and the machine.
