CFD Online Discussion Forums


otokemo January 20, 2023 18:57

2x E5-2697A performance issues
 
2x E5-2697A performance worse than Ryzen 5800x.

From the suggestions that I got here, I've upgraded my office Dell Precision T7810 with dual E5-2697A v4s, so now I have 32 cores and 64 threads. For comparison, I ran a 4E+06-element transient simulation on two computers, one with a 5800X and the second with 2x E5-2697A v4. Both systems have 128 GB of DDR4 RAM, with the latter having ECC memory. For the 5800X I used only 14 threads, and for the Xeons 60. The task manager shows that the corresponding number of cores is being used, but the 5800X performs much better. Is it possible that we don't have the corresponding number of HPC pack licenses? I'm writing this from home and cannot check it right now, but it didn't occur to me before, since the task manager showed 100% load on 60 threads.

LuckyTran January 20, 2023 19:04

40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loads for either system, which makes it really hard to benchmark anything.


Fluent will tell you right away if you don't have enough licenses on launch. It will not calculate.

otokemo January 21, 2023 01:14

Quote:

Originally Posted by LuckyTran (Post 843220)
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loads for either system, which makes it really hard to benchmark anything.


Fluent will tell you right away if you don't have enough licenses on launch. It will not calculate.

My bad! It's 4e+6 not e+3.
What do you mean by non-ideal loading?

wkernkamp January 21, 2023 19:51

For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because speed is not compute limited but memory bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.


Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in the correct slot 1 for each memory channel, as per:
Quote:

Originally Posted by Crowdion (Post 843249)
To get the best performance you should populate the RAM slots correctly. The IBM document in the link below explains the concept of balanced memory configurations:

https://lenovopress.lenovo.com/lp050...-configuration


If the setup of your 2x E5-2697A v4 system is fixed, you should see this machine run at least twice as fast as the 5800X.
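
As a rough sanity check on the bandwidth argument, here is a back-of-the-envelope sketch in Python. It assumes the official memory speeds (DDR4-2400 for the E5-2697A v4, DDR4-3200 for the 5800X) and an 8-byte-wide channel; real sustained bandwidth will be noticeably lower on both systems.

Code:

# Theoretical peak memory bandwidth, assuming official (non-overclocked)
# memory speeds and an 8-byte-wide DDR4 channel.
def peak_bw_gbs(mt_per_s, channels, sockets=1):
    """Peak bandwidth in GB/s: MT/s x 8 bytes x channels x sockets."""
    return mt_per_s * 8 * channels * sockets / 1000

xeon = peak_bw_gbs(2400, channels=4, sockets=2)  # 2x E5-2697A v4, quad channel each
ryzen = peak_bw_gbs(3200, channels=2)            # 5800X, dual channel

print(f"2x E5-2697A v4: {xeon:.1f} GB/s")        # 153.6 GB/s
print(f"Ryzen 5800X:    {ryzen:.1f} GB/s")       # 51.2 GB/s
print(f"ratio:          {xeon / ryzen:.1f}x")    # 3.0x theoretical

On paper the dual-Xeon box has roughly three times the peak bandwidth, which is consistent with "at least twice as fast" once real-world memory efficiency is factored in.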

otokemo January 22, 2023 02:17

Quote:

Originally Posted by wkernkamp (Post 843255)
For best performance you should use 32 threads on the 2x E5-2697A v4. That is the actual number of cores. Hyper-Threading does not help CFD performance, because speed is not compute limited but memory bandwidth limited. On the contrary, it will hurt because of memory bandwidth contention between all those threads.


Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in the correct slot 1 for each memory channel, as per:



If the setup of your 2x E5-2697A v4 system is fixed, you should see this machine run at least twice as fast as the 5800X.

All 8 slots are filled with exactly the same 16 GB DIMMs.

As for the threads, whenever I used all the threads I always got better performance, on any desktop configuration.
Even with my 5800X, I'm using all threads.

LuckyTran January 22, 2023 02:38

60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can also will not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).


You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows down the calculation by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.


As mentioned, the E5-2697A v4 should run 2x as fast because it supports quad-channel memory, whereas a 5800X only supports dual-channel. It has double the memory throughput.
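
The synchronization argument can be made concrete with a toy model. This is a sketch, not Fluent's actual scheduler, and the assumption that two ranks sharing a physical core each run at roughly half speed is illustrative rather than measured.

Code:

# Toy model of end-of-iteration barriers with hyperthreading. Every
# rank waits at the barrier for the slowest one; if any physical core
# hosts two ranks, those ranks set the pace (assumed half speed each).
def iteration_time(n_ranks, n_cores=32, total_work=1.0):
    work_per_rank = total_work / n_ranks
    slowest_rate = 0.5 if n_ranks > n_cores else 1.0
    return work_per_rank / slowest_rate

for n in (32, 60, 64):
    print(f"{n:>2} ranks on 32 cores: relative iteration time {iteration_time(n):.4f}")
# 32 -> 0.0312, 60 -> 0.0333 (worst of the three), 64 -> 0.0312

In this crude model 60 ranks is slower than either 32 or 64, and 64 buys nothing over 32; the real penalty depends on how the OS schedules the straggler ranks.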

otokemo January 22, 2023 03:00

Quote:

Originally Posted by LuckyTran (Post 843260)
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can also will not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).


You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows down the calculation by pretty much 50%, since the processes have to wait for the others to synchronize each iteration.


As mentioned, the E5-2697A v4 should run 2x as fast because it supports quad-channel memory, whereas a 5800X only supports dual-channel. It has double the memory throughput.

Really interesting, thanks!
So either I use all 64 (which isn't advised) or I use 32. But isn't it also recommended to leave 1-2 cores free, so they can work on read/write and other tasks? So if I select only 30 cores, is it still OK?

LuckyTran January 22, 2023 04:20

It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.


30 would do better than 60 because it does not have the synchronization issue. However, you should not be leaving cores free to do other tasks, because you should not be doing other tasks! CFD is memory bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it must compete for memory. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?
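
A rough footprint estimate shows why a case of this size streams from RAM rather than cache. The ~1 kB-per-cell figure below is a common Fluent rule of thumb, not an exact number, and varies a lot with solver settings and models.

Code:

# Rough working-set estimate for a 4M-cell case.
cells = 4e6
bytes_per_cell = 1_000               # ~1 kB/cell rule-of-thumb assumption
working_set_gb = cells * bytes_per_cell / 1e9

l3_mb = 2 * 40                       # 2x E5-2697A v4, 40 MB L3 each

print(f"working set ~{working_set_gb:.0f} GB vs {l3_mb} MB of total L3")
# ~4 GB vs 80 MB: nearly all solver data is re-read from RAM every
# iteration, so memory bandwidth, not core count, sets the pace.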

otokemo January 22, 2023 08:32

Quote:

Originally Posted by LuckyTran (Post 843262)
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.


30 would do better than 60 because it does not have the synchronization issue. However, you should not be leaving cores free to do other tasks, because you should not be doing other tasks! CFD is memory bandwidth limited when you have a reasonably sized mesh; doing anything else will slow down the calculation because it must compete for memory. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?

The recommendation for Ansys Mechanical is to leave 1-2 cores free, so they can do the read/write tasks. I've done that with 30 cores (leaving 2 cores free) and it reduced the calculation time. I thought the same applied to Fluent.

LuckyTran January 23, 2023 01:09

Well, it would certainly have been important to mention that in your first post. If you are optimizing for reading/writing to a hard disk, then you have issues completely unrelated to how to do CFD. You have a problem of reading/writing to the hard disk.


In the vast majority of cases, very little writing to the hard disk is done in CFD. If the disk writing rate is significant, then you're likely writing TBs per hour, your hard disk is already full, and your CPUs are idle because they're not doing anything.


Also, leaving 1 or 2 cores free definitely doesn't apply to hyperthreaded cores.

flotus1 January 23, 2023 05:49

All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining as detailed as possible what your simulation is doing.
2) And equally important: how you measure run time, and what is included in your run time.
3) In order to keep things simple, I recommend disabling SMT/HT on both systems before doing further testing
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So you take your case, and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system. Run it on 1, 2, 4, 8, 16, 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware"
You can attach that as a text file to your next post. It is possible that numactl needs to be installed first.
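
Once those runs are logged, speedup and parallel efficiency fall out directly. Here is a minimal sketch for tabulating them; the wall times below are hypothetical placeholders to be replaced with the measured values.

Code:

# Strong-scaling bookkeeping for the suggested test sweep.
def report(label, runs):
    t1 = runs[1]                     # single-thread reference time
    for n, t in sorted(runs.items()):
        speedup = t1 / t
        print(f"{label}: {n:>2} threads, {t:7.1f} s, "
              f"speedup {speedup:5.2f}x, efficiency {speedup / n:6.1%}")

# Placeholder wall times in seconds -- substitute your own logs.
report("5800X",          {1: 1000.0, 2: 520.0, 4: 290.0, 8: 210.0})
report("2x E5-2697A v4", {1: 1400.0, 2: 710.0, 4: 360.0, 8: 185.0,
                          16: 100.0, 32: 60.0})

Efficiency that collapses after a few threads points to a memory bandwidth ceiling; efficiency staying near 100% points to a compute-bound case.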

LoGaL January 23, 2023 07:53

TBH, for a 4-million-element mesh, I wouldn't be surprised if the much higher clock frequency of the 5800X balanced out the memory bandwidth reduction. Most probably, the non-ECC memory of the Ryzen also runs at a higher clock speed (3600 vs 2933?). This should partially compensate for the lower channel count, shouldn't it?

I am not even sure there is a memory bandwidth problem with such a small mesh...

LuckyTran January 23, 2023 09:44

Well, after learning that the mesh is not 40k cells, I am just going with the assumption that the purpose of this exercise is to benchmark two systems, so I would expect the CFD simulation being used as a benchmark to actually be capable of stressing the system. If 4 million cells is not enough, then it is a simple matter to make a bigger mesh that is.

The scaling tests with 1, 2, 4 cores on both systems will tell you where the limitations arise: whether the CPU is the bottleneck, the memory is the bottleneck, or, in this case apparently, the hard disk is the bottleneck.

I also highly recommend turning off HT on every machine. Whatever is done or not done, please at least stop doing the things that don't make sense: don't launch calculations using 14 threads on a Ryzen or 60 threads on an E5-2697A. And I'm not even going to bother asking whether we are running base clock speeds or whether any of the multipliers have been adjusted. Just run the test.

LoGaL January 23, 2023 10:05

OK, let's see then. Though it really feels like a very talented 100 m sprinter (the Ryzen) is being compared to a marathon runner in a 200 m competition. Maybe it's not his/her forte, but the 100 m sprinter wins.

otokemo January 23, 2023 10:36

2 Attachment(s)
Quote:

Originally Posted by flotus1 (Post 843303)
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.

Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) You can start by explaining as detailed as possible what your simulation is doing.
2) And equally important: how you measure run time, and what is included in your run time.
3) In order to keep things simple, I recommend disabling SMT/HT on both systems before doing further testing
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong.
So you take your case, and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us.
Same with the Xeon system. Run it on 1, 2, 4, 8, 16, 32 threads and share the results.

If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware"
You can attach that as a text file to your next post. It is possible that numactl needs to be installed first.

1) I was doing a transient simulation with LES of the flow over a golf ball.
2) I measured it with total wall-clock time.
3) Noted.

4) Here's the test I made, but with a steady flow over a wing, 2.5e+6 nodes, and a 6.68e+6 element count. K-omega turbulence model and coupled solver.

Here everything looks fine and normal, as it should be. I think there was some problem with the golf ball mesh and the LES model.

LuckyTran January 23, 2023 11:11

And that's pretty much a textbook result. Your machines are working fine. The Ryzen runs faster at the same thread count exactly according to the ratio of the core clock speeds, so for this particular simulation you are not bandwidth limited. Once you get grid sizes that completely fill up the RAM, though, then you will be, and in that scenario you should see the Ryzen at maximum throughput be slower according to the bandwidth ratio, roughly half, but I don't know what your actual bandwidth is to tell you what the actual ratio will be.

And the results for the Ryzen at 12 and 16 threads and the E5-2697A at 48 and 64 aren't a surprise either. Hyperthreading isn't much better, and sometimes it sucks, but it certainly cost you more money to buy more licenses!


I guess now would be a great time to go back and debug your LES model to figure out why it is running so slowly. It's not the machine, but it could be some factor that has compounded between the model and the machine.
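
That diagnosis can be turned into a quick rule of thumb: at equal thread counts the run-time ratio tracks the clock ratio when compute bound, and the fully loaded throughput tracks the bandwidth ratio when bandwidth bound. A sketch, using approximate boost clocks and the theoretical peak bandwidths estimated earlier in the thread.

Code:

# Quick compute-bound vs bandwidth-bound check. Clock figures are
# approximate boost clocks; bandwidths are theoretical peaks.
clock_ratio = 4.7 / 3.6              # 5800X boost vs E5-2697A v4 max turbo
bandwidth_ratio = 153.6 / 51.2       # dual Xeon vs 5800X peak GB/s

print(f"compute bound:   Xeon ~{clock_ratio:.2f}x slower per thread")
print(f"bandwidth bound: loaded Xeon up to ~{bandwidth_ratio:.1f}x faster")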

