#1
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
2x E5-2697A performance worse than Ryzen 5800x.
Following the suggestions I got here, I've upgraded my office Dell Precision T7810 with dual E5-2697As, so I now have 32 cores and 64 threads. For comparison, I ran a 4e+06-element transient simulation on two computers, one with a 5800X and the other with the 2x E5-2697A; both systems have 128 GB of DDR4 RAM, the latter with ECC memory. For the 5800X I used only 14 threads, and for the Xeons 60. The task manager shows the corresponding number of cores being used, but the 5800X performs much better. Is it possible that we don't have enough HPC Pack licenses? I'm writing this from home and cannot check right now, but it didn't occur to me before, since the task manager showed 100% load on 60 threads. Last edited by otokemo; January 21, 2023 at 01:12.
#2
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
40k elements is a very small count and will not scale well to high thread counts. ECC memory is slower than non-ECC memory. 14 and 60 threads are also non-ideal loadings for either system, which makes it really hard to benchmark anything.
Fluent will tell you right away if you don't have enough licenses on launch. It will not calculate. |
#3
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
Quote:
What do you mean by non-ideal loading?
#4
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 261
Rep Power: 12
For best performance you should use 32 threads on the 2x E5-2697A v4; that is the actual number of physical cores. Hyper-Threading does not help CFD performance, because speed is not compute limited but memory bandwidth limited. On the contrary, it will hurt, because of memory bandwidth contention between all those threads.
Also, make sure that you have at least eight equal-sized, equal-performance RDIMMs in total, four per CPU, installed in slot 1 of each memory channel, as per: Quote:
Once the set-up of the 2x E5-2697A v4 system is fixed, you should see that machine be at least twice as fast as the 5800X.
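As a rough sanity check on that "twice as fast" expectation, nominal peak DRAM bandwidth is just channels x transfer rate x 8 bytes per transfer. A minimal sketch in Python, assuming the rated speeds (DDR4-2400 for the E5-2697A v4, DDR4-3200 for the 5800X); your installed DIMMs may run at different speeds:

```python
def peak_bw_gbs(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Nominal peak DRAM bandwidth in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mt_per_s * bytes_per_transfer / 1e3  # MB/s -> GB/s

# 2x E5-2697A v4: quad-channel DDR4-2400 per socket, two sockets
xeon_bw = 2 * peak_bw_gbs(channels=4, mt_per_s=2400)
# Ryzen 5800X: dual-channel DDR4-3200
ryzen_bw = peak_bw_gbs(channels=2, mt_per_s=3200)

print(f"Xeon pair: {xeon_bw:.1f} GB/s, Ryzen: {ryzen_bw:.1f} GB/s, "
      f"ratio {xeon_bw / ryzen_bw:.1f}x")
```

On paper that is a 3x advantage; sustained bandwidth and NUMA effects usually bring the realized gap down, which is consistent with expecting at least 2x in practice.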
#5
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
Quote:
As for the threads: whenever I used all the threads, I always got better performance, on any desktop configuration. Even with my 5800X, I'm using all threads.
#6
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
60 Fluent processes do not utilize all 32 cores and 64 threads. 60 processes running as fast as they can also will not show 100% core utilization in the task manager; it would show 60/64 = 93.75% (or a little higher, since background tasks will likely be running).
You never want a partial number of hyperthreaded cores loaded: either none or all of them. Using a weird number of Fluent processes slows the calculation down by pretty much 50%, since each process has to wait for the others to synchronize every iteration. As mentioned, the E5-2697A v4 should run 2x as fast because it supports quad-channel memory, whereas a 5800X only supports dual-channel. It has double the memory throughput.
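The utilization arithmetic is easy to check yourself. A tiny illustrative Python snippet using the thread counts from this thread (nothing Fluent-specific here):

```python
def expected_utilization(processes: int, logical_cores: int) -> float:
    """Task-manager CPU load if each solver process saturates one logical core."""
    return 100.0 * processes / logical_cores

print(expected_utilization(60, 64))  # 93.75 -- the Xeon box with HT enabled
print(expected_utilization(14, 16))  # 87.5  -- the 14-thread run on the 5800X
```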
#7
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
Quote:
So either I use all 64 (which isn't advised) or I use 32. But it's also recommended to leave 1-2 cores free so they can handle read/write and other tasks? So if I select only 30 cores, is that still OK?
#8
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
It is advised to use 64. Do not leave 1 or 2 hyperthreaded cores idle; that's exactly what slows it down.
30 would do better than 60 because it does not have the synchronization issue. However, you should not be leaving cores free for other tasks, because you should not be doing other tasks! CFD is memory bandwidth limited when you have a reasonably sized mesh, and doing anything else will slow down the calculation because it must compete for memory. The only time you should even think of utilizing only 30 is if you just don't have the licenses. You spent all that money on a better machine just to not use it!?
#9
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
Quote:
#10
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
Well, it would certainly have been important to mention that in your first post. If you are optimizing for reading/writing to a hard disk, then you have issues completely unrelated to how to do CFD: you have a problem of reading/writing to the hard disk.
In the vast majority of cases, very little writing to the hard disk is done in CFD. If the disk-writing rate is significant, then you're likely writing TBs per hour, your hard disk is already full, and your CPUs are idle because they're not doing anything. Also, the advice about leaving 1 or 2 cores free definitely doesn't apply to hyperthreaded cores.
#11
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,308
Rep Power: 44
All right, since it was my recommendation that brought us here, let's try to get to the bottom of this. I mentioned the caveat of counterintuitive results multiple times while making a tentative recommendation, but that's beside the point.
Since your CFD simulations don't behave as expected, we need to gather as much information as we can.
1) Start by explaining in as much detail as possible what your simulation is doing.
2) Equally important: how you measure run time, and what is included in your run time.
3) To keep things simple, I recommend disabling SMT/HT on both systems before doing further testing.
4) We need strong scaling tests. These usually provide valuable insight into what is going wrong. Take your case and run it with 1, 2, 4, and 8 threads on the Ryzen 5800X. Log the run time for each test and share the results with us. Same with the Xeon system: run it on 1, 2, 4, 8, 16, and 32 threads and share the results.
If you are on Linux, please also share the output of "sudo dmidecode -t 17" and "numactl --hardware". You can attach that as a text file to your next post. It is possible that numactl needs to be installed first.
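Once the run times are logged, strong-scaling speedup and parallel efficiency fall out directly (speedup S_n = T_1/T_n, efficiency E_n = S_n/n). A sketch of the bookkeeping in Python; the timings in the example are made-up placeholders, not measurements:

```python
def scaling_rows(times: dict[int, float]) -> list[tuple[int, float, float, float]]:
    """Return (threads, wall time, speedup, efficiency) relative to the 1-thread run."""
    t1 = times[1]
    return [(n, times[n], t1 / times[n], t1 / (times[n] * n)) for n in sorted(times)]

# Hypothetical timings for illustration only -- substitute your logged wall-clock times
rows = scaling_rows({1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0})
for n, t, s, e in rows:
    print(f"{n:>3} threads: {t:7.1f} s  speedup {s:5.2f}  efficiency {e:6.1%}")
```

Efficiency well below 100% already at 2 or 4 threads points at a memory (or disk) bottleneck rather than the CPU cores themselves.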
#12
Senior Member
Lorenzo Galieti
Join Date: Mar 2018
Posts: 348
Rep Power: 11
TBH, for a 4-million-element mesh, I wouldn't be surprised if the much higher clock frequency of the 5800X balanced out the memory bandwidth reduction. Most probably the non-ECC memory of the Ryzen also runs at a higher clock speed (3600 vs 2933?). That should partially compensate for the lower channel count, shouldn't it?
I am not even sure there is a memory bandwidth problem with such a small mesh...
#13
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
Rep Power: 63 ![]() ![]() ![]() |
Well, after learning that the mesh is not 40k cells, I am going with the assumption that the purpose of this exercise is to benchmark two systems, so I would expect the CFD simulation being used as a benchmark to actually be capable of stressing the system. If 4 million cells is not enough, then it is a simple matter to make a bigger mesh that is.
The scaling tests with 1, 2, 4 cores on both systems will tell you where the limitations arise: whether the CPU is the bottleneck, the memory is the bottleneck, or, in this case apparently, the hard disk is the bottleneck. I also highly recommend turning off HT on every machine. Whatever is done or not done, please at least stop doing the things that don't make sense: don't launch calculations using 14 threads on a Ryzen or 60 threads on an E5-2697A. And I'm not even going to bother asking whether we are running base clock speeds or whether any of the multipliers have been adjusted. Just run the test.
#14
Senior Member
Lorenzo Galieti
Join Date: Mar 2018
Posts: 348
Rep Power: 11
OK, let's see then. Though it really feels like a very talented 100 m sprinter (the Ryzen) is being compared to a marathon runner in a 200 m race. Maybe it's not their forte, but the sprinter still wins.
#15
Member
Otari kemularia
Join Date: Mar 2018
Posts: 40
Rep Power: 7
Quote:
2) I measured it with total wall-clock time.
3) Noted.
4) Here's the test I made, but with a steady flow over a wing: 2.5e+6 nodes and 6.68e+6 elements, the k-omega turbulence model, and the coupled solver. Here everything looks fine and normal, as it should. I think there was some problem with the golf-ball mesh and the LES model. Last edited by otokemo; January 24, 2023 at 01:36. Reason: The number of elements is 6.68e+6, nodes - 2.5e+6
#16
Senior Member
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,250
Rep Power: 63
And that's pretty much a textbook result. Your machines are working fine. The Ryzen runs faster at the same thread count exactly according to the ratio of the core clock speeds, so for this particular simulation you are not bandwidth limited. Once you get grid sizes that completely fill up the RAM, though, then you will be, and in that scenario the Ryzen at maximum throughput should be slower according to the bandwidth ratio, roughly half; but I don't know your actual measured bandwidth, so I can't tell you the exact ratio.
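The clock-speed-ratio check is simple arithmetic. A sketch using nominal boost clocks from the public spec sheets; treat these as assumptions, since all-core clocks under load will be lower, and substitute the frequencies you actually observe:

```python
# Nominal max boost clocks in GHz from the public spec sheets; real all-core
# clocks under sustained load will be lower -- treat these numbers as assumptions.
ryzen_ghz = 4.7  # Ryzen 7 5800X
xeon_ghz = 3.6   # Xeon E5-2697A v4

# In a compute-limited run at equal thread counts, the expected Xeon/Ryzen
# runtime ratio tracks the inverse clock ratio.
print(f"expected Xeon/Ryzen runtime ratio: {ryzen_ghz / xeon_ghz:.2f}")
```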
And the results for the Ryzen at 12 and 16 threads and the E5-2697A at 48 and 64 aren't a surprise either. Hyperthreading isn't much better, sometimes it's worse, but it certainly costs you more money in licenses! I guess now would be a great time to go back and debug your LES model to figure out why it is running so slowly. It's not the machine, but it could be some factor in the model that compounds with this particular machine.