Best PC recommendation for special CFD simulation with a short time step

Habib-CFD · November 29, 2019, 01:04

Hi guys, I use the Flow 3D software for a special CFD simulation that includes both heat and mass transfer. Given the nature of the problem, I have to use a very short time step to reach convergence. The total cell count is about 100k. Currently, I use an AMD 2970WX with 4*8GB of 3200MHz RAM and CentOS v7.7. This PC shows very high multi-core performance on the default examples. In cases with a larger time step, performance on 20 cores is about twice that on 8 cores, but my customized model reaches its best performance at just 8 cores. I have tried many things, for example disabling SMT, setting memory interleaving to channel, and configuring NUMA, without any notable benefit. I also tested another PC (Intel 6580K) and found the same multi-processing problem.

I would be grateful if you could suggest a hardware replacement for this special case (up to 2000$) or some way to improve the performance of my current PC. The first link below describes floating-point performance as the most important parameter in CFD.

https://www.flow3d.com/hardware-sele...w-3d-products/
https://en.wikichip.org/wiki/flops

Based on the linked pages, the 2970WX uses AVX2 & FMA (128-bit), while Intel products use AVX2 & FMA (256-bit) or AVX-512 & FMA (512-bit), with more FLOPs per cycle. I have tried to find out what benefit AVX-512 brings, although there are some claims that it can reduce efficiency!

https://software.intel.com/en-us/for...g/topic/815069
https://lemire.me/blog/2018/04/19/by...st-experiment/
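A minimal sketch of the FLOPs-per-cycle comparison above, assuming the usual back-of-the-envelope formula (cores x clock x FLOPs per cycle per core). The function name `peak_gflops`, the clock speed, and the number of FMA units per core are illustrative assumptions, not figures taken from the links.

```python
def peak_gflops(cores, clock_ghz, simd_bits, fma_units, dtype_bits=64):
    """Theoretical peak GFLOP/s = cores * clock (GHz) * FLOPs per cycle per core."""
    lanes = simd_bits // dtype_bits          # values handled by one SIMD instruction
    flops_per_cycle = lanes * 2 * fma_units  # one FMA counts as multiply + add
    return cores * clock_ghz * flops_per_cycle


# Illustrative inputs only: clock and FMA unit counts are assumptions.
narrow = peak_gflops(cores=24, clock_ghz=3.0, simd_bits=128, fma_units=2)
wide = peak_gflops(cores=8, clock_ghz=3.0, simd_bits=512, fma_units=2)
print(f"24 cores, 128-bit FMA: {narrow:.0f} GFLOP/s")
print(f" 8 cores, 512-bit FMA: {wide:.0f} GFLOP/s")
```

This is only an upper bound on arithmetic throughput. It says nothing about whether the solver can actually be fed with data fast enough to reach it.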

As I understand it, poor multi-core efficiency means that either the problem is not balanced correctly across the cores or the time spent on communication is significant compared to the computing time. I believe my model is somehow limited by inter-core communication time. BTW, I am not really sure about the new architecture of the Intel products. My license covers up to 32 cores, and the time step must be limited to 5e-06s for the 0.25mm cubic mesh size.
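The communication-versus-compute reasoning above can be illustrated with a toy scaling model. This is a sketch under stated assumptions: an Amdahl-style serial fraction plus a communication cost that grows linearly with core count. The parameter values are hypothetical and are not measurements of Flow 3D.

```python
def speedup(n_cores, serial_fraction, comm_cost_per_core):
    """Speedup over one core for a toy model of parallel run time.

    Run time on n cores, normalised to the single-core time:
        t(n) = serial + (1 - serial) / n + comm_cost_per_core * (n - 1)
    """
    t_n = (serial_fraction
           + (1.0 - serial_fraction) / n_cores
           + comm_cost_per_core * (n_cores - 1))
    return 1.0 / t_n


def best_core_count(max_cores, serial_fraction, comm_cost_per_core):
    """Core count (1..max_cores) that gives the highest modelled speedup."""
    return max(range(1, max_cores + 1),
               key=lambda n: speedup(n, serial_fraction, comm_cost_per_core))


# Hypothetical parameters: 5% serial work, 1% extra cost per additional core.
for n in (4, 8, 16, 32):
    print(n, round(speedup(n, 0.05, 0.01), 2))
print("best core count:", best_core_count(32, 0.05, 0.01))
```

With any non-zero communication cost the modelled speedup peaks at some core count and then falls, which is the qualitative behaviour described in the post.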

flotus1 · November 29, 2019, 04:50

There is A LOT to unravel here. I might get back to it over the course of the weekend.
For now, I cannot help but notice that the first link you quoted is riddled with questionable claims about CFD performance, and it contradicts itself.
To quote from that source:

Quote:

because CFD solver performance is entirely dependent on the floating-point performance of the CPU

This is plain wrong. Memory performance plays an equally important role, if not a more important one.
Weirdly enough, they follow up by contradicting themselves:

Quote:

Skylake and newer considerations

We have determined that the way RAM is physically populated on the board is extremely important for performance on Skylake and newer architectures. For a CPU that supports six memory channels, 6 or 12 DIMMs should be populated identically. For four channel CPUs, 4 or 8 DIMMs should be populated identically.
An unbalanced configuration, where the memory channels or DIMM size/speed are mismatched, reduces performance significantly.

Which is a complicated way of saying: memory performance is crucial for CFD solver performance.

So in short: you are not looking for the highest theoretical floating point performance in a CPU, but a balance between FP performance and memory subsystem performance.
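One common way to picture this balance is the roofline model, which is brought in here as an illustration and is not something the thread itself uses. Attainable throughput is the smaller of the floating-point peak and memory bandwidth times arithmetic intensity. All numbers below are hypothetical placeholders.

```python
def attainable_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Roofline estimate: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)


def is_memory_bound(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """True when memory bandwidth, not FP throughput, sets the limit."""
    return mem_bandwidth_gbs * flops_per_byte < peak_gflops


# Hypothetical machine: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
# Hypothetical kernel: 0.25 FLOPs per byte moved (low arithmetic intensity).
print(attainable_gflops(500.0, 50.0, 0.25))
print(is_memory_bound(500.0, 50.0, 0.25))
```

For a kernel with low arithmetic intensity, doubling the FP peak changes nothing in this estimate, while doubling memory bandwidth doubles it.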

flotus1 · December 1, 2019, 04:59

All right, here is my take on this issue.

First things first: with a low cell count of 200k, your case just might not scale very well on many cores, no matter what you do. Time step size should have very little impact on the whole situation.
But here is what you can try with your 2970WX:

Disable SMT.
Deactivate the cores on chiplets without direct memory access. Running CFD codes on them will slow down your computations. This will leave you with 12 usable cores. Either deactivate the cores in the BIOS (no idea if it lets you do that), or pin your simulation to the other cores using taskset or similar tools.
Use NUMA mode and set the memory interleaving option to channel.
Make sure memory is populated correctly and runs at the intended frequency. The motherboard manual should have information about which DIMM slots to use for 4 DIMMs.
Optimize memory timings further, using Ryzen DRAM calculator.
Last but not least: make sure there are no other bottlenecks, for example I/O from frequent writes to a slow disk. In that case, increase the output interval.

As for buying a different PC just for this application: While the TR 2970WX might not be ideal for this task, it will be very difficult to get a significantly better configuration in the 2000$ price range. At least once all of the issues above have been addressed.
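The pinning step in the list above can be sketched as follows, assuming a Linux system that exposes NUMA information under `/sys/devices/system/node`. The helper names and the solver command `./run_solver case_dir` are hypothetical. How Flow 3D is actually launched is not stated in the thread.

```python
import os
import shlex


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-5,12-17" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def cpus_with_local_memory(node_root="/sys/devices/system/node"):
    """Logical CPUs that belong to NUMA nodes reporting their own memory."""
    with open(os.path.join(node_root, "has_memory")) as f:
        nodes = parse_cpulist(f.read())
    cpus = []
    for node in nodes:
        with open(os.path.join(node_root, f"node{node}", "cpulist")) as f:
            cpus.extend(parse_cpulist(f.read()))
    return sorted(cpus)


def taskset_command(cpus, solver_argv):
    """Build a `taskset -c <list> <command>` line that pins the solver to `cpus`."""
    cpu_arg = ",".join(str(c) for c in cpus)
    return "taskset -c " + cpu_arg + " " + " ".join(shlex.quote(a) for a in solver_argv)


if __name__ == "__main__":
    node_root = "/sys/devices/system/node"
    if os.path.exists(os.path.join(node_root, "has_memory")):
        cpus = cpus_with_local_memory(node_root)
        # "./run_solver case_dir" is a placeholder, not a real Flow 3D command.
        print(taskset_command(cpus, ["./run_solver", "case_dir"]))
```

With SMT enabled, the CPU list also contains the sibling threads of each core, so it is worth checking it against the output of `lscpu` before relying on it.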

Habib-CFD · December 1, 2019, 07:20

Quote:

Originally Posted by flotus1

All right, here is my take on this issue.

First things first: with a low cell count of 200k, your case just might not scale very well on many cores, no matter what you do. Time step size should have very little impact on the whole situation.
As for buying a different PC just for this application: While the TR 2970WX might not be ideal for this task, it will be very difficult to get a significantly better configuration in the 2000$ price range. At least once all of the issues above have been addressed.

First, thank you for your helpful reply. I should mention that all my CFD experience is limited to the Flow 3D software and Threadripper-based systems, so some of my claims may not hold in other cases.

I agree with your comment on multi-core performance with 200k cells, but the effect of the time step is completely obvious. Below a time step of 1e-6s, the solve time on 4 cores is similar to that on 20 cores. I am a little confused by this result and very interested in finding the bottleneck.

I have already tried these optimizations. For example, in my case disabling SMT did not give a significant improvement (less than 5 percent). Overclocking the RAM from 2166 (motherboard default) to 3200MHz (the RAM's rated speed with the best timings) gave about a 10 percent improvement. All four slots are populated correctly in quad-channel mode. In addition, I use a high-speed M.2 SSD (970 Evo). Setting memory interleaving to Die mode gave more benefit than Channel mode (about 5 percent in total). It seems the auto-configuration on the ASRock X399 Taichi works very well.

Maybe it is better to focus on the simulation code and look for a way to increase the time step or to define the problem better.
Thanks again.
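The two scaling observations in this thread (20 cores about twice as fast as 8 with a larger time step, and 20 cores no faster than 4 below 1e-6s) can be put on a common footing as relative parallel efficiency. This is a small sketch: the function name is made up, and the wall-clock values are placeholders, since only their ratios come from the posts.

```python
def relative_efficiency(base_cores, base_time, cores, time):
    """Parallel efficiency of a run on `cores` relative to a run on `base_cores`.

    1.0 means the extra cores are used perfectly; lower means diminishing returns.
    """
    measured_speedup = base_time / time
    ideal_speedup = cores / base_cores
    return measured_speedup / ideal_speedup


# Larger time step: 20 cores run about twice as fast as 8 cores.
print(relative_efficiency(8, 2.0, 20, 1.0))
# Very small time step: 20 cores take about as long as 4 cores.
print(relative_efficiency(4, 1.0, 20, 1.0))
```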

flotus1 · December 1, 2019, 08:48

If time step size really has such a high impact with your test case, it will be necessary to find out what is going on here. This is not normal.

Apart from that, you still seem to be missing the most important optimization I mentioned. Run the code only on cores that have direct memory access. Without this potentially huge bottleneck out of the way, judging the impact of other factors is rather pointless.

Habib-CFD · December 1, 2019, 10:01

Oh, I forgot to explain how I set up the cores with direct access to RAM. Linux provides a way to disable cores directly, so I tested different sets, e.g. disabling dies 1 and 3, with and without SMT. As I mentioned, because the 2970WX has so many threads, the effect somehow vanished. I have heard that this bottleneck has a significant effect on some benchmarks, such as 7-zip compression with more than 8 cores, but in Flow 3D the situation looks different.
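The post does not say which Linux mechanism was used to disable cores. One possibility is the CPU hotplug interface in sysfs, sketched below under that assumption. The CPU numbers are hypothetical, because which logical CPUs sit on which die depends on the system. Writing to these files requires root, so the sketch defaults to a dry run.

```python
def hotplug_writes(cpus, online):
    """Return (path, value) pairs that switch the given logical CPUs on or off.

    cpu0 is skipped because it often cannot be taken offline.
    """
    value = "1" if online else "0"
    return [(f"/sys/devices/system/cpu/cpu{c}/online", value)
            for c in cpus if c != 0]


def apply_writes(writes, dry_run=True):
    """Print the equivalent shell commands, or perform the writes if dry_run is False."""
    for path, value in writes:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


# Hypothetical CPU numbers standing in for the cores of one die.
apply_writes(hotplug_writes([6, 7, 8, 9, 10, 11], online=False))
```

Unlike pinning with taskset, this removes the cores for every process on the machine until they are switched back on or the system is rebooted.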

Thank you.
