Best practices on hybrid architecture CPUs (like i9-13900K) |
|
August 29, 2024, 14:55 |
Best practices on hybrid architecture CPUs (like i9-13900K)
|
#1 |
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
Hi all,
I'm looking for ways to speedup my simulations with the hardware that I have (i9-13900K, 2x32GB DDR5 6000MHz). This processor has 8 P-cores and 16 E-cores. I've read here that running a job on all 24 cores is nearly the same as running it on just the 8 P-cores (without hyperthreading). I'd like to know what you guys suggest to optimize my core usage in two scenarios:
Some extra questions:
- How can I pin an MPI job to only the P-cores or only the E-cores? Better yet, how can I make a default configuration so that every time I submit an MPI job it uses the 8 P-cores first, and then uses the E-cores if I submit another one?
- When using the "--cpu-set" flag in mpirun, how do I know the core indices of the P- and E-cores? (See the PS at the end of this post for the kind of check I mean.)
- Noob question now: "lscpu" says that I have 16 cores and 2 threads per core (32 threads total). I do have 32 threads, but in a different arrangement... Is it somewhat blind to the hybrid architecture of P- and E-cores, or am I missing some setup?
- Another noob question: my memory seems to be running at 4800MHz, but it's 6000MHz memory. Can I simply increase the frequency? Should I be concerned about anything when doing this?
- In long simulations (10+ days running non-stop), should I take any special precautions, like not using full RAM speed, fewer cores, etc.?
Thank you all! |
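PS: on the core-index question, this is the kind of check I mean (a sketch; I'm assuming a recent kernel that exposes Intel's hybrid topology through sysfs, which may well not be the case inside WSL):

Code:
# P-core vs E-core CPU indices on hybrid Intel (recent kernels;
# these sysfs nodes may be missing under WSL)
cat /sys/devices/cpu_core/cpus   # e.g. 0-7  -> P-cores
cat /sys/devices/cpu_atom/cpus   # e.g. 8-23 -> E-cores

# Cross-check with the per-CPU topology and max frequency:
# P-cores report a higher MAXMHZ than E-cores
lscpu --all --extended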
|
August 30, 2024, 14:49 |
|
#2 |
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
I made some progress, but still haven't found all the answers!
Using flags like "--bind-to core" and "--cpu-list" can pin the job to specific cores, and the "--report-bindings" flag confirms that the processes have been bound as requested. The P-core IDs are 0-7 and the E-core IDs are 8-23 (with HT off).
However, for some reason the solver runs at full speed (as if the P-cores were used) even when I bind it to the E-cores. I don't know why, but it seems to me that WSL doesn't have the "privileges" to dictate processor affinity... If I set the affinity manually in Task Manager it works, but when it comes in through mpirun flags, Windows seems to take over and decide the affinities itself. The exact invocations I tried are at the end of this post.
About the lscpu question: WSL doesn't seem to see all the cores in the right topology. Turning off HT solved this, and lscpu now sees 24 cores.
About the memory question: it seems to be a matter of overclocking (XMP) the memory. Haven't tried it yet.
When I manage to forcibly use the E-cores, I'll start experimenting with processorWeights to try to optimize the load on each core.
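For reference, here are those invocations (a sketch assuming Open MPI 4.x, with simpleFoam standing in for whatever solver you run; flag spellings differ a bit between MPI implementations and versions):

Code:
# 8 ranks pinned to the P-cores (IDs 0-7 here), one rank per core
mpirun -np 8 --cpu-set 0-7 --bind-to core --report-bindings simpleFoam -parallel

# 16 ranks pinned to the E-cores (IDs 8-23 here)
mpirun -np 16 --cpu-set 8-23 --bind-to core --report-bindings simpleFoam -parallel

# Fallback when mpirun's bindings get ignored: constrain the whole job
# from the outside with taskset
taskset -c 8-23 mpirun -np 16 simpleFoam -parallel

Last edited by JulioPieri; August 30, 2024 at 16:14. |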
|
August 30, 2024, 19:52 |
|
#3 |
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
You might have a look in this thread:
Intel i9 13900K with 8 channel were are Game Changer for CFD |
|
September 2, 2024, 10:51 |
|
#4 |
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
Yes, I've read it! Lots of useful info there indeed.
So my conclusions are:
1) There is no benefit to using more than 8 cores, as you hit memory-channel saturation and/or a bottleneck from the E-cores. I ran the same test case and got the same results with 8, 12, 16 and 24 cores (HT on/off).
2) HT on/off doesn't seem to change anything either, maybe because the system is managing things behind the curtain.
3) Load balancing (mesh decomposition biased towards the P-cores) actually worsened the results (a sketch of what I tried is at the end of this post).
I still have some doubts:
1) Why doesn't decomposing the domain with a bias (say, 2x more mesh elements on the P-cores) work? I'd expect there to be a point where adding the slower E-cores helps at least a little by slightly unloading the P-cores.
2) Simultaneous 4-8 core simulations seem to run at the same speed, even with binding on. I'd expect the ones bound to cores 0-7 to run faster... Is this a limitation of running through WSL?
3) Should I overclock my memory? It's advertised as 6000 but runs at only 4800. Would it cause system instabilities, or physical damage to any component? |
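The biased decomposition I tried was along these lines (a sketch of system/decomposeParDict; the 12-subdomain split and the 2:1 weights are just the example ratio I tested, not a recommendation):

Code:
// system/decomposeParDict (sketch)
numberOfSubdomains 12;
method          scotch;

scotchCoeffs
{
    // One weight per subdomain; a larger weight gets more cells.
    // Which rank lands on which core still depends on the MPI binding.
    processorWeights
    (
        2 2 2 2 2 2 2 2   // ranks intended for the 8 P-cores
        1 1 1 1           // ranks intended for 4 E-cores
    );
}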
|
September 3, 2024, 01:22 |
|
#5
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
1) It is in theory possible to give the slower cores a smaller share of the domain so that every core is loaded 100%. However, no improvement can occur if the memory is the bottleneck.
2) I do believe that WSL load-balances. Maybe do a dual boot with Linux?
3) You should definitely overclock your memory (quick check below). Just up the multiplier and you are probably fine. I have seen experts overclock to 7200. |
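A quick way to confirm what the DIMMs are actually doing after enabling XMP, on bare-metal Linux at least (probably not from inside WSL, where Task Manager's memory tab tells you the same thing):

Code:
# Rated vs. configured speed of each memory module (needs root)
sudo dmidecode --type 17 | grep -i speed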
|
September 3, 2024, 09:58 |
|
#6
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
Quote:
I increased the memory to 6000 and indeed got almost a 20% improvement. Overclocking it further, beyond the nominal spec, sounds risky to me...
To upgrade my station, is there anything worth doing with this setup, or is it better to save for a completely new one? Like uniform cores, more memory channels, etc. |
|
September 3, 2024, 16:38 |
|
#7
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
|
|
September 3, 2024, 17:47 |
|
#8 |
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
I might upgrade this PC in the near future.
Do you think it's better to abandon the hybrid i9-13900K completely and buy a fresh workstation with a more modular setup, processors better suited for CFD, etc.? Or would reusing this processor and, say, adding another one (moving to a two-socket motherboard) be a good choice? I mean, can I make this PC better for CFD, or am I better off getting a whole new workstation? |
|
September 4, 2024, 01:04 |
|
#9
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 372
Rep Power: 14 |
Quote:
Have you actually had slow runtimes? |
|
September 4, 2024, 10:21 |
|
#10 |
Senior Member
Julio Pieri
Join Date: Sep 2017
Posts: 107
Rep Power: 9 |
Thank you for your suggestion. It's actually a PC dedicated to CFD, which I purchased considering only the processor's clock. At the time I was blown away by the 5.6GHz of the 13900K, and I thought I could make use of all 24 cores (32 threads with HT) available, even with non-linear scaling. Only being able to effectively use 8 cores really fell short of my expectations.
I don't think my runtimes are especially slow. The cavity3D case with 1MM cells runs to a simulated time of 0.015s in:
- 8.14s on 8 cores (HT on)
- 7.71s on 16 cores (HT on)
- 7.29s on 32 cores (HT on)
From your comments in the other post, that seems a good result for 8 cores. For the other decomposition settings the gains are marginal, maybe within tolerance; I no longer expect any reduction in runtime from using more than 8 cores on the 13900. But I want to increase my processing capacity further so I can take on more complex projects, and being able to run multiple simulations at once is also appealing. The timing loop I used is at the end of this post. |
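That timing loop was essentially this (a sketch; cavity3D is my scaled-up copy of the icoFoam cavity tutorial, so the solver name and case layout are assumptions for anyone else's setup):

Code:
#!/bin/bash
# Re-decompose the same case for different core counts and time each run.
# Going past the physical core count (e.g. 32 ranks with HT on) may need
# mpirun's --use-hwthread-cpus.
for n in 8 16 24; do
    foamDictionary -entry numberOfSubdomains -set $n system/decomposeParDict
    decomposePar -force > /dev/null
    t0=$(date +%s.%N)
    mpirun -np $n icoFoam -parallel > log.icoFoam.n$n
    t1=$(date +%s.%N)
    echo "$n cores: $(echo "$t1 - $t0" | bc) s"
done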
|
Tags |
13900, hyperthreading, mpirun |