www.cfd-online.com
Home > Forums > General Forums > Hardware

Best practices on hybrid architecture CPUs (like i9-13900K)


August 29, 2024, 14:55   #1
Best practices on hybrid architecture CPUs (like i9-13900K)

Julio Pieri
Senior Member
Join Date: Sep 2017
Posts: 107
Hi all,

I'm looking for ways to speed up my simulations with the hardware I have (i9-13900K, 2x32GB DDR5 6000MHz).

This processor has 8 P-cores and 16 E-cores. I've read here that running a job on all 24 cores is nearly the same as running it on just the 8 P-cores (without hyperthreading).

I'd like to know what you suggest to optimize my core usage in two scenarios:
  1. Max speed for one single (large) simulation
    Should I use only the 8 P-cores? Could the 16 E-cores be better in some scenarios (like very large cell counts)? Should HT be on or off?
  2. Multiple MPI jobs running simultaneously, with 4 to 8 cores each
    Would something like 8 P + 8 E + 8 E work well, for instance? Can I turn HT on to "gain 8 threads", or will I lose performance on the job using the P-cores?

Some extra questions:
- How can I pin an MPI job to use only the P or E cores? Better yet, how can I set up a default configuration so that every MPI job I submit uses the 8 P-cores first, then falls back to the E-cores if I submit another one?
- When using the "--cpu-set" flag in mpirun, how do I know the core indices for the P and E cores?
- Noob question now: "lscpu" says I have 16 cores with 2 threads per core (32 threads total). I do have 32 threads, but in a different arrangement... Is it somewhat blind to the hybrid P/E topology, or am I missing some setup?
- Another noob question: my memory seems to be running at 4800MHz, but it's 6000MHz memory. Can I simply increase the frequency? Should I be concerned about anything when doing this?
- In long simulations (10+ days running continuously), should I take any special precautions, like not using full RAM speed, fewer cores, etc.?
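To make the pinning question concrete, this is the kind of thing I'm imagining; a sketch only, not verified on my machine. The sysfs paths below exist on recent Linux kernels for hybrid Intel CPUs but may be absent under WSL, and the fallback core IDs (0-7 for P, 8-23 for E) are an assumption:

```shell
# Discover P-core and E-core IDs on a hybrid Intel CPU. Recent kernels
# expose them under sysfs; fall back to the commonly reported layout.
P_CORES=$(cat /sys/devices/cpu_core/cpus 2>/dev/null || echo "0-7")
E_CORES=$(cat /sys/devices/cpu_atom/cpus 2>/dev/null || echo "8-23")
echo "P-cores: $P_CORES, E-cores: $E_CORES"

# One large job pinned to the P-cores only (Open MPI syntax):
#   mpirun -np 8 --cpu-list "$P_CORES" --bind-to core solverFoam -parallel
# A second, smaller job confined to the first 8 E-cores:
#   mpirun -np 8 --cpu-list 8-15 --bind-to core solverFoam -parallel
```

The mpirun lines are left as comments since "solverFoam" stands in for whatever solver you run.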

Thank you all!

August 30, 2024, 14:49   #2

Julio Pieri
I made some progress, but still haven't found all the answers!

Flags like "--bind-to core" and "--cpu-list" can pin the job to specific cores, and the "--report-bindings" flag shows that the processes have been properly bound. The P-core IDs are 0-7 and the E-cores are 8-23 (with HT off).
However, for some reason the solver runs at full speed (as if the P-cores were used) even when I bind it to the E-cores. I don't know why, but it seems that WSL doesn't have the "privileges" to dictate processor affinity... If I set it manually in Task Manager it works, but when it's set through mpirun flags, Windows seems to take over and decide the affinities.
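One way to check what a process actually got, sketched here with my own shell as a stand-in for a solver rank (taskset comes with util-linux; under WSL the mask Windows enforces may differ from what you set):

```shell
# Print the CPU affinity list a process currently has. Replace $$
# (this shell's PID) with the PID of a running solver rank, e.g.
# one found with pgrep.
affinity=$(taskset -cp $$ 2>/dev/null || echo "taskset unavailable")
echo "$affinity"
```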

About the lscpu question: it seems WSL doesn't see all cores with the right topology. Turning off HT solved this, and now lscpu sees 24 cores.

About the memory questions, it seems it's a matter of overclocking (XMP) the memory. Haven't tried this yet.

Once I manage to force the use of the E-cores, I'll start experimenting with processorWeights to try to optimize the load on each one.
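For reference, this is the kind of biased decomposition I mean; an illustrative decomposeParDict fragment, not something I've verified (the method and the 2:1 weights are just a starting guess):

```
numberOfSubdomains  16;
method              scotch;

scotchCoeffs
{
    // Roughly 2x more cells per P-core rank than per E-core rank
    processorWeights
    (
        2 2 2 2 2 2 2 2    // ranks 0-7  -> P-cores
        1 1 1 1 1 1 1 1    // ranks 8-15 -> E-cores
    );
}
```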

Last edited by JulioPieri; August 30, 2024 at 16:14.

August 30, 2024, 19:52   #3

Will Kernkamp
Senior Member
Join Date: Jun 2014
Posts: 372
You might have a look in this thread:
Intel i9 13900K with 8 channel were are Game Changer for CFD

September 2, 2024, 10:51   #4

Julio Pieri
Yes, I've read it! Lots of useful info there indeed.

So my conclusions are:
1) There is no benefit to using more than 8 cores, as you hit memory channel saturation and/or a bottleneck from the E-cores. I ran a test case and got the same results with 8, 12, 16, and 24 cores (HT on/off).
2) HT on/off doesn't seem to change anything either, maybe because the system is managing things behind the curtain.
3) Load balancing (mesh decomposition biased toward the P-cores) actually worsened the results.


I still have some doubts:
1) Why doesn't decomposing the domain with a bias (say, 2x more mesh elements on the P-cores) work? I'd expect there to be a point where adding slower E-cores would help at least a little by slightly unloading the P-cores.
2) Simultaneous 4-8 core simulations seem to run at the same speed, even with binding on. I'd expect the ones bound to cores 0-7 to run faster... Is this a limitation of running through WSL?
3) Should I overclock my memory? It's advertised as 6000, but it's running at only 4800. Would it cause system instabilities, or physical damage to any component?

September 3, 2024, 01:22   #5

Will Kernkamp
Quote:
Originally Posted by JulioPieri View Post
I still have some doubts:
1) Why doesn't decomposing the domain with a bias (say, 2x more mesh elements on the P-cores) work? I'd expect there to be a point where adding slower E-cores would help at least a little by slightly unloading the P-cores.
2) Simultaneous 4-8 core simulations seem to run at the same speed, even with binding on. I'd expect the ones bound to cores 0-7 to run faster... Is this a limitation of running through WSL?
3) Should I overclock my memory? It's advertised as 6000, but it's running at only 4800. Would it cause system instabilities, or physical damage to any component?

1) In theory it's possible to give the slower cores a smaller share of the domain so that every core is loaded 100%. However, no improvement can occur if the memory is bottlenecked.
2) I do believe that WSL load-balances. Maybe do a dual-boot install with Linux?
3) You should definitely overclock your memory. Just up the multiplier and you'll probably be fine. I have seen experts overclock to 7200.
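The reason this helps CFD so much: peak DDR5 bandwidth scales linearly with the transfer rate. A quick back-of-envelope, assuming dual channel at 8 bytes per transfer per channel:

```shell
# Peak bandwidth = transfer rate (MT/s) x 8 bytes/transfer x 2 channels
bandwidth_table=$(awk 'BEGIN {
  for (mt = 4800; mt <= 7200; mt += 1200)
    printf "DDR5-%d: %5.1f GB/s peak\n", mt, mt * 1e6 * 8 * 2 / 1e9
}')
echo "$bandwidth_table"
```

Real sustained bandwidth will be lower, but the ratio between the speed grades is what matters for a memory-bound solver.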

September 3, 2024, 09:58   #6

Julio Pieri
Quote:
Originally Posted by wkernkamp View Post
1) In theory it's possible to give the slower cores a smaller share of the domain so that every core is loaded 100%. However, no improvement can occur if the memory is bottlenecked.
2) I do believe that WSL load-balances. Maybe do a dual-boot install with Linux?
3) You should definitely overclock your memory. Just up the multiplier and you'll probably be fine. I have seen experts overclock to 7200.
Thank you!
I increased the memory to 6000 and indeed got almost a 20% improvement. Overclocking it further beyond the nominal spec sounds risky to me...

To upgrade my workstation, is there anything worth doing with this setup, or is it better to save up for a completely new one? Like identical cores and more memory channels, etc.

September 3, 2024, 16:38   #7

Will Kernkamp
Quote:
Originally Posted by JulioPieri View Post
Thank you!
I increased the memory to 6000 and indeed got almost a 20% improvement. Overclocking it further beyond the nominal spec sounds risky to me...

To upgrade my workstation, is there anything worth doing with this setup, or is it better to save up for a completely new one? Like identical cores and more memory channels, etc.
I don't understand the question. Are you thinking of replacing motherboard and processor?

September 3, 2024, 17:47   #8

Julio Pieri
I might upgrade this PC in the near future.

Do you think it's better to abandon the hybrid i9-13900K completely and buy a fresh workstation with a more modular setup and processors better suited for CFD? Or would reusing this processor and, say, adding another one (changing to a two-socket motherboard) be a good choice?

I mean, can I make this PC better for CFD, or would I be better off getting a whole new workstation?

September 4, 2024, 01:04   #9

Will Kernkamp
Quote:
Originally Posted by JulioPieri View Post
I might upgrade this PC in the near future.

Do you think it's better to abandon the hybrid i9-13900K completely and buy a fresh workstation with a more modular setup and processors better suited for CFD? Or would reusing this processor and, say, adding another one (changing to a two-socket motherboard) be a good choice?

I mean, can I make this PC better for CFD, or would I be better off getting a whole new workstation?
For CFD you would probably want a dual-EPYC system with a DIMM in every channel. Such a configuration is a lot faster, but also pricier. The board may also not fit in your case. And dual-CPU systems are not as good for gaming.


Have you actually had slow runtimes?

September 4, 2024, 10:21   #10

Julio Pieri
Thank you for the suggestion. It's actually a PC dedicated to CFD, which I purchased considering only the processor's clock. At the time I was blown away by the 5.6GHz of the 13900K, and I thought I could make use of all 24 cores / 32 threads available, even with non-linear scaling. But only being able to effectively use 8 cores really fell short of my expectations.

I don't think I have especially slow runtimes. The cavity3D case with 1M cells runs to a simulated time of 0.015s in:

8.14s on 8 cores (HT on)
7.71s on 16 cores (HT on)
7.29s on 32 threads (HT on)

From your comments in the other thread, that seems like a good result for 8 cores. For the other decomposition settings the gains are marginal, maybe within tolerance. I wouldn't expect any reduction in runtime from using more than 8 cores on the 13900K.
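Quantifying "marginal" from the runtimes above (a quick check on the numbers as reported):

```shell
# Relative speedup of the 16- and 32-thread runs vs the 8-core run
speedups=$(awk 'BEGIN {
  t8 = 8.14; t16 = 7.71; t32 = 7.29
  printf "16 threads vs 8 cores: %.0f%% faster\n", (t8 / t16 - 1) * 100
  printf "32 threads vs 8 cores: %.0f%% faster\n", (t8 / t32 - 1) * 100
}')
echo "$speedups"
```

So tripling the core count buys barely a tenth in runtime, consistent with the memory-bandwidth bottleneck discussed above.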

But I want to further increase my processing capacity so I can take on more complex projects. Also being able to run multiple simulations at once is appealing.

Tags
13900, hyperthreading, mpirun

