Garbage FloTHERM performance from 32-core system


December 19, 2018, 14:29 | #1
zwilhoit | New Member | Join Date: Nov 2018 | Location: USA | Posts: 9
Garbage FloTHERM performance from 32-core system
2x Epyc 7301, 16x16GB DDR4-2666

I've got 32 cores at 2.7 GHz and 16 channels of memory bandwidth. Synthetic memory benchmarks show total read/write bandwidth in excess of 310 GB/s. The system has 8 NUMA nodes and runs Windows Server 2016.
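If anyone wants a rough sanity check on their own box without a benchmark suite, here's a crude single-threaded probe in Python with numpy. It's nothing like a proper multi-threaded, NUMA-aware STREAM run, just a ballpark per-core number:

Code:
import time
import numpy as np

# Crude single-threaded bandwidth probe: time one large array copy.
# Allocates ~4 GiB total, so run it on a machine with headroom.
a = np.ones(2**28, dtype=np.float64)   # 2 GiB source buffer
b = np.empty_like(a)                   # 2 GiB destination buffer

t0 = time.perf_counter()
np.copyto(b, a)
dt = time.perf_counter() - t0

# A copy reads and writes every byte once, so bytes moved = 2 * nbytes.
print(f"~{2 * a.nbytes / dt / 1e9:.1f} GB/s, single thread")

On a machine advertising 310 GB/s aggregate, a single core will only see a small fraction of that; reaching the aggregate figure takes many threads spread across the NUMA nodes.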

Mentor tech support indicates they've done internal testing showing good scaling on up to 32 cores.

SMT is off and I've enabled memory interleaving at the channel level. I haven't tried die or socket interleaving yet, but those seem like they'd be worse options.

I see two problems:

First, watching CPU utilization while solving, there are long stretches where FloTHERM uses only a single thread, when initializing or finalizing a radiation factor calculation or the main thermal/airflow solver. Is that expected behavior? For a lot of models, those serial periods take longer than the parallel solvers do!

But the much larger problem is that the parallel solvers are painfully slow on the Epyc system too. Parallel solve time is equivalent to or slower than that of the generic 4-6 core, 2-memory-channel desktops and laptops we have.

Compared to a 12-core, 8-memory-channel Haswell system we have, total solve times for our benchmark model are 2.5 times SLOWER, even though we have 1.9x the parallel compute resources and 2.3x the memory bandwidth. Single-threaded performance is about 75% of the Haswell system's.

In terms of CPU time, the Haswell system spends 1845 minutes of CPU time in the parallel solver while the Epyc system spends 15007 minutes of CPU time. So on a per thread basis, the Haswell system is running 8 times faster in the parallel solver than the Epyc system. Something is egregiously wrong.... Any ideas?
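Since both machines solve the identical model, per-thread throughput in the parallel solver is inversely proportional to total CPU time, which is where that factor comes from:

\frac{15007\ \text{min (Epyc)}}{1845\ \text{min (Haswell)}} \approx 8.1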

December 21, 2018, 18:07 | #2
zwilhoit
Setting "KMP_AFFINITY=noverbose,nowarnings,compact" in flotherm.bin gives about a 40% improvement, but we're still far from where we should be.
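For anyone who'd rather not edit flotherm.bin directly, a wrapper that sets the variable before launching the solver should behave the same. A minimal Python sketch; the solver path and project name below are placeholders, not the real FloTHERM command line:

Code:
import os
import subprocess

# KMP_AFFINITY tells the Intel OpenMP runtime how to pin worker threads.
# "compact" packs threads onto adjacent cores, which keeps each thread's
# memory traffic on or near its local NUMA node.
env = os.environ.copy()
env["KMP_AFFINITY"] = "noverbose,nowarnings,compact"

# Placeholder invocation -- substitute the solver executable and
# arguments your installation actually uses.
subprocess.run([r"C:\FloTHERM\bin\solver.exe", "benchmark_project"], env=env)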

I strongly suspect this is because our Epyc system has 8 NUMA nodes.

On our 2x E5-2643 v3 server, going from SMP to NUMA mode results in a ~15% performance penalty.

Compared to our 2x E5-2643 v3 machine, we're now about 1.5x slower in FloTHERM. But in the OpenFOAM benchmarks others on this site have run, similar Epyc systems come out about 2x faster, right in line with the hardware advantage.

OpenFOAM benchmarks on various hardware

January 7, 2019, 23:28 | #3
DB99 | New Member | Join Date: Jan 2019 | Posts: 28
FloTHERM performance - which version
What version of FloTHERM are you using, and which type, Classic or XT? The mesher and geometry-transfer functions seem to have sped up significantly in newer versions. I also noticed some time ago that the mesher was only using one core, hence showing less than 10% of total utilization; it seems to have improved lately. I haven't experienced this with the solver.
What is the size of your model, in number of cells?

January 8, 2019, 15:31 | #4
zwilhoit
FloTHERM Classic, version 12.2, 1.6M cells.

But we see the same behavior on larger models too.

Feeding KMP_AFFINITY the "verbose" option pops up a cmd window when the solver is launched. That output shows the OpenMP library isn't detecting our system topology correctly.
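To cross-check what Windows itself reports, here's a quick Python/ctypes call into the Win32 NUMA API. It only counts NUMA nodes, so it says nothing about how the OpenMP runtime maps them, but it should print 8 on this 2x 7301 box; if the verbose affinity dump disagrees, the runtime is working from a wrong picture of the machine:

Code:
import ctypes

# Ask the Windows kernel for the highest NUMA node number.
# Nodes are numbered from 0, so the node count is highest + 1.
kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
highest = ctypes.c_ulong(0)
if not kernel32.GetNumaHighestNodeNumber(ctypes.byref(highest)):
    raise ctypes.WinError(ctypes.get_last_error())
print("NUMA nodes reported by Windows:", highest.value + 1)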

Mentor support has referred me to their R&D department for further investigation.

Some light reading on optimizing OpenMP for the Epyc architecture: http://www.prace-ri.eu/IMG/pdf/Best-...-Guide-AMD.pdf

January 10, 2019, 15:09 | #5
DB99
1.6M cells, 1845 minutes
Wow, something is seriously wrong. I've solved a 24M-cell model of a 2U24 server in that amount of time, actually 18 hours, in Classic V9. That was on a 4-core Dell from 2010 with 16 GB of memory. On a 10-year-old Compaq Presario laptop with 4 GB of memory, a 1.6M-cell model would solve in about 30 minutes to an hour. The laptop has an AMD processor, which seems to run faster than Intel's on some apps.
I've also noticed that some virus-protection programs slow the program down dramatically, to 10% or so of normal speed (Win10 system). I think the virus scanner is inspecting all the temp files the mesher or solver writes and interfering. Try turning virus protection off during meshing or solving.
DB

February 11, 2020, 13:56 | #6
zwilhoit
EPYC 7002 (Rome) processors have moved to a single NUMA node per socket.

Has anybody run FloTHERM on either EPYC 7002 or Threadripper 39X0X series processors? I'm trying to gauge whether upgrading to the new series, with its simpler NUMA topology, would fix my performance issues.

June 19, 2020, 10:12 | #7
ffl3883 | New Member | Join Date: Jun 2020 | Posts: 1
Quote:
Originally Posted by zwilhoit
EPYC 7002 (Rome) processors have moved to a single NUMA node per socket.

Has anybody run FloTHERM on either EPYC 7002 or Threadripper 39X0X series processors? I'm trying to gauge whether upgrading to the new series, with its simpler NUMA topology, would fix my performance issues.
I have dual EPYC 7551 32-core CPUs and 256 GB of memory in my system. Have you found any way to improve FloTHERM performance?

June 21, 2020, 11:05 | #8
zwilhoit
Quote:
Originally Posted by ffl3883
I have dual EPYC 7551 32-core CPUs and 256 GB of memory in my system. Have you found any way to improve FloTHERM performance?
Setting "KMP_AFFINITY=noverbose,nowarnings,compact" was the best we ever found. Considering 2nd-gen EPYC moved to one NUMA node per socket, I doubt this will be fixed any time soon, and my ticket with Mentor R&D got closed. Maybe if Mentor got enough support tickets from enough different companies, they'd fix it.

I considered upgrading the system to 2nd-gen EPYC, but get this: on the early revision of the Supermicro motherboard I have, the BIOS ROM is too small (32 MB instead of 64 MB, IIRC) to take the 2nd-gen processors as a drop-in. Supermicro had to do a dot rev of the motherboard to upgrade the ROM when 2nd-gen EPYC came out... so I can't upgrade the processors without ripping out the motherboard and starting over.

Oh well, we've found lots of other uses for the server (Ansys, Keyshot, etc.).

I really think 2x 7352 would be killer in FloTHERM for the price.

January 29, 2021, 11:17 | #9
zwilhoit
Bump: has anybody tested FloTHERM performance with Zen 2 Threadripper or Epyc?

i.e., Threadripper 3xxxX, Epyc 7xx2

Zen 3 looks to be just around the corner, presumably also with one NUMA node per socket, which I hypothesize would solve the issue. Not to mention massively improved single-thread performance (radiation, starting the solver...).

January 29, 2021, 16:22 | #10
DB99
A lot of this discussion is over my head, but I've seen that the newer OSes with real-time disk encryption (BitLocker on Win10) slow things down a lot. You might see if an exception can be applied.
In FloEFD you can set the radiation solver to run only every x iterations, say every 5th instead of every one; that was recommended by Mentor support. There isn't a setting for this in the FloTHERM app, but you guys who know DOS commands might be able to figure it out.

January 29, 2021, 16:24 | #11
DB99
The easiest way to get FloTHERM to run faster is a higher CPU clock, in the 3+ GHz range: 3.3/2.7 is 22% faster.