OpenFOAM benchmarks on various hardware

October 5, 2020, 03:54   #321   meshingpumpkins (Andi)
Result: 2x AMD EPYC 7542 32-core / Ubuntu 18 / ESI 2006
Thanks to all the contributors here!

Here are my results. As expected, they are very similar to the other EPYC 7542 results posted here.

I am pretty happy with this setup.

System:

2x AMD EPYC 7542 32-core / 16x 32GB DDR4-3200 / Ubuntu 18 / ESI 2006

Result:

Code:
Cores   Time(s)   Speedup
1       784.73     1.00
4       171.79     4.57
8        89.61     8.76
12       66.93    11.72
16       43.94    17.86
20       40.56    19.35
24       35.06    22.38
28       34.21    22.94
32       30.04    26.12
36       29.57    26.54
40       27.56    28.47
44       27.79    28.24
48       24.38    32.19
52       25.07    31.30
56       25.50    30.77
60       24.48    32.06
64       23.46    33.45

Last edited by meshingpumpkins; October 7, 2020 at 07:48. Reason: correction of data

October 7, 2020, 06:46   #322   Kailee71 (Kailee)
Hi Meshingpumpkins,

I think your last column might need a correction; it needs to be the inverse... You divided the runtime by 100, but what you should do for it/s is to divide 100 by the runtime... But no biggie...

Also: as on many platforms the throughput is limited by memory bandwidth, maximum performance is often reached well before all cores are utilized - would it not be interesting to somehow get power consumption into the results? Not trivial, I know, but if you get 90% of the performance with 60% of the cores used, that would be an interesting investigation, no?
Cheers,

Kai.

October 7, 2020, 07:55   #323   meshingpumpkins (Andi)
Result: 2x AMD EPYC 7542 32-core / Ubuntu 18 / ESI 2006 addon
Quote:
Originally Posted by Kailee71
I think your last column might need a correction; it needs to be the inverse... [...]
Thanks, you are absolutely right.

About the performance vs. power consumption question: this is interesting, but if a server has multiple users, the cores get shared anyway. In my opinion it also depends on the case being run. One could say, though, that for parameter studies of a case it would be a better idea to use half the number of cores to increase efficiency.

Attached: speedup_1_s6.jpg (speedup plot)

October 7, 2020, 09:09   #324   JBeilke (Joern Beilke)
Quote:
Originally Posted by Kailee71
I think your last column might need a correction; it needs to be the inverse... [...]

Speedup in the last column means runtime on 1 core divided by runtime on 64 cores, and the original table is right.
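For anyone recomputing the metrics: the benchmark runs 100 iterations, so it/s = 100 / runtime, while speedup(N) = runtime(1 core) / runtime(N cores). A minimal sketch, assuming a hypothetical times.txt holding "cores runtime" pairs with the single-core row first:

Code:
# print speedup and iterations/s for each row of times.txt
awk 'NR==1 {t1=$2} {printf "%2d cores: speedup %6.2f, %5.2f it/s\n", $1, t1/$2, 100/$2}' times.txt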

October 7, 2020, 09:21   #325   meshingpumpkins (Andi)
Quote:
Originally Posted by JBeilke
Speedup in the last column means runtime on 1 core divided by runtime on 64 cores, and the original table is right.
Sorry: Kai was right. I corrected the first post and deleted the last column.

October 7, 2020, 11:02   #326   Kailee71 (Kailee)
... however, the it/s was an interesting metric! Could you add it back in?

;-)

October 14, 2020, 13:08   #327   Kailee71 (Kailee)
Power requirements...
Hello all,

I recently got an HP DL380p with (only) 2x E5-2630 v1 and 16x 4GB 2Rx4 DDR3-1333 (64GB total), and through the excellent iLO I can also monitor power draw (see attached image). I'll do some comparisons of bare-metal Ubuntu 20.04, ESXi 6.7, Win10 WSL, and lastly FreeNAS with an Ubuntu VM (just for kicks). All will use OF7 from .org, installed natively through their Ubuntu repository, so no software optimization at all.
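For anyone who wants to log this without clicking around the web UI: iLO also answers IPMI, so the power meter can be sampled from a shell. A minimal sketch, assuming ipmitool is installed and IPMI-over-LAN is enabled on the iLO; <ilo-host>, <user> and <pass> are placeholders:

Code:
# sample the iLO power meter once per second
while sleep 1; do
    ipmitool -I lanplus -H <ilo-host> -U <user> -P <pass> dcmi power reading \
        | awk '/Instantaneous power reading/ {print $(NF-1), "Watts"}'
done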

To start off, here's Ubuntu 20.04:
Code:
SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	147	2447	0.100
2	158	1557	0.068
4	207	906	0.052
6	223	636	0.039
8	240	522	0.035
12	275	422	0.032

Sim
Cores	Pwr(W)	Time(s)	kWh
1	162	1252	0.056
2	184	645	0.033
4	256	290	0.021
6	288	211	0.017
8	320	176	0.016
12	358	149	0.015
Interesting to me is the last column: certainly for SHM, but also for the sim itself, it seems that with this setup using all cores is advisable; of course these CPUs only have 6 cores each, so they are not bottlenecked by the quad-channel memory. Extrapolating from these values, though, it seems we're close to a sweet spot power-draw-wise - the energy per run is flattening off at 12 used cores, whereas I would expect another 2 or even 4 cores per CPU to still reduce runtimes, if maybe only a little for 10c/CPU over 8c/CPU.

However: I take away from this that running all 12 cores reduces the energy cost of the benchmark by 2/3 for SHM, and by nearly 3/4 for the sim, compared to running single-core.
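The kWh column is just average power times runtime, kWh = W x s / 3.6e6. A one-line sanity check for the 12-core sim row, using the values from the table above:

Code:
# 358 W for 149 s; 3.6e6 J per kWh
awk 'BEGIN {printf "%.3f kWh\n", 358 * 149 / 3.6e6}'    # -> 0.015 kWh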

Results using VMs on the same hardware will follow over the next few days.

Cheers,

Kai.
Attached: Openfoam_iLO.jpg (iLO power-draw graph, 77.5 KB)

October 15, 2020, 21:48   #328   wkernkamp (Will Kernkamp)
Kai,

What is your idle power?

Will

October 16, 2020, 03:17   #329   Kailee71 (Kailee)
Hi Will,

Idle power hovers between 90 and 100 Watts.

Cheers,

Kai.

October 16, 2020, 09:34   #330   Kailee71 (Kailee)
OK, now with ESXi 6.5, same hardware as above (DL380p, 2x E5-2630 v1, 16x 2Rx4 DDR3-1333); the VM is identical to the bare-metal setup above.

Code:
SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	158	2522	0.110
2	166	1635	0.075
4	209	936	0.054
6	230	646	0.041
8	239	535	0.036
12	273	430	0.033

Sim
Cores	Pwr(W)	Time(s)	kWh
1	169	1285	0.060
2	189	670	0.035
4	257	302	0.022
6	288	217	0.017
8	317	182	0.016
12	357	154	0.015
Very interesting. After my experience with the dual X5670 machine earlier this year I wasn't hopeful, but wow, this is usable, especially if all cores are used. Very pleased with this!

Kai.

October 16, 2020, 16:46   #331   Kailee71 (Kailee)
Now with bhyve, under FreeNAS 11.3:

Code:
SnappyHexMesh
Cores	Pwr(W)	Time(s)	kWh
1	160	2728	0.12
2	180	1719	0.086
4	207	1028	0.059
6	232	717	0.046
8	249	604	0.042
12	262	924	0.067

Sim
Cores	Pwr(W)	Time(s)	kWh
1	179	1617	0.080
2	210	756	0.044
4	245	427	0.029
6	266	339	0.025
8	285	317	0.025
12	280	556	0.043
This one threw me. Initially I had exactly the same setup as in the previous two sets, but got results that were way worse (like, twice the runtime). Looking at the processor usage in FreeNAS I noticed that, with 12 threads, the machine was indeed running at only 50% capacity. So I tried turning *off* hyperthreading, so that only 12 threads were exposed to FreeNAS. This did improve things, but only up to the 8-thread test; with 12 threads it was still much worse than bare metal or ESXi/Ubuntu.
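If anyone wants to dig into this: a quick way to see which logical CPUs are hyperthread siblings (siblings share a CORE id) is lscpu's extended output, run on the host and inside the guest:

Code:
# map logical CPUs to physical cores and sockets
lscpu --extended=CPU,CORE,SOCKET,ONLINE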

If anyone has any information on this, please let me know - it would be very interesting for me to run this under FreeNAS directly, rather than having to revert to running ESXi, then FreeNAS as one VM, and OpenFOAM in another.

Any help much appreciated.

Kai.

November 16, 2020, 15:46   #332   wildemam (M Shaaban)
For OpenFOAM 8 (Foundation), users will have to:

1. Comment out the function objects (streamlines and "wallBoundedStreamLines") in the controlDict.

2. Change the include in the meshQualityDict to #includeEtc "caseDicts/mesh/generation/meshQualityDict".

3. Copy the 'surfaceFeaturesDict' from the tutorial case, and change the surfaceFeatureExtract application in line 9 of Allmesh in the base case to "runApplication surfaceFeatures".

Then it works; a sketch of these edits as shell commands follows below. Let's see how my server stands out.
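A hedged sketch of steps 2 and 3, run from the benchmark case directory; the sed patterns and the tutorial path are assumptions, so adapt them to your copy of the case (step 1 is best done by hand, since the exact form of the function object entries varies):

Code:
# step 2: point the include at the OF8 location of meshQualityDict
sed -i 's|#includeEtc ".*meshQualityDict"|#includeEtc "caseDicts/mesh/generation/meshQualityDict"|' \
    system/meshQualityDict
# step 3: switch from surfaceFeatureExtract to surfaceFeatures
cp "$FOAM_TUTORIALS/incompressible/simpleFoam/motorBike/system/surfaceFeaturesDict" system/
sed -i 's/runApplication surfaceFeatureExtract/runApplication surfaceFeatures/' Allmesh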

November 17, 2020, 12:16   #333   wildemam (M Shaaban)
4x Intel(R) Xeon(R) CPU E5-4657L v2 @ 2.40GHz
128 GB DDR3 1600 MHz
OpenFOAM 8
Ubuntu 20


# cores Wall time (s):
------------------------
48 77.45
44 77.66
40 77.43
36 77.34
32 77.59
28 78.45
24 79.93
16 89.9
8 133.07
4 245.4
2 652.24
1 27.39

Meshing:
48 real 4m19.655s
44 real 3m43.624s
40 real 3m54.778s
36 real 3m51.182s
32 real 3m48.851s
28 real 3m54.084s
24 real 4m19.289s
16 real 5m46.104s
8 real 7m19.078s
4 real 12m8.124s
2 real 23m45.691s
1 real 0m3.501s


Hitting some ceiling there. I verified that I have 32GB per NUMA node. Any ideas for checking the reason for the bottleneck beyond 24 cores?

November 17, 2020, 13:43   #334   flotus1 (Alex)
How is the memory populated? 16x16GB?
# dmidecode -t 17
in case you need to find out. htop provides a quick and easy way to check which cores are utilized.
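For example (needs root; the grep is just a convenience filter), empty slots report "No Module Installed":

Code:
# list DIMM slot names and sizes
sudo dmidecode -t 17 | grep -E "Size|Locator"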

November 18, 2020, 08:09   #335   wildemam (M Shaaban)
Quote:
Originally Posted by flotus1
How is the memory populated? 16x16GB? [...]
Thanks for your reply, flotus1.

There are 8x 16GB 1600MHz DIMMs, populated in banks 0, 1, 12, 13, 24, 25, 36 and 37.

I guess I will need to get more RAM.

November 18, 2020, 08:12   #336   flotus1 (Alex)
Yeah, my math didn't check out. I meant 16x8GB.
Anyway, you would need 16 identical DIMMs to get peak performance with this system. The scaling behavior you got is pretty typical for not having all memory channels populated.

November 19, 2020, 15:04   #337   Novel (Roman G.)
We just bought a new workstation for our department. Thanks to this thread we were able to find a good configuration.

The following setup was done:
OpenFOAM was compiled with the flag "-march=znver1". SMT was also switched off, and all cores were set to performance mode using "cpupower frequency-set -g performance" from the HPC Tuning Guide provided by AMD (http://developer.amd.com/wp-content/resources/56420.pdf).
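For reference, the runtime part of this tuning boils down to two commands; a minimal sketch, assuming a kernel recent enough to expose SMT control via sysfs (SMT can also be disabled in the BIOS instead):

Code:
# set the performance governor on all cores, then disable SMT at runtime
sudo cpupower frequency-set -g performance
echo off | sudo tee /sys/devices/system/cpu/smt/control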



CPU: 2x AMD EPYC 7532 (Zen2 Rome) 32-core, 200W, 2.4GHz, 256MB L3 cache, DDR4-3200
RAM: 256GB (16x 16GB) DDR4-3200 DIMM, REG, ECC, 2R

OpenFOAM v7


Code:
Cores   Time(s)   Speedup
1       677.34     1.00
2       363.04     1.87
4       161.42     4.20
6       101.82     6.65
8        77.16     8.78
12       52.28    12.96
16       39.40    17.19
20       32.01    21.16
24       27.31    24.80
28       24.15    28.05
32       21.53    31.46
36       21.32    31.77
40       20.46    33.11
44       18.99    35.67
48       18.12    37.38
52       17.45    38.82
56       17.06    39.70
60       16.50    41.05
64       15.91    42.57
Up to 32 cores the scaling is perfect; afterwards it starts to drop... Is it just caused by memory bandwidth, or can there be other things causing this drop?

November 19, 2020, 16:12   #338   flotus1 (Alex)
Any particular reason for the use of znver1 instead of znver2?
Bandwidth will be part of the reason why scaling tapers off. Lower CPU frequency with more busy cores might be another contribution.
But overall, performance looks pretty impressive.

November 19, 2020, 20:29   #339   wildemam (M Shaaban)
Quote:
Originally Posted by wildemam
4 x Intel(R) Xeon(R) CPU E5-4657L v2 @ 2.40GHz, 128 GB DDR3 1600 MHz [...] Any ideas for checking the reason for the bottleneck beyond 24 cores?
I just populated the system with 8 more 16GB DDR3-1600 DIMMs.

# cores Wall time (s):
------------------------
48 45.04
44 45.62
40 46.08
36 47.52
32 49
28 52.01
24 56.36
16 73.13
8 127.29
4 239.67
2 602.69

So the added RAM made it faster and more scalable. The results are similar to those of other Xeon processors.

Any recommendations or hints on best practices when running several simulations on the same machine?

November 20, 2020, 02:14   #340   Novel (Roman G.)
Quote:
Originally Posted by flotus1
Any particular reason for the use of znver1 instead of znver2? [...]
Oops, sorry - we actually did compile it using znver2.
