#681
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
Processor: 2x E5-2697A v4 (2 x 16 cores)
Memory: 16x 8GB dual-rank DDR4-2400
Motherboard: GIGABYTE MD90-FS0-ZB

Code:
# cores   Meshing (s)   Flow calculation (s)
   1       1239.75         924.05
   2        815.29         483.68
   4        458.74         214.54
   8        279.07         113.42
  12        215.79          85.05
  16        193.16          71.71
  20        161.75          66.04
  24        149.25          61.86
  28        152.74          59.33
  32        140.33          58.83

Last edited by wkernkamp; March 29, 2023 at 19:59.
#682
New Member
Yannick
Join Date: May 2018
Posts: 16
Rep Power: 8
2x EPYC 9474F (2 x 48 cores) with 24x 16GB DDR5-4800
Ubuntu 20.04, OpenFOAM v2206. I haven't had time to optimize anything yet but will look into it later (this was with the "ondemand" power setting).

Code:
# cores   Meshing (s)   Flow calculation (s)
   1        801.26         534.56
   2        533.63         288.29
   4        290.45         115.69
   8        160.53          54.35
  12        118.46          37.13
  24         81.25          20.23
  36         72.74          14.48
  48         67.70          11.61
  64         63.89           9.77
  96         65.29           8.04

(The 96-core flow calculation dropped to 7.86 s with the "performance" power setting.)
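For anyone wanting to reproduce that last comparison, here is a minimal sketch of how the power setting is usually switched on Linux. It assumes the cpupower utility and a frequency driver that exposes the "ondemand"/"performance" governors; tool and governor names vary by distribution and CPU driver.

Code:
# show the governor currently active on each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# switch all cores to the "performance" governor (requires root)
sudo cpupower frequency-set -g performance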
#683
New Member
Eduardo
Join Date: Feb 2019
Posts: 9
Rep Power: 8
Yannick, I feel we are living parallel lives. The last post I uploaded was also a comparison with your setup, and I am about to get a brand new EPYC Gen4 as well (the only difference is 2x 32 cores instead of 48), so we will see how it does.
Good to see these results from your side! Regards
#684
Senior Member
Questions:
1/ Can somebody confirm that the bulk of the CPU time goes into the pressure solve?
2/ Is more fine-grained profiling data (computation vs. memory management) available?
3/ How is the pressure field solved?
Thanks!
#685
New Member
Joost
Join Date: Mar 2023
Posts: 3
Rep Power: 4
I recently bought a bunch of cheap ASUS RS724Q-E7-RS12 2U quad-node servers for some HPC fun. They came equipped with dual E5-2670s per node, so I added 4x 16x 4GB of RAM at $0.40/GB, making a total of ~$200 a box, or $50 per node. Together with an 8-port InfiniBand switch and DACs, the total cost of the 128-core, 800GB/s cluster will be 650 euro, lol. Good OpenFOAM value I would say.
To get things going I installed Rocky 8, Warewulf and OpenFOAM 10 on the head node and booted the rest as diskless compute nodes over Ethernet. I'm still waiting for the InfiniBand parts, so preliminary 1Gb Ethernet benchmark results are presented below. As expected, scaling dropped off a cliff beyond 2 nodes with the GAMG solver on Ethernet, so I switched to the less chatty PCG for the multi-node results. I'm actually quite impressed by the Ethernet results: even with such a low cell count per core, the cluster seems to scale up to 5 nodes on plain Ethernet. Power draw peaks at ~312.5W per node at full load, which makes a good office heater if it weren't noisy as hell. Planning for immersion cooling once everything is up and running. I will update the results after the 40Gb InfiniBand switch and DACs arrive in a week or so, and maybe push the RAM to 1600. Fun stuff.

stream_mpi memory bandwidth results (8 nodes, 128 cores):
Code:
Function    Best Rate MB/s   Avg time   Min time   Max time
Copy:           366039.5     0.007179   0.006994   0.007267
Scale:          358604.2     0.007406   0.007139   0.008823
Add:            409109.5     0.009584   0.009386   0.009711
Triad:          422958.2     0.009453   0.009079   0.009687
=> ~3.3GB/s per core

2M cell motorbike benchmark results on 1Gb/s Ethernet:
Code:
# cores   Meshing (s)   Flow calculation (s)
   2       1212.52        492.178
   4        675.161       238.718
   8        436.056       156.462
  16        315.164       133.48
  24        284.37        100.045
  32        352.878        79.2915
  40        284.02         77.8058
  48        294.293        76.6217
  56        295.871        66.3537
  64        244.25         40.287
  72        316.725        55.444
  80        302.017        36.4447
  88        327.945        45.045
  96        314.297        45.448
 104        330.904        44.2892
 112        330.916        46.0728
 120        333.152        40.8395
 128        348.665        37.8275
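For reference, a sketch of the GAMG-to-PCG switch mentioned above. It assumes the stock motorBike fvSolution layout and that the foamDictionary utility with dotted scoping is available; otherwise just edit the "p" entry in system/fvSolution by hand. PCG avoids GAMG's coarse-level communication, which tends to dominate on slow interconnects.

Code:
# switch the pressure solver from GAMG to DIC-preconditioned CG
# (entry names may differ between OpenFOAM versions)
foamDictionary -entry solvers.p.solver         -set PCG system/fvSolution
foamDictionary -entry solvers.p.preconditioner -set DIC system/fvSolution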
#686
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
Check that some of these runs completed correctly; some values don't make sense. For example, the 64-core result looks suspect. I have an HP server with 4x E5-4657L v2 and DDR3-1866 memory; its 48 cores complete the benchmark in 40 seconds, while your cluster does 48 cores in 76 seconds. This means you have a lot of room for improvement, and not just in CPUs, memory and networking.
Also look at the MPI command: if your system were running the problem with six cores per CPU, it should outperform my server, because six cores on a four-channel CPU should see less memory throttling than my 12 cores per CPU, even when the memory is 1333.
Also check the proper functioning of all your memory sticks using "sudo dmidecode -t 17": if one node has a defective or unevenly populated setup, that node will slow down every iteration, because the other nodes have to wait for it to complete.
I think you should be able to realize dual-EPYC-level performance with your bargain! My estimate is that with 12-core v2 CPUs and DDR3-1866 memory, your cluster should complete the benchmark in 10 seconds on 192 cores.
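A sketch of those two checks, for anyone following along. The mpirun options use Open MPI mapping/binding syntax, the hostfile name is a placeholder, and six ranks per socket is just the example figure from the post.

Code:
# list the installed DIMMs (size, speed, rank) -- run on every node
sudo dmidecode -t 17 | grep -E 'Size|Speed|Rank'

# run with only 6 ranks per socket to ease memory-bandwidth pressure
# (8 nodes x 2 sockets x 6 ranks = 96)
mpirun -np 96 --hostfile hosts --map-by ppr:6:socket --bind-to core \
    simpleFoam -parallel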
#688
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
1/ At high core count per memory channel, the solution is bottlenecked by memory bandwidth. For example, my E5-2697A v4 is just 2% faster than my E5-2683 v4 in the same machine, even though these CPUs are otherwise the same and their operating frequencies differ by more than 10%.
2/ I don't have more fine-grained profiling data.
3/ OpenFOAM solves the equations one field at a time, so each velocity component has a separate solve that visits all cells of the local grid. The loop in simpleFoam.C is:
Code:
while (simple.loop())
{
    Info<< "Time = " << runTime.timeName() << nl << endl;

    // --- Pressure-velocity SIMPLE corrector
    {
        #include "UEqn.H"
        #include "pEqn.H"
    }

    laminarTransport.correct();
    turbulence->correct();

    runTime.write();

    runTime.printExecutionTime(Info);
}
At high core count per channel I think there are four loops through the cells, each taking a comparable amount of time because of the waiting on memory access. I don't quite remember, but it may be that the UEqn code loops three times, once for each velocity component. In that case there would be six loops, not four.
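(One possible route to finer-grained data, offered only as a sketch and not something used for the numbers in this thread: OpenFOAM.com builds such as v2206 can write per-section timings when a profiling sub-dictionary is added to the case controlDict.)

Code:
# append a profiling block to controlDict; the report is written at write times
# (assumes an OpenFOAM.com build that supports the controlDict profiling entry)
cat >> system/controlDict <<'EOF'
profiling
{
    active      true;
    cpuInfo     true;
    memInfo     true;
}
EOF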
#689
Senior Member
@wkernkamp: thank you for your reply.
The motorBike tutorial comes by default with GAMG as solver for the pressure field. Could someone here please try GAMG as a preconditioner instead using Code:
p
{
    solver          PCG;
    preconditioner
    {
        preconditioner  GAMG;
        smoother        GaussSeidel;
        nPreSweeps      1;
        nPostSweeps     1;
        directSolveCoarsest yes;
    }
    tolerance       0;
    relTol          1e-6;
    maxIter         50;
}
Many thanks. Domenico.
#690
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
Domenico,
In this thread we compare hardware against the same case. The case should not be changed by modifying the input, otherwise the comparisons don't work. I suggest you raise your question in an OpenFOAM development thread instead. For those comparisons you could still use the benchmark set-up, because it already tracks times against the number of cores, and you can do the runs yourself with the benchmark case provided earlier in this thread.
Will
#691
Senior Member
Dear Will,
Many thanks again. I am inclined to believe that, along with the results on the agreed test case, one could publish results on modified versions of the test case that possibly run faster. On a variant of the Pitz-Daily test case, switching from GAMG as the solver to GAMG as a preconditioner for the pressure in the incompressible flow case resulted in a fourfold reduction in CG iterations and a 40% reduction in CPU time. Results as you intend them give the desired standardisation; variants such as the one I'm suggesting might show the versatility of OpenFOAM.
Cheers, Domenico.
#692
Senior Member
René Thibault
Join Date: Dec 2019
Location: Canada
Posts: 114
Rep Power: 7
After a couple of weeks, here is my new configuration.
Lenovo ThinkStation P710, 2x E5-2698 v4 (2 x 20 cores; 12 DDR4 DIMM slots, 384GB max.) populated with 256GB, Ubuntu 22.04 LTS, OpenFOAM v2212. Special thanks to wkernkamp for your help and tips.

Code:
# cores   Meshing (s)   Flow calculation (s)
   1       1301.56        1090.96
   2        866.03         553.80
   4        504.52         239.95
   8        301.00         129.58
  12        233.17          96.83
  16        217.97          82.93
  20        187.11          76.96
  24        174.52          74.33
  28        178.07          71.48
  32        163.61          70.65
#693
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
I presume you now have eight RDIMMs of DDR4-2133, one DIMM per channel. So you can still improve by going to DDR4-2400, when you get your inheritance. Otherwise, you now have one of the top performers for dual Haswell/Broadwell, socket 2011-3. Congratulations!
#694
Senior Member
René Thibault
Join Date: Dec 2019
Location: Canada
Posts: 114
Rep Power: 7
And yes, I have eight DDR4-2133 DIMMs, but dual-rank instead of quad-rank like the other machine has. I've noticed that those CPUs have a base frequency of 2.2GHz but can ramp up to 3.6GHz. So, if I understand correctly, I could overclock it to 2.4GHz?
Thanks by the way! I would not be able to do that without your support!
Best regards,
Last edited by Tibo99; April 18, 2023 at 09:25.
#695
Member
Alejandro Valeije
Join Date: Nov 2014
Location: Spain
Posts: 52
Rep Power: 12
Hi Yannick,
Would it be possible for you (or anybody, it is an open request) to run a benchmark with a heavier case, e.g. the motorbike with 30 or 60 million elements? I am trying to set up a server and I need to convince the "bossman" that the speedups over our current setup hold for high element counts as well. I have looked on openbenchmarking but there are no heavy cases run on these new processors.
Thanks to anybody who can help. Regards
#696
New Member
Yannick
Join Date: May 2018
Posts: 16
Rep Power: 8
Hi Alejandro,
Hmm, I haven't been working with OpenFOAM for quite a while, so I don't really know if there is a suitable benchmark (one where I don't have to fiddle around too much to make it work). If there is one, I am happy to run it and compare against 2x 7742 EPYCs. I don't know if that helps you at all, but I ran >10 million cell benchmarks in Star-CCM+ and saw 2x the performance of the 2x 7742 there as well. I assume this will also be the case for higher cell counts.
#697
Senior Member
Will Kernkamp
Join Date: Jun 2014
Posts: 381
Rep Power: 14
The Broadwell CPUs cannot be overclocked.
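What you can still check is how close the cores actually get to their turbo bins during a run. A minimal sketch (reads the kernel's reported per-core clock; the figures depend on the active governor and load):

Code:
# sample the per-core clock once per second while the benchmark is running
watch -n 1 "grep 'cpu MHz' /proc/cpuinfo | sort -k4 -nr | head -4"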
#698
Member
Alejandro Valeije
Join Date: Nov 2014
Location: Spain
Posts: 52
Rep Power: 12
Hi again,
I would be very interested in the "drivaerFastback" tutorial included in OpenFOAM 10, with the large mesh. To run it, the only command line you need is ./Allrun -c xx -m L, where "xx" is the number of cores. If you can't run this version, I could adapt the tutorial to the version you have, or maybe look for another one. Since I am more interested in raw power than in scaling, I don't need you to run it in several configurations. If you can't run it at all, I still think it's amazing that you got 2x the performance of the 7742, although I am not familiar with Star-CCM+. These CPUs are fire.
Best regards,
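For completeness, a sketch of how such a run might look end to end. The tutorial location below is an assumption about the OpenFOAM 10 tree (locate it with "find $FOAM_TUTORIALS -name drivaerFastback" if it differs), and 96 is just an example core count.

Code:
# copy the tutorial into a scratch directory and run the large mesh on 96 cores
cp -r "$FOAM_TUTORIALS"/incompressible/simpleFoam/drivaerFastback .
cd drivaerFastback
./Allrun -c 96 -m L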
#699
New Member
Yannick
Join Date: May 2018
Posts: 16
Rep Power: 8
Unfortunately, the only version I have installed is v2206. Is there a good test case that is available in both OpenFOAM 10 and v2206? Or maybe I can find a way to download that tutorial somewhere...
#700
Senior Member
René Thibault
Join Date: Dec 2019
Location: Canada
Posts: 114
Rep Power: 7
So, changing this setting is probably the last thing I can do to push the envelope without affecting the hardware too much, right?
Regards,