Abysmal performance of 64 cores opteron based workstation for CFD
April 5, 2018, 08:16 |
#1 |
Member
Paul Palladium
Join Date: Jan 2016
Posts: 93
Rep Power: 10 |
Dear All,
Recently we decided to build a 64-core Opteron (6380) workstation for CFD (OpenFOAM). The motherboard and the Opterons are not new (but in excellent condition). We decided to limit the number of new components to reduce the price of the workstation (mainly because it's just a test).

The configuration is the following:
Quad-socket motherboard Supermicro H8QGi+-F
4 x AMD Opteron 6380 (@ 2.5 GHz)
16 x 16 GB RAM DDR3 1600 MHz, 1.5 V
Ubuntu 16.04 LTS
OpenFOAM 1712+ (compiled from source, https://openfoam.com/download/install-source.php)

At the beginning we ran the same benchmark (structured grids) as here: https://www.pugetsystems.com/labs/hp...d-Opteron-587/

We obtained very good results: on 64 cores the run time was 197 sec, and the scalability was almost perfect up to 32 cores (speed is linear with the number of cores).

#cores   #runtime (sec)
64       197
32       301
16       606
8        1230

On a 10-core Xeon workstation (E5-2687W v3) we obtain:

#cores   #runtime (sec)
8        1100

After seeing these results we were pretty enthusiastic. The performance of the Opteron workstation is not bad.

The problems came up when we decided to test the performance on unstructured grids (with the simpleFoam and interFoam solvers). The scalability was horrible. With 64 cores the run is only twice as fast as with... 8 cores. We gain only 20% between 32 cores and 64 cores. (We used the scotch algorithm to minimize the communication between processors.) This is so abysmal that the little 10-core workstation runs faster than the 64-core Opteron!

We have read many posts on the forum:
AMD FX 8-core or 6-core?
4 cpu motherboard for CFD
HPC server (AMD) to 40 million cells - How cores do I need? Where I can buy it?
mpirun, best parameters

It seems that there is indeed a problem with quad-socket Opterons and unstructured CFD, so we tried the solution proposed in this link: https://www.anandtech.com/show/4486/...mark-session/5

We enabled node interleaving in the motherboard BIOS (and also tried the settings here: http://hamelot.io/programming/the-im...compute-nodes/), but unfortunately we didn't observe any change in performance. We also tried mpirun options like --bind-to none or --bind-to socket, without success.

Maybe someone on this forum has an idea on how to improve the performance of this workstation (a BIOS problem? a RAM problem? a problem with the OpenFOAM compilation?). Of course we knew that we couldn't get performance similar to a true 64-core Xeon cluster. We expected performance similar to a ~30/40-core Xeon workstation, not an 8-core one!

Thanks in advance for helping us!
F
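To put numbers on "almost perfect up to 32 cores", here is a minimal sketch that computes speedup and parallel efficiency from the structured-grid run times above, taking the 8-core run as the baseline. The function name `scaling_table` is only illustrative.

```python
# Run times (seconds) copied from the structured-grid benchmark table above.
runtimes = {8: 1230, 16: 606, 32: 301, 64: 197}


def scaling_table(times, base_cores=8):
    """Return (cores, speedup, efficiency) tuples relative to the base_cores run."""
    base_time = times[base_cores]
    rows = []
    for cores in sorted(times):
        speedup = base_time / times[cores]   # how much faster than the baseline
        ideal = cores / base_cores           # speedup expected from linear scaling
        rows.append((cores, speedup, speedup / ideal))
    return rows


for cores, speedup, efficiency in scaling_table(runtimes):
    print(f"{cores:>3} cores: speedup {speedup:4.2f}x, efficiency {efficiency:4.0%}")
```

With these figures the efficiency stays at about 100% up to 32 cores and drops to roughly 78% at 64 cores, which is the yardstick the unstructured cases fall far short of.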
|
April 5, 2018, 09:23 |
|
#2 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
I never had the chance to play with this architecture, so I might not be of much help here.
But my 2 cents are: enabling node interleaving for a NUMA-aware application (like OpenFOAM) is counterproductive. I would only enable channel and rank interleaving. Node interleaving would increase latency and decrease bandwidth compared to NUMA, where each processor has direct access to its own memory pool.

I think binding processes to the correct cores is the key to success here, i.e. spreading them out as evenly as possible across all 4 nodes and making sure the processes stay on the same core. Further improvements might be possible by having processes with high mutual communication on the same node instead of having them transfer data between nodes. But achieving this in a repeatable and easy-to-use fashion is beyond my capabilities.

Edit: thinking about this again: "--bind-to core" combined with "--map-by ppr:N:socket" should get you pretty far. Here N is the number of threads per socket, in your case the total number of threads divided by 4. Then again, intra-node latencies might also play a role here, and I know nothing about this architecture. All I know is that these "16" core CPUs actually only have 8 FPUs. This might be something you should take into account when distributing processes.

How is the performance with NUMA and all threads running on only one node?

Last edited by flotus1; April 5, 2018 at 12:02.
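A small sketch of the arithmetic behind "--map-by ppr:N:socket": N is simply the total rank count divided by the number of sockets. The helper name `mpirun_command` and the choice of simpleFoam as the solver are assumptions for illustration; the two Open MPI options are the ones named in the post.

```python
import shlex


def mpirun_command(total_ranks, sockets=4, solver="simpleFoam"):
    """Build an Open MPI command line that spreads ranks evenly over the sockets."""
    if total_ranks % sockets != 0:
        raise ValueError("total_ranks must be divisible by the socket count")
    per_socket = total_ranks // sockets  # the N in ppr:N:socket
    args = [
        "mpirun", "-np", str(total_ranks),
        "--bind-to", "core",                     # keep each rank on one core
        "--map-by", f"ppr:{per_socket}:socket",  # N ranks per socket
        solver, "-parallel",                     # assumed OpenFOAM solver call
    ]
    return shlex.join(args)


print(mpirun_command(32))
```

For 32 ranks on the four sockets this gives N = 8. Printing the command instead of running it makes it easy to check the mapping before launching a long job.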
|
April 5, 2018, 12:45 |
|
#3 | |||||
Member
Paul Palladium
Join Date: Jan 2016
Posts: 93
Rep Power: 10 |
Dear flotus1,
Thank you very much for helping us. We disabled node interleaving. Channel and rank interleaving were already enabled.

We ran some quick tests on a flow around a hull (4 M cells) using only 32 cores (because, as you said, there are only half as many FPUs). We compared the first 10 iterations, with the following results:

#mpirun options (32 cores)                      #run time for the first 10 iterations (sec)
default options                                 193
--bind-to-core                                  360
--bind-to-socket                                196
--bind-to-core -cpu-set 1,3,5,...,63            168.1 (little improvement)
--bind-to-core -cpu-set 1,3,5,...,63 numactl    168.1
numactl                                         185

After that, the average time between iterations is (in the best case) around 15 sec. The problem here is that the performance is really bad. With the 8-core Xeon (E5-2687W v3), after the first iterations being very slow, the average time between two iterations is about 20 sec. So 32 Opteron cores are only 20% faster than 8 Xeon cores.
Quote:
Quote:
Here is the result of Quote:
Quote:
Quote:
F |
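The core list 1,3,5,...,63 used with -cpu-set in the table above can be generated rather than typed out. This is only a sketch: it assumes that the two cores sharing an FPU have adjacent IDs (0 and 1, 2 and 3, ...), which should be checked against the actual topology of the machine. The function name is illustrative.

```python
def one_core_per_module(total_cores=64, offset=1):
    """Return every second core ID, i.e. 1, 3, 5, ..., 63 for 64 cores.

    Assumption: cores 2k and 2k+1 belong to the same module and share one FPU.
    """
    return list(range(offset, total_cores, 2))


core_ids = one_core_per_module()
cpu_set = ",".join(str(core) for core in core_ids)
print(len(core_ids), "cores:", cpu_set)
```

The resulting string can be passed as the value of the cpu-set option, so that each of the 32 ranks gets a core whose FPU it does not have to share.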
April 5, 2018, 17:20 |
|
#4 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
OK, it is worse than I thought. Those segmented caches mess things up even more. I did not expect 8 NUMA nodes on this system.
But to get some grip on the problem, I think you should first establish a baseline: single-core performance. Run the case on one thread with bind-to core. This will help you estimate whether the problem has to do with parallelization and thread mapping or with something totally different.

It might also be a good idea to use this benchmark instead: OpenFOAM benchmarks on various hardware. Knowing how other hardware performs greatly helps in judging your own results.

In the end you will probably need to distribute threads not by socket, but by L3 cache.
Quote:
April 6, 2018, 04:57 |
|
#5 | ||||
Member
Paul Palladium
Join Date: Jan 2016
Posts: 93
Rep Power: 10 |
Quote:
As you advised, we ran the benchmark based on the motorbike tutorial, with the following results (without mpirun options such as binding or cpu-set): Quote:
Using all 64 cores, the performance is worse than with 32 cores (I guess due to the shared FPUs). We ran the benchmark with numactl and --cpunodebind (if we are right, the following command forces the distribution of 16 threads onto nodes 0 and 1, each node having 8 cores): Quote:
Quote:
What are your thoughts about these results?
April 6, 2018, 10:14 |
|
#6 | |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,426
Rep Power: 49 |
I just realized that I (we) overlooked something:
Quote:
Open the case and re-seat your DIMMs according to the recommendations in chapter 2-9 of the manual: http://www.supermicro.com/manuals/mo...G(6)(i)_-F.pdf

From your output of numactl --hardware it even seems like you have a total of 272GB of RAM installed. If those are all 16GB DIMMs, leave out the additional 17th DIMM.
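The "17th DIMM" follows from simple division of the reported total by the module size. A quick check, using only the figures mentioned in this thread (272 GB reported, 16 GB modules, 16 modules in the parts list); the helper name is illustrative:

```python
def dimm_count(total_gb, dimm_gb=16):
    """Return (number of whole DIMMs, leftover GB) for a reported RAM total."""
    return divmod(total_gb, dimm_gb)


reported_gb = 272     # total suggested by the numactl --hardware output
expected_dimms = 16   # 16 x 16 GB from the parts list in post #1

count, leftover = dimm_count(reported_gb)
print(f"{count} DIMMs of 16 GB, {leftover} GB left over, "
      f"{count - expected_dimms} more than the parts list")
```

An uneven number of modules means at least one memory controller is populated differently from the others, which is why re-seating to a symmetric layout is worth trying first.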
April 6, 2018, 13:36 |
|
#7 | |
Member
Paul Palladium
Join Date: Jan 2016
Posts: 93
Rep Power: 10 |
Quote:
It is clear that this could really slow down the workstation. We decided to remove CPU n°3 and work with 3 CPUs (48 cores). We don't know if it's really advisable to work with only 3 CPUs on a quad-socket motherboard. Here are the results of the motorbike benchmark:

3 x Opteron 6380
#cores   #wall time (sec)
24       135 without mpirun option // 106 with --cpu-set and bind-to core
48       91 with cpu-set

So the performance is better than with four sockets, but not really fantastic. We might expect about 75 sec with 64 cores if we suppose that the problem came from the bad DIMM slot. Maybe the results are better with 3 CPUs just because there is less communication between CPUs.

If we go back to the 64-core configuration (with 3 DIMMs on the third CPU), we get the following results:

#cores   #binding                     #wall time (sec)
4        numactl node 6 and 7         540
4        numactl node 4 and 5         570
4        numactl node 2 and 3         780
4        numactl node 0 and 1         600
4        mpirun without any option    474

So clearly there is a problem with nodes 2 and 3... but we thought nodes 2 and 3 were associated with CPU 2 and not CPU n°3! Something is weird here. So maybe there is a problem with the Supermicro motherboard (we bought it on eBay)! Or maybe quad-socket systems are definitely not a good option for CFD.

Last edited by Fauster; April 10, 2018 at 07:13.
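To see how far nodes 2 and 3 stand out, the per-node timings above can be compared against their median, and the 3-CPU result can be extrapolated linearly to 64 cores. This is only a sketch with the numbers from this post; linear scaling from 48 to 64 cores is an optimistic assumption, not a measurement.

```python
from statistics import median

# Wall times (seconds) for the 4-core runs pinned to each pair of NUMA nodes.
node_pair_times = {"0-1": 600, "2-3": 780, "4-5": 570, "6-7": 540}

typical = median(node_pair_times.values())
for pair, seconds in sorted(node_pair_times.items()):
    print(f"nodes {pair}: {seconds} s, {seconds / typical:.2f}x the median")

# Ideal (linear) extrapolation of the 48-core, 3-CPU result to 64 cores.
ideal_64 = 91 * 48 / 64
print(f"ideal 64-core wall time: {ideal_64:.1f} s")
```

Nodes 2 and 3 come out about a third slower than the median of the four pairs, while the other pairs sit within a few percent of it. The linear extrapolation gives about 68 sec, so the 75 sec guess already allows for some scaling loss.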
June 4, 2018, 09:02 |
The abysmal parallel performance in quad socket server for OpenFOAM
|
#8 |
New Member
sugumar
Join Date: Jun 2018
Posts: 1
Rep Power: 0 |
Hi Paul Palladium, have you found a solution to the problem you were facing? I am facing the same issue.
June 4, 2018, 10:51 |
|
#9 | |
Member
Paul Palladium
Join Date: Jan 2016
Posts: 93
Rep Power: 10 |
Quote:
I don't know if we are facing a problem of inter-socket or RAM communication. How is your performance on the motorbike benchmark?
Paul