
Abysmal performance of 64-core Opteron-based workstation for CFD

Old   April 5, 2018, 08:16
Default Abysmal performance of 64-core Opteron-based workstation for CFD
  #1
Member
 
Paul Palladium
Join Date: Jan 2016
Posts: 93
Dear All,

Recently we decided to build a 64-core Opteron (6380) workstation for CFD (OpenFOAM). The motherboard and the Opterons are not new (but in excellent condition). We decided to limit the number of new components to reduce the price of the workstation (mainly because it's just a test).

The configuration is the following:
Quad-socket motherboard Supermicro H8QGi+-F
4 x AMD Opteron 6380 (@ 2.5 GHz)
16 x 16 GB RAM DDR3 1600 MHz, 1.5 V
Ubuntu 16.04 LTS
OpenFOAM v1712 (compiled from source: https://openfoam.com/download/install-source.php)

At the beginning we ran the same benchmark (structured grids) as here: https://www.pugetsystems.com/labs/hp...d-Opteron-587/
We obtained very good results: on 64 cores the run time was 197 sec, and the scaling was almost perfect up to 32 cores (speed is linear with the number of cores).
#cores #runtime (sec)
64 197
32 301
16 606
8 1230
On a 10-core Xeon workstation (E5-2687W v3) we obtained:
#cores #runtime (sec)
8 1100

After seeing these results we were pretty enthusiastic; the performance of the Opteron workstation is not bad.

The problems came up when we decided to test the performance on unstructured grids (with the simpleFoam and interFoam solvers). The scalability was horrible: with 64 cores the run is only twice as fast as with... 8 cores, and we gain just 20% going from 32 to 64 cores. (We used the scotch decomposition method to minimize communication between processors.) This is so abysmal that the little 10-core workstation runs faster than the 64-core Opteron!
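For reference, the parallel setup is nothing exotic, just the standard workflow (a rough sketch, not our exact dictionaries or case):

Code:
# system/decomposeParDict contains: numberOfSubdomains 64; method scotch;
decomposePar
mpirun -np 64 simpleFoam -parallel > log.simpleFoam 2>&1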

We have read many posts on the forum:
AMD FX 8-core or 6-core?
4 cpu motherboard for CFD
HPC server (AMD) to 40 million cells - How cores do I need? Where I can buy it?
mpirun, best parameters

It seems that indeed there is a problem with quad-socket Opterons and unstructured CFD, so we tried the solution proposed in this link:
https://www.anandtech.com/show/4486/...mark-session/5

We have enabled node interleaving in the motherboard BIOS (and also tried the settings described here: http://hamelot.io/programming/the-im...compute-nodes/), but unfortunately we didn't observe any change in performance.

We also tried mpirun options like --bind-to none, --bind-to socket, etc., without success.

Maybe someone here has an idea of how to diagnose and improve the performance of this workstation (a BIOS problem? a RAM problem? an OpenFOAM compilation problem?...). Of course we knew that we couldn't get performance similar to a true 64-core Xeon cluster, but we expected something similar to a ~30/40-core Xeon workstation, not an 8-core one!

Thanks in advance for helping us!

F

Old   April 5, 2018, 09:23
Default
  #2
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
I never had the chance to play with this architecture, so I might not be of much help here.
But my 2 cents are: enabling node interleaving for a NUMA-aware application (like OpenFOAM) is counterproductive. I would only enable channel and rank interleaving. Node interleaving would increase latency and decrease bandwidth compared to NUMA, where each processor has direct access to its own memory pool.
I think binding processes to the correct cores is the key to success here. I.e. spreading them out as evenly as possible across all 4 nodes and making sure the processes stay on the same core. Further improvements might be possible by having processes with high mutual communication on the same node instead of having them transfer data between nodes. But achieving this in a repeatable and easy-to-use fashion is beyond my capabilities.
Edit: thinking about this again: "--bind-to core" combined with "--map-by ppr:N:socket" should get you pretty far. Here N is the number of threads per socket, in your case the total number of threads divided by 4.
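Something along these lines (just a sketch assuming a reasonably recent Open MPI; adjust -np and the ppr count to the number of ranks you actually run):

Code:
# 32 ranks spread evenly over the 4 sockets (8 per socket), each pinned to its own core
mpirun -np 32 --map-by ppr:8:socket --bind-to core simpleFoam -parallel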

Then again, intra-node latencies might also play a role here, and I know nothing about this architecture. All I know is that these "16-core" CPUs actually only have 8 FPUs. This might be something you should take into account when distributing processes.

How is the performance with NUMA and all threads running on only one node?

Last edited by flotus1; April 5, 2018 at 12:02.

Old   April 5, 2018, 12:45
Default
  #3
Member
 
Paul Palladium
Join Date: Jan 2016
Posts: 93
Dear flotus1,

Thank you very much for helping us. We have disabled node interleaving; channel and rank interleaving were already enabled.

We ran some quick tests on a flow around a hull (4 M cells) using only 32 cores (because, as you said, there is only one FPU for every two cores).

We compared the time to complete the first 10 iterations, with the following results:
#mpirun options (32 cores)                      #run time for the first 10 iterations (sec)
default options                                 193
--bind-to-core                                  360
--bind-to-socket                                196
--bind-to-core -cpu-set 1,3,5,...,63            168.1 (slight improvement)
--bind-to-core -cpu-set 1,3,5,...,63 + numactl  168.1
numactl only                                    185

Then the average time between iterations is (in the best case) around 15 sec.

The problem here is that the performance is really bad. With 8 Xeon cores (E5-2687W v3), after the first iterations (which are very slow) the average time between two iterations is about 20 sec. So 32 Opteron cores are only about 20% faster than 8 Xeon cores.

Quote:
How is the performance with NUMA and all threads running on only one node?
I am not sure I understand your question. Do you mean we should try to run our case using

Quote:
mpirun -np 32 numactl --physcpubind=0 interFoam -parallel
?

Here is the result of
Quote:
numactl --hardware
Quote:
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32188 MB
node 0 free: 30857 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32254 MB
node 1 free: 30921 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16126 MB
node 2 free: 14914 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32254 MB
node 3 free: 30829 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 48382 MB
node 4 free: 46804 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32254 MB
node 5 free: 30422 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 48382 MB
node 6 free: 46504 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32232 MB
node 7 free: 30904 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 22 16 22 16 22
1: 16 10 22 16 22 16 22 16
2: 16 22 10 16 16 22 16 22
3: 22 16 16 10 22 16 22 16
4: 16 22 16 22 10 16 16 22
5: 22 16 22 16 16 10 22 16
6: 16 22 16 22 16 22 10 16
7: 22 16 22 16 22 16 16 10
You will find attached the result of
Quote:
hwloc-ls
Thank you so much for helping us!

F
Attached Images
File Type: jpg hwloc-ls.jpg (137.7 KB, 25 views)

Old   April 5, 2018, 17:20
Default
  #4
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Ok it is worse than I thought. Those segmented caches mess things up even more. I did not expect 8 NUMA nodes on this system.

But to get a grip on the problem, I think you should first establish a baseline: single-core performance. Run the case on one thread with --bind-to core. This will help you estimate whether the problem has to do with parallelization and thread mapping or with something totally different.
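One way to get that baseline without MPI getting in the way (a sketch, using numactl since you already have it):

Code:
# serial run of the case, pinned to a single core with memory allocated from its local node
numactl --physcpubind=0 --localalloc simpleFoam > log.serial 2>&1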
It might also be a good idea to use this benchmark instead: OpenFOAM benchmarks on various hardware
Knowing how other hardware performs greatly helps judging your own results.

In the end you will probably need to distribute threads not by socket, but by L3 cache.
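With Open MPI that would be something like this (a sketch; assumes your mpirun supports mapping by cache level, which hwloc-aware builds should):

Code:
# distribute ranks round-robin over L3-cache domains instead of sockets, then pin each rank to a core
mpirun -np 32 --map-by l3cache --bind-to core simpleFoam -parallel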
Quote:
I am not sure to understand your question. Do you mean we should try to run our case by using
I meant binding all threads to only one CPU. But without oversubscribing, so maximum 16 threads. This could shed some light on the question whether you are dealing with inter-socket communication issues or not.
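For example (a sketch; this assumes NUMA nodes 0 and 1 both belong to the first physical CPU, which your hwloc-ls output should confirm):

Code:
# 16 ranks confined to one socket (NUMA nodes 0+1), with memory taken from the same nodes
mpirun -np 16 numactl --cpunodebind=0,1 --membind=0,1 simpleFoam -parallel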

Old   April 6, 2018, 04:57
Default
  #5
Member
 
Paul Palladium
Join Date: Jan 2016
Posts: 93
Quote:
Originally Posted by flotus1 View Post
Ok it is worse than I thought. Those segmented caches mess things up even more. I did not expect 8 NUMA nodes on this system.

But to get a grip on the problem, I think you should first establish a baseline: single-core performance. Run the case on one thread with --bind-to core. This will help you estimate whether the problem has to do with parallelization and thread mapping or with something totally different.
It might also be a good idea to use this benchmark instead: OpenFOAM benchmarks on various hardware
Knowing how other hardware performs greatly helps judging your own results.

In the end you will probably need to distribute threads not by socket, but by L3 cache.

I meant binding all threads to only one CPU. But without oversubscribing, so maximum 16 threads. This could shed some light on the question whether you are dealing with inter-socket communication issues or not.
Dear flotus1,
As you advised, we ran the benchmark based on the motorbike tutorial, with the following results (without any mpirun options such as binding or cpu-set):
Quote:
#cores #Wall time (sec)
4 474
8 237
16 147
32 112.9
64 119.3
As you can see, the scaling is nice between 4 and 8 cores. Between 8 and 16 cores the speedup is about 1.6, and between 16 and 32 cores it is about 1.3. Thus, we only get a speedup of about 2 between 8 and 32 cores! It seems we are clearly dealing with a node-communication problem?

Using all 64 cores the performance is worse than with 32 cores (I guess due to the shared FPUs).
We ran the benchmark with numactl and --cpunodebind (if we are right, the following command forces the distribution of 16 threads onto nodes 0 and 1, each node having 8 cores):

Quote:
mpirun -np 16 numactl --cpunodebind=0,1 simpleFoam -parallel
We obtained the following results:
Quote:
#cores #Wall time (sec)
16 323
The speed is about half of what we get with 16 cores distributed over the 8 NUMA nodes. Again, this has to be linked to the shared FPUs.

What are your thoughts about these results?

Old   April 6, 2018, 10:14
Default
  #6
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
I just realized that I (we) overlooked something:
Quote:
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32188 MB
node 0 free: 30857 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32254 MB
node 1 free: 30921 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16126 MB
node 2 free: 14914 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32254 MB
node 3 free: 30829 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 48382 MB
node 4 free: 46804 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32254 MB
node 5 free: 30422 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 48382 MB
node 6 free: 46504 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32232 MB
node 7 free: 30904 MB
Memory population is off. Each node should have 32GB of RAM for optimal performance.
Open the case and re-seat your DIMMs according to the recommendations in chapter 2-9 in the manual http://www.supermicro.com/manuals/mo...G(6)(i)_-F.pdf
From your output of numactl --hardware it even seems like you have a total of 272 GB of RAM installed. If those are all 16 GB DIMMs, leave out the additional 17th DIMM.
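To see what is actually installed in which slot before opening the case, something like this should do (a sketch; needs root):

Code:
# list every DIMM slot with its location, size and speed
sudo dmidecode --type memory | grep -E "Locator|Size|Speed"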

Old   April 6, 2018, 13:36
Default
  #7
Member
 
Paul Palladium
Join Date: Jan 2016
Posts: 93
Quote:
Originally Posted by flotus1 View Post
I just realized that I (we) overlooked something:


Memory population is off. Each node should have 32GB of RAM for optimal performance.
Open the case and re-seat your DIMMs according to the recommendations in chapter 2-9 in the manual http://www.supermicro.com/manuals/mo...G(6)(i)_-F.pdf
From your output of numactl --hardware it even seems like you have a total of 272 GB of RAM installed. If those are all 16 GB DIMMs, leave out the additional 17th DIMM.
Yes, you are definitely right. After investigating, it appears that one memory-channel slot of CPU n°3 is not working. So at best we can put 4 DIMMs on each socket except socket n°3, where we can only put 3 DIMMs. One NUMA node of CPU n°3 therefore has only a single 16 GB DIMM.

It is clear that this could really slow down the workstation. We decided to remove CPU n°3 and work with 3 CPUs (48 cores). We don't know if it's really advisable to work with only 3 CPUs on a quad-socket motherboard. Here are the results of the motorbike benchmark:

3 x Opteron 6380
#cores  #Wall time (sec)
24      135 (no mpirun options) / 106 (with --cpu-set and bind-to core)
48      91 (with --cpu-set)

So the performance is better than with four sockets, but not really fantastic. We could expect maybe 75 sec with 64 cores if we suppose that the problem comes from the bad DIMM slot. Maybe the results are better with 3 CPUs just because there is less communication between CPUs.

If we go back to the 64-core configuration (with 3 DIMMs on the third CPU) we get the following results:

#cores #Wall time
4 numactl node 6 and 7 --> 540 sec
4 numactl node 4 and 5 --> 570 sec
4 numactl node 2 and 3 --> 780 sec
4 numactl node 0 and 1 --> 600 sec
4 mpirun without any option --> 474 sec
So clearly there is a problem with nodes 2 and 3... but we thought nodes 2 and 3 were associated with CPU n°2 and not CPU n°3! Something is weird here.
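One thing we still want to verify is whether the NUMA node numbering really follows the socket numbering, since the firmware may enumerate them differently. hwloc (which we already used for the attached hwloc-ls output) can show which NUMA nodes sit in which physical package, for example:

Code:
# show packages, NUMA nodes, caches and cores without the I/O clutter
lstopo --no-io --of console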

So maybe there is a problem with the Supermicro motherboard (we bought it on eBay)! Or maybe quad-socket Opterons are simply not a good option for CFD.

Last edited by Fauster; April 10, 2018 at 07:13.

Old   June 4, 2018, 09:02
Default The abysmal parallel performance in a quad-socket server for OpenFOAM
  #8
New Member
 
sugumar
Join Date: Jun 2018
Posts: 1
Hi Paul Palladium, have you found a solution to the problem you were facing? I am facing the same issue.

Old   June 4, 2018, 10:51
Default
  #9
Member
 
Paul Palladium
Join Date: Jan 2016
Posts: 93
Quote:
Hi Paul Palladium, have you found a solution to the problem you were facing? I am facing the same issue.
Unfortunately no... The motherboard is somewhat damaged because not all of the slots are working. We can't use all 16 "A" slots when running with 16 x 16 GB RAM, so I thought that was the cause of the problem. I tried with only two CPUs and an appropriate RAM configuration, and the performance is still very bad.

I don't know if we are facing an inter-socket communication problem or a RAM problem. How is your performance on the motorbike benchmark?

Paul