
HP DL380p G8 - Memory bandwidth drops dramatically at 64GB?


December 5, 2020, 16:09   #1
Kailee (Member)
Hi all,

I have an HP DL380p Gen8 with 2x E5-2630 v1 and 16x 8GB DIMMs (DDR3-1333, 2Rx4), running Ubuntu 20.04. The funny thing: when I run the motorcycle benchmark on an otherwise idle machine with empty RAM, so that the benchmark lands in the first 64GB, the 12 cores finish snappyHexMesh in 410s and the simulation in 149s. When I fill up enough memory that the benchmark runs in the top 64GB (machine otherwise still idle), performance drops dramatically: SHM 477s, sim 296s. The runs are literally back-to-back without a reboot; all that happens in between is the creation of a 64GB ramdisk and a urandom-filled file on it to occupy the first 64GB...
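For reference, the fill step was essentially the following (mount point and file name here are placeholders, not my exact commands):

Code:
# create a 64GB tmpfs ramdisk and fill it from /dev/urandom
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
sudo dd if=/dev/urandom of=/mnt/ramdisk/fill.bin bs=1M count=65536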

I was doing this because I had noticed the same behaviour under ESXi whenever other VMs filled the lower half of the RAM, and I wanted to check whether the hypervisor was to blame - but bare-metal Ubuntu shows the same issue.

Is there anything obvious I'm missing here? Or is this even expected behaviour?

Thanks for any pointers,

Kai.

December 5, 2020, 16:32   #2
Alex (flotus1, Super Moderator)
My first guess would be: NUMA strikes again.
The way you fill up the "lower" half of the memory matters. From the results you are getting, I would assume you filled 64GB of RAM mostly within one of the two NUMA nodes. This forces the system to allocate all the memory for your simulation on the other node, cutting memory bandwidth in half and introducing a huge latency penalty.
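You can check this directly. With the numactl package installed (these are standard Linux tools, nothing OpenFOAM-specific):

Code:
numactl --hardware   # lists each NUMA node with its total and free memory
numastat -m          # per-node breakdown of memory usage

Run them right after filling your 64GB - if one node shows almost no free memory, that's the culprit.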

December 5, 2020, 16:55   #3
Kailee (Member)
Hi Flotus,

Thanks for your reply - my memory is distributed evenly between the two nodes, eight sticks each. However, I just tried enabling node interleaving in the BIOS, and while it decreases overall performance by about 7% (SHM 435s, -6%; sim 162s, -9%), the results are now unaffected by "where" in memory the benchmark runs.

Is there anything else I need to check on?

Thanks,


Kailee.

December 5, 2020, 18:31   #4
Alex (flotus1, Super Moderator)
It has nothing to do with how the memory modules are physically distributed across the nodes; what I wrote applies to a perfectly balanced memory configuration too. That said, populating 8 DIMMs per CPU still does not guarantee a balanced configuration on this system, since it has a total of 24 DIMM slots. So it might still be worth cracking open the manual to double-check.

Let me try again:
Your system has two NUMA nodes unless you enable node interleaving in the BIOS.
The method you use to fill 64GB of RAM most likely allocates all of it on a single NUMA node - call it node 0 for simplicity. When you then start meshing and simulation, the OS allocates all the memory these processes need on the other, empty NUMA node 1. So every core on the CPU belonging to node 0 has to access its memory across the QPI link to node 1, which is extremely slow compared to accessing memory locally.
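If you want to see how big that penalty is, you can provoke the worst case deliberately (numactl is a standard tool; the benchmark name is a placeholder):

Code:
# all cores on node 0, all memory forced onto node 1 - everything crosses QPI
numactl --cpunodebind=0 --membind=1 ./benchmark
# for comparison: same cores, node-local memory
numactl --cpunodebind=0 --membind=0 ./benchmark

The difference between those two runs is roughly the penalty you are paying right now.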

December 5, 2020, 19:13   #5
Kailee (Member)
Hi again,

Thank you for your input. What you say makes a lot of sense, especially since enabling node interleaving mitigates the problem. The manual had indeed already been cracked open and followed, as had HP's DDR3_Memory_Tool_c03293145.pdf, which has a lot of information on memory configurations for these machines:

[Attachment: screenie.jpg - DIMM population table from HP's memory configuration guide]

My sticks are in slots A-H on each processor, which I would have thought puts the memory evenly between the two. The one paragraph that discusses node interleaving wasn't much help - it said pretty much what you did, but offered no way to alleviate the problem other than enabling node interleaving, with its inherent cost.

Would pinning individual threads to CPUs help here, or do I have to somehow tell the mesher and solver to be NUMA-aware (I assumed they already were)?

Kai.

December 5, 2020, 20:05   #6
Alex (flotus1, Super Moderator)
OpenFOAM, with its MPI + domain-decomposition parallelization, is already NUMA-friendly.
But with all memory on one NUMA node already occupied, the operating system cannot allocate additional memory in a NUMA-friendly manner - at least not initially, and not with default settings. Things get a little complicated here, and I don't know enough about it.
Anyway, what you see is not the fault of OpenFOAM, but the responsibility of the user. If you really need another program running that eats up half of the system memory, there is not a lot you can do. What would help is getting that program to fill both NUMA nodes equally; how you do that - or whether it is even possible - depends on the program.
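If the program in question is launched from a shell, the generic approach would be to start it under an interleaved memory policy, so its pages get spread round-robin across both nodes (the program name below is a placeholder):

Code:
# spread the memory hog's allocations evenly across all NUMA nodes
numactl --interleave=all ./memory_hog

That leaves both nodes half full instead of one node completely full, and OpenFOAM can then allocate node-local memory on both.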

December 5, 2020, 20:19   #7
Kailee (Member)
I didn't mean to imply any fault at all - my apologies if it came across that way.

In fact it's more likely that the RAM disk I set up got pinned to one node and filled that node's memory rather than distributing it, leaving OpenFOAM no choice but to initially occupy only the other node's memory. The same likely happens under ESXi... I think that's what you were trying to explain, right?
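If so, it looks like tmpfs can be told at mount time to interleave its pages across the nodes, which should leave both halves equally full (untested on my side; mount point is a placeholder):

Code:
# mpol=interleave spreads the ramdisk's pages round-robin across NUMA nodes
sudo mount -t tmpfs -o size=64g,mpol=interleave tmpfs /mnt/ramdisk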

Either way - I'll just use node interleaving for now. Still a very cost-effective learning machine at under 250 Euro...

Thanks again for your input!

Kai.
