
HP DL380p G8 - Memory bandwidth drops dramatically at 64GB?


December 5, 2020, 16:09   #1
Kailee (Member)
Hi all,

I have an HP DL380p Gen8 with 2x E5-2630 v1 and 16x 8GB DIMMs (DDR3-1333, 2Rx4), running Ubuntu 20.04. The funny thing: when I run the motorcycle benchmark on an otherwise idle machine with empty RAM, so that the benchmark lands in the first 64GB, the 12 cores finish snappyHexMesh in 410s and the simulation in 149s. When I fill up enough memory that the benchmark runs in the top 64GB (machine otherwise still idle), performance drops dramatically: SHM 477s, sim 296s. The runs are literally back-to-back without a reboot; all that happens in between is the creation of a 64GB ramdisk and a urandom-filled file on it to occupy the first 64GB...
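For reference, the fill step was essentially the following (mount point and file name here are placeholders, not my exact commands):

Code:
# create a 64GB tmpfs ramdisk and fill it from /dev/urandom
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
sudo dd if=/dev/urandom of=/mnt/ramdisk/fill.bin bs=1M count=65536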

I was doing this because I had noticed the same behaviour under ESXi whenever other VMs filled the lower half of the RAM, and I wanted to check whether the hypervisor was to blame - but bare-metal Ubuntu shows the same issue.

Is there anything obvious I'm missing here? Or is this even expected behaviour?

Thanks for any pointers,

Kai.

December 5, 2020, 16:32   #2
Alex (flotus1, Super Moderator)
My first guess would be: NUMA strikes again.
The way you fill up the "lower" half of the memory matters. From the results you are getting, I would assume you filled 64GB of RAM mostly within one of the two NUMA nodes. This forces the system to allocate all the memory for your simulation on the other node, cutting memory bandwidth in half and introducing a huge latency penalty.
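You can check this directly. With the numactl package installed (these are standard Linux tools, nothing OpenFOAM-specific):

Code:
numactl --hardware   # lists each NUMA node with its total and free memory
numastat -m          # per-node breakdown of memory usage

Run them right after filling your 64GB - if one node shows almost no free memory, that's the culprit.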

December 5, 2020, 16:55   #3
Kailee (Member)
Hi Flotus,

Thanks for your reply - my memory is distributed evenly between the two nodes, eight sticks each. However, I just tried enabling node interleaving in the BIOS, and while it decreases overall performance by about 7% (SHM 435s, -6%; sim 162s, -9%), the results are now unaffected by "where" in memory the benchmark runs.

Is there anything else I need to check on?

Thanks,


Kailee.

December 5, 2020, 18:31   #4
Alex (flotus1, Super Moderator)
It has nothing to do with how the memory modules are physically distributed across the nodes; what I wrote applies to a perfectly balanced memory configuration too. That said, populating 8 DIMMs per CPU still does not guarantee a balanced configuration on this system, since it has a total of 24 DIMM slots. So it might still be worth cracking open the manual to double-check.

Let me try again:
Your system has two NUMA nodes unless you enable node interleaving in the BIOS.
The method you use to fill 64GB of RAM most likely allocates all of it on a single NUMA node - call it node 0 for simplicity. When you then start meshing and simulation, the OS allocates all the memory these processes need on the other, empty NUMA node 1. So every core on the CPU belonging to node 0 has to access its memory across the QPI link to node 1, which is extremely slow compared to accessing memory locally.
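If you want to see how big that penalty is, you can provoke the worst case deliberately (numactl is a standard tool; the benchmark name is a placeholder):

Code:
# all cores on node 0, all memory forced onto node 1 - everything crosses QPI
numactl --cpunodebind=0 --membind=1 ./benchmark
# for comparison: same cores, node-local memory
numactl --cpunodebind=0 --membind=0 ./benchmark

The difference between those two runs is roughly the penalty you are paying right now.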

December 5, 2020, 19:13   #5
Kailee (Member)
Hi again,

Thank you for your input. What you say makes a lot of sense, especially since enabling node interleaving mitigates the problem. The manual had indeed already been cracked open and followed, as had HP's DDR3_Memory_Tool_c03293145.pdf, which has a lot of information on memory configurations for these machines:

[Attachment: screenie.jpg - DIMM population table from HP's memory configuration guide]

My sticks are in slots A-H on each processor, which I would have thought puts the memory evenly between the two. The one paragraph that discusses node interleaving wasn't much help - it said pretty much what you did, but offered no way to alleviate the problem other than enabling node interleaving, with its inherent cost.

Would pinning individual threads to CPUs help here, or do I have to somehow tell the mesher and solver to be NUMA-aware (I assumed they already were)?

Kai.

December 5, 2020, 20:05   #6
Alex (flotus1, Super Moderator)
OpenFOAM, with its MPI + domain-decomposition parallelization, is already NUMA-friendly.
But with all memory on one NUMA node already occupied, the operating system cannot allocate additional memory in a NUMA-friendly manner - at least not initially, and not with default settings. Things get a little complicated here, and I don't know enough about it.
Anyway, what you see is not the fault of OpenFOAM, but the responsibility of the user. If you really need another program running that eats up half of the system memory, there is not a lot you can do. What would help is getting that program to fill both NUMA nodes equally; how you do that - or whether it is even possible - depends on the program.
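If the program in question is launched from a shell, the generic approach would be to start it under an interleaved memory policy, so its pages get spread round-robin across both nodes (the program name below is a placeholder):

Code:
# spread the memory hog's allocations evenly across all NUMA nodes
numactl --interleave=all ./memory_hog

That leaves both nodes half full instead of one node completely full, and OpenFOAM can then allocate node-local memory on both.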

December 5, 2020, 20:19   #7
Kailee (Member)
I didn't mean to imply any fault at all - my apologies if it came across that way.

In fact it's more likely that the RAM disk I set up got pinned to one node and filled that node's memory rather than distributing it, leaving OpenFOAM no choice but to initially occupy only the other node's memory. The same likely happens under ESXi... I think that's what you were trying to explain, right?
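If so, it looks like tmpfs can be told at mount time to interleave its pages across the nodes, which should leave both halves equally full (untested on my side; mount point is a placeholder):

Code:
# mpol=interleave spreads the ramdisk's pages round-robin across NUMA nodes
sudo mount -t tmpfs -o size=64g,mpol=interleave tmpfs /mnt/ramdisk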

Either way - I'll just use node interleaving for now. Still a very cost-effective learning machine at under 250 Euro...

Thanks again for your input!

Kai.
