HP DL580 servers: How to get the most out of my 160 vCPUs!

#1 - mcneilyo (Neil) - October 24, 2016, 02:34
Hi,
After many months of research and about 18 months spending money on EC2 instances, I took the plunge and bought two refurbished HP DL580 servers (for an excellent price, I might add).

I've been running many instances of a FORTRAN executable (MODFLOW) in parallel to undertake Monte Carlo uncertainty analysis, and I'm seeing very slow runtimes - much slower than anticipated.

Here's my setup:
2x HP DL580 servers, with the following in each:
2x Windows 10 Pro virtual machines (running on a hypervisor)
4x Intel® Xeon® Processor E7-4870 (30M Cache, 2.40 GHz, 6.40 GT/s Intel® QPI)
256GB RAM (32x 8GB PC3-10600R)
P410i raid controller
8x 480GB SSD (RAID 0 stripe)

Running a single model takes about 1 hour, which I assume corresponds to roughly 2.8 GHz with Turbo Boost. When I run 1 model per physical core, run times stretch to about 4 hours, and with 1 model per virtual core they reach about 6 hours.

Where is my bottleneck coming from? Shouldn't I be getting 2.4 GHz from each vCPU, or is it somehow split? Either way, I should be seeing faster run times with 1 model per physical core.

I've tried setting CPU affinity per model, and it did little to speed up the runs. It can't be hard drive speed, as each model writes data at only about 1 MB/s, and each model consumes about 0.5 GB of RAM.
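For anyone wanting to reproduce the affinity experiment, here is a minimal sketch of launching one MODFLOW instance per physical core and pinning it there. It assumes Python 3 with the third-party psutil package; the executable path, model directories, and the hyper-threading core-numbering convention are placeholders, not the exact setup described above.

Code:
# Sketch only: launch one MODFLOW instance per physical core and pin it there.
# Assumes Python 3 with the third-party "psutil" package installed.
# Executable path, model directories, and HT core numbering are placeholders.
import subprocess
import psutil

MODFLOW_EXE = r"C:\models\mf2005.exe"                               # placeholder path
MODEL_DIRS = [rf"C:\runs\realization_{i:03d}" for i in range(20)]   # placeholder dirs

# Assume logical CPUs 0,1 share a physical core, 2,3 the next, and so on,
# so picking every second logical CPU gives one slot per physical core.
physical_slots = list(range(0, psutil.cpu_count(logical=True), 2))

procs = []
for core, workdir in zip(physical_slots, MODEL_DIRS):
    p = subprocess.Popen([MODFLOW_EXE, "model.nam"], cwd=workdir)
    psutil.Process(p.pid).cpu_affinity([core])    # pin this run to one core
    procs.append(p)

for p in procs:
    p.wait()                                      # wait for all runs to finish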

Any advice greatly appreciated!


#2 - mcneilyo (Neil) - October 24, 2016, 02:43
I should add that I saw pretty similar speeds on the EC2 virtual machines, e.g. r3.8xlarge.

To save you a google: in total I have 4 virtual machines, each with 20 physical cores allocated to it (40 vCPUs).

#3 - alpha754293 - November 18, 2016, 14:20
I think that your virtual machines are what's really slowing you down.

I've tried it before as well, and the virtual-machine-to-host translation makes it run VERY, VERY slowly.

Right now, the way I have my systems set up (three Core i7-3930Ks), I just use TeamViewer to log in to them remotely (they all run Windows 7 right now) - so you might want to try that first instead.

Also remember that logical cores aren't as good as physical cores for computationally heavy tasks like CFD.

I have disabled HyperThreading on all of my systems for that reason.

Just some quick thoughts...

#4 - mcneilyo (Neil) - November 29, 2016, 19:47
Thanks for your thoughts. You are definitely correct - i7s are significantly faster.

We're pretty sure we've solved the problem now. It turns out it's probably two things:

1) Heat. We had a very poor ventilation setup in our server cabinet; temperatures in there were getting as high as 40 degrees! We recently installed an industrial exhaust fan, which has pulled temperatures down to 25 degrees, and we're seeing a significant improvement in speeds now. I'm guessing the SSDs were getting too hot?

2) Dated RAID controller. Although our RAID controller is rated for 6 Gb/s, the reviews I've read online suggest it is unlikely to make the most of the SSDs' capabilities. A newer RAID controller should speed things up significantly, so that's the next step.

Cheers

#5 - alpha754293 - November 29, 2016, 23:14
Quote:
Originally Posted by mcneilyo (post #4)
The SSDs generate very little heat. I've got a whole bunch of them in my systems.

It's much more likely that your processors were the culprit rather than the SSDs (unless you were just being sarcastic about the SSDs generating that much heat).

Umm...it depends. The i7s have a higher clock speed, but if you multiply the number of cores by the clock speed, you might be able to do more with your Xeons at the same time: 4 CPUs * 10 cores/CPU * 2.4 GHz = 96 GHz aggregate. Conversely, my three Core i7s (two of them run with 6 cores; one runs with only 4 because I burned out a core and dropped it down to 4 active cores) give 16 cores * 3.5 GHz = 56 GHz aggregate.

So it depends. For single-threaded tasks, yes, my Core i7s would likely be faster than your Xeons. But if you have a task that you can parallelise over 40 cores, the Xeons will win in overall computational power.

re: RAID
Umm...it depends. A single SSD can fairly readily max out a SATA or SAS 6 Gbps connection, and it doesn't take many of them to saturate even a PCIe 3.0 x4 or x8 bus (on an x4 link, roughly 5 SSDs will do it; on x8, about 10).

So it depends on how much data your simulations generate (whether you keep it or it's temporary) and whether you can write all of that data quickly enough - in which case a single PCIe SSD per system might be sufficient.

In some of my testing with ANSYS Mechanical using the direct solver (which produces the largest amount of writes), there wasn't much difference between a single SATA 6 Gbps SSD and a PCIe 3.0 x4 SSD.

So...again it depends.

But like I said, I wouldn't use VMs if I were you. Of course, how you ultimately configure your system is entirely up to you, but based on my own experience, the translation required to run the computations in a VM puts significant overhead (something like 90%) on the run/simulation.

I only ran a VM to test how to set up an MPI environment (multiple nodes), and I used a very small test case since I was just learning how to configure it. Once I have the deployment process worked out, I can apply it on native hardware and run everything at full speed.

Just my $.02.

#6 - mcneilyo (Neil) - October 24, 2017, 01:28
I'm back... Thanks for your 2c, it's appreciated.

I decided to buy the newest Threadripper 1950X (32 logical threads) to hopefully replace these slow E7-4870s. It's been overclocked to a 4.0 GHz base speed (liquid cooled) and runs with direct SATA connections to 6 SSDs/HDDs, all in native Windows 10.

So far I've found it's not as fast as I had hoped. Running 20 models spread evenly across the system takes around 3x longer than running a single instance on an old i7. I don't think hard drive writing is the issue now; write speeds are relatively low, which is understandable considering the equations are not being solved particularly quickly, and models running on the SSDs solve in a similar runtime to those on the HDDs.

I tried a myriad of things to speed things up on the old E7-4870s (e.g. CPU affinity settings), so I will give those a shot here and report back. I was also wondering if it's the model executable itself - maybe the solver is sharing resources. It's Intel-compiled; I'll ask the developer.

#7 - alpha754293 - October 24, 2017, 07:21
Quote:
Originally Posted by mcneilyo (post #6)
Not that surprising. The newest AMD Threadripper, if I understood it correctly, still uses a single FPU shared between the two "cores" in one "module", and as a result its maximum computational throughput is still only about that of a mid-range Core i7 by now. I don't know WHAT the product development people at AMD were thinking. The point of a CPU, at its heart, is to do computations, and by putting only one FPU per two ALUs they've severely limited its capability to do that. I don't know if Threadripper has fixed that or not; from the benchmark results, it doesn't look like it.

As I mentioned before, for your old 4870s, try running them natively (no VMs).

Oh, BTW, Windows 10 is about 2% slower (on average) than Windows 7 for computationally intensive workloads in my testing. That's generally not significant enough to notice, but it is there.

Also, the problem I had with Windows 10 was that I couldn't completely turn off updates, so some would still sneak through and reboot the system, causing me to lose work. That's another reason why I stayed with Win7 over Win10.

Also check to make sure that Hyper-Threading is off. Most computationally intensive workloads don't like the competition for FPU resources when HT is enabled.
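If it isn't obvious from the BIOS, one quick way to check from the OS side whether SMT/Hyper-Threading is active is to compare logical and physical core counts. A small sketch, again assuming Python with the psutil package:

Code:
# Sketch: compare logical vs. physical core counts to see if HT/SMT is enabled.
import psutil

logical = psutil.cpu_count(logical=True)
physical = psutil.cpu_count(logical=False)
print(f"logical processors: {logical}, physical cores: {physical}")
print("Hyper-Threading appears to be", "ON" if logical > physical else "OFF")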

#8 - flotus1 (Alex) - October 24, 2017, 07:50
Quote:
Originally Posted by alpha754293 (post #7, claiming Threadripper still shares one FPU between two cores)
AMD dropped the concept of shared execution units with the new "Zen" architecture; they learned their lesson after the disaster with "Bulldozer". I don't know where you got the information about shared FPUs with Zen - it is incorrect.

Edit to add to the original question of this thread:
Parallel efficiencies of <100% even with several single-threaded applications could be a symptom of the following issues:
  • memory bandwidth bottleneck
  • thermal throttling
  • file i/o bottlenecks
  • cache-based execution with one thread vs. memory-based execution with many threads
If the workload requires high memory bandwidth, comparing 20 instances on a 16-core CPU with 4 memory channels ("Threadripper") to a single instance on a 4-core CPU with two memory channels ("old i7") is rather pointless. The same applies to file I/O bottlenecks.
However, I would not expect a TR-1950X to be faster overall than your older setup with 4 Xeon CPUs: that is 16 cores/4 memory channels vs. 40 cores/16 memory channels. Even if the TR cores and memory channels were twice as fast, the quad-Xeon system would still win.
Check whether your Xeon processors actually run at 2.4 GHz under full load.
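One way to probe for the memory-bandwidth symptom listed above, without involving the solver at all, is to time a large array copy while running an increasing number of identical workers; if the aggregate throughput stops growing well before the core count is reached, bandwidth is the limiter. A rough sketch, assuming Python with NumPy (array sizes and worker counts are arbitrary, shrink them if RAM is tight):

Code:
# Rough memory-bandwidth scaling probe: run N concurrent array-copy workers and
# see whether aggregate throughput keeps growing with N. Assumes NumPy installed.
import multiprocessing as mp
import time
import numpy as np

def copy_worker(n_bytes=256 * 1024 * 1024, repeats=5):
    a = np.ones(n_bytes // 8)            # ~256 MB of float64
    b = np.empty_like(a)
    t0 = time.perf_counter()
    for _ in range(repeats):
        np.copyto(b, a)
    dt = time.perf_counter() - t0
    return repeats * a.nbytes * 2 / dt   # bytes read + written per second

def aggregate_bandwidth(n_procs):
    with mp.Pool(n_procs) as pool:
        results = pool.starmap(copy_worker, [()] * n_procs)
    return sum(results) / 1e9            # rough GB/s across all workers

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} workers: ~{aggregate_bandwidth(n):.1f} GB/s aggregate")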

#9 - alpha754293 - October 24, 2017, 09:07
Quote:
Originally Posted by flotus1 (post #8)
Whoops! My bad.

Thank you for that.

That said, it's a shame that their FPU performance still isn't able to beat Intel's.

Bummer.

#10 - mcneilyo (Neil) - November 14, 2017, 21:26
Thanks for the input. I think we've finally found the bottleneck: memory channels. A quick test shows that running 2 models on a 2-channel i7 runs optimally, while running 3-4 models roughly halves the speed of each model.

Back to the software: I'm running MODFLOW, which was compiled with Intel Fortran (old school). It runs on ONE thread. Hyper-Threading yields no improvement, and GPU parallelisation using developer versions is limited (restricted to 1 model run per GPU at the moment).

The Xeons are OK, but their clock speed makes them comparatively slow, even when limiting runs to physical cores or to the available memory channels. EDIT - yes, they seem to be running at 2.4 GHz under full load. I still haven't removed the virtualization, so that will be the last step.

Now that I know memory channels are the bottleneck, this should help me work out the cheapest, fastest setup to consider when expanding the CPU bank. I think the Threadripper is probably an inefficient choice...
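If memory channels are the limiter, a back-of-the-envelope comparison of candidate platforms is just channels x transfer rate x 8 bytes per transfer. A small sketch - the platforms and memory speeds below are illustrative nominal figures, not measurements, and buffered platforms like the E7 deliver less than the nominal DIMM rate in practice:

Code:
# Back-of-the-envelope peak memory bandwidth: channels * MT/s * 8 bytes per transfer.
# Platform/memory figures below are illustrative nominal values, not measurements.
platforms = {
    "dual-channel i7, DDR3-1600":                 (2, 1600),
    "Threadripper 1950X, quad-channel DDR4-2666": (4, 2666),
    "quad E7-4870, 16 channels of DDR3-1333":     (16, 1333),
}

for name, (channels, mega_transfers) in platforms.items():
    gb_per_s = channels * mega_transfers * 8 / 1000   # 8 bytes per 64-bit transfer
    print(f"{name:45s} ~{gb_per_s:6.1f} GB/s nominal peak")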

#11 - alpha754293 - November 14, 2017, 22:19
Quote:
Originally Posted by mcneilyo (post #10)
re: Xeons

The advantage I have found with Xeons is ECC registered RAM, and the fact that on many enterprise server/workstation platforms you CAN'T overclock them like you can consumer-grade i7s, so you get long-term reliability out of them.

Depends on what's important to you - speed vs. reliability.

