
HPC system setup



September 27, 2020, 09:02 | #1
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
I am trying to put together an HPC system on a budget of USD 12,000, with a bias towards AMD EPYC processors.

The configuration I have in mind:

Compute node

Dual AMD EPYC 7452, each with 32 cores, 2.35 GHz, 128 MB L3 cache
256 GB memory (16 x 16 GB) ECC DDR4-3200
240 GB enterprise SSD

Master node

Single AMD EPYC 7252 (Rome), 8 cores, 3.1 GHz
128 GB memory (8 x 16 GB) ECC DDR4-3200
4 x 4 TB SATA enterprise hard disks

CentOS

I intend to start with 1 or 2 compute nodes and add more later as the budget allows.

Due to budget constraints, I plan to use Gigabit Ethernet instead of InfiniBand.



I have a few questions, some specific and some generic:

1. Is Ethernet really a bottleneck if I have to run across two compute nodes?

2. In that case, is it better to go for a workstation if I don't plan to use more than 64 cores for a single run?

3. Should I have an SSD or an HDD on each compute node?

4. Is a single processor on the master node a problem if I just want to launch and manage jobs on the compute nodes?

5. Neglecting the price factor, is one CPU with 64 cores better than two CPUs with 32 cores each?

Any other comments/suggestions will be highly appreciated.

September 27, 2020, 10:15 | #2
Super Moderator

Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
1. Is Ethernet really a bottleneck if I have to run across two compute nodes?
2. In that case, is it better to go for a workstation if I don't plan to use more than 64 cores for a single run?
3. Should I have an SSD or an HDD on each compute node?
4. Is a single processor on the master node a problem if I just want to launch and manage jobs on the compute nodes?
5. Neglecting the price factor, is one CPU with 64 cores better than two CPUs with 32 cores each?
1) That still depends on the application. "HPC" is a pretty large field of computing. But in general, the fewer nodes you need to connect, the smaller the penalty for slower interconnects like Gigabit Ethernet. I would probably try to get at least 10 Gigabit Ethernet; many server boards nowadays have that built in.

2) Once more: depends on the application. If your application is compute-bound and all you need is 64 cores, there is definitely no need to distribute these cores across multiple machines. If your applications are more on the memory-bound side of the fence, splitting the cores across multiple machines/CPUs could increase performance, despite having to deal with node interconnects.

3) If these drives only need to hold the operating system for each node, it doesn't matter too much. If these drives are supposed to double as fast local storage for each node, SSDs are definitely the way to go.

4) Not a problem at all. In fact, if the master node only needs to handle node access and a central storage system, an Epyc CPU and 128GB of RAM are total overkill, and thus a waste of budget.

5) And again: depends on the type of HPC application you want to run. Compute-bound: a single CPU is fine. Memory-bandwidth-bound: multiple CPUs are better, because memory bandwidth is a shared resource that increases with the number of CPUs.
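
To put rough numbers on that, here is a quick Python sketch. These are theoretical peaks only, assuming 8 DDR4-3200 channels per Epyc Rome socket and 8 bytes per transfer; measured STREAM bandwidth will be lower.

Code:
# Theoretical peak memory bandwidth for the two options discussed above.
# Assumes 8 DDR4-3200 channels per EPYC Rome socket, 8 bytes per transfer.

def peak_bw_gbs(channels: int, mt_per_s: int) -> float:
    """Peak memory bandwidth in GB/s for one socket."""
    return channels * mt_per_s * 8 / 1000

configs = {
    "2 x 32-core (e.g. 2 x EPYC 7452)": {"sockets": 2, "cores": 64},
    "1 x 64-core (single socket)":      {"sockets": 1, "cores": 64},
}

for name, cfg in configs.items():
    bw = cfg["sockets"] * peak_bw_gbs(8, 3200)
    cores_per_channel = cfg["cores"] / (cfg["sockets"] * 8)
    print(f"{name}: {bw:.0f} GB/s total, "
          f"{bw / cfg['cores']:.1f} GB/s per core, "
          f"{cores_per_channel:.0f} cores per memory channel")

The dual-socket option ends up with twice the aggregate bandwidth and half the number of cores per memory channel, which is what matters for bandwidth-bound codes.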

September 27, 2020, 10:47 | #3
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
Many thanks for the fast reply.

Very sorry for forgetting the application!
I intend to do LES and eventually DNS of reacting (i.e. combustion) flows.
The software I intend to use is FLUENT (academic license), OpenFOAM, and a Fortran MPI code.


Quote:
Once more: depends on the application. If your application is compute-bound and all you need is 64 cores, there is definitely no need to distribute these cores across multiple machines. If your applications are more on the memory-bound side of the fence, splitting the cores across multiple machines/CPUs could increase performance, despite having to deal with node interconnects.
I think that with 256 GB (4 GB per core) I will be more compute-bound than memory-bound. Please correct me if I am wrong.

Quote:
If these drives only need to hold the operating system for each node, it doesn't matter too much. If these drives are supposed to double as fast local storage for each node, SSDs are definitely the way to go.
In the case of FLUENT, I think I/O goes through the master node only. For OpenFOAM, the compute nodes write locally. But I/O speed is not crucial, I guess. The SSD is meant to safeguard against crashes, compared to an HDD.

Quote:
Not a problem at all. In fact, if the master node only needs to handle node access and a central storage system, an Epyc CPU and 128GB of RAM are total overkill, and thus a waste of budget.

What would be the cheapest alternative?
(in case the master node is doing the I/O)

September 27, 2020, 11:53 | #4
Super Moderator

Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Both Ansys Fluent and OpenFOAM tend to become memory-bandwidth-bound somewhere between 2 and 4 cores per memory channel, in which case two 32-core Epyc CPUs are way better than a single 64-core CPU.
Since I don't know anything about your MPI parallel Fortran code, I can not comment on its computational intensity.
Side-note on computational intensity (roughly: floating point operations per byte moved from and to memory): it does not correlate with the amount of memory allocated per core. Let alone the amount of memory available per core, which is a hardware metric.
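
For reference, the textbook roofline bound behind that side note (standard form, not specific to any of the codes mentioned here) can be written as:

Code:
% Attainable performance P is capped by either peak compute or memory traffic:
P \le \min\left(P_\mathrm{peak},\; I \cdot b_\mathrm{mem}\right),
\qquad
I = \frac{\text{floating point operations}}{\text{bytes moved to/from memory}}

A code with low computational intensity I sits on the sloped, bandwidth-limited part of the roofline no matter how much memory per core is installed.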

Quote:
The SSD is meant to safeguard against crashes, compared to an HDD
Last time I was involved in purchasing a cluster, the vendor recommended HDDs over SSDs as compute-node OS drives, for "increased reliability". But then again, I did not fully trust their expertise. Anyway, the OS should probably be installed on a RAID1 of two drives to avoid downtime, so single-drive reliability is less important.
I don't know about OpenFOAM, but it can surely be configured to write its output to a central storage system. Otherwise, a 240 GB SSD seems way too small.
While we are on the topic of storage: for LES, you definitely want a fast interconnect between the nodes and the storage. Don't go below 10 Gigabit Ethernet, and use a storage solution that can saturate that bandwidth, which is around 1 Gigabyte/s. With only Gigabit Ethernet, frequent writes of result files could slow down the calculation significantly.
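
To put a number on it (a hypothetical 50 GB result file and a 90% link-efficiency assumption, both just illustrative, not taken from your setup):

Code:
# Rough time to move one result file over the node <-> storage link.
# The 50 GB file size and 90% efficiency are assumptions for illustration.

def transfer_time_s(file_gb: float, link_gbit: float, efficiency: float = 0.9) -> float:
    """Seconds to move file_gb gigabytes over a link of link_gbit Gbit/s."""
    link_bytes_per_s = link_gbit * 1e9 / 8 * efficiency
    return file_gb * 1e9 / link_bytes_per_s

for link in (1, 10):
    print(f"{link:>2} GbE: ~{transfer_time_s(50, link):.0f} s per 50 GB file")
# -> roughly 444 s on Gigabit Ethernet vs roughly 44 s on 10 Gigabit Ethernet

With LES writing result files frequently, that difference adds up quickly.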

Quote:
What would be the cheapest alternative?
The cheapest possible solution, especially with only 2 compute nodes: no head node at all. Instead, one of the compute nodes doubles as a head node. Once your cluster grows with increased budget, you can still add a dedicated head node later on. Maybe one that can also handle stuff like meshing and post-processing.

September 28, 2020, 08:45 | #5
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
Alex thanks a lot for your time and sharing from your vast experience!

Quote:
Both Ansys Fluent and OpenFOAM tend to become memory bandwidth bound somewhere between 2-4 cores per memory channel.
Thanks for letting me know this. Then 32 cores is on the higher side for the 8 memory channels of the EPYC 7452, but I guess it will not disappoint.

Quote:
Side-note on computational intensity (roughly: floating point operations per byte moved from and to memory): it does not correlate with the amount of memory allocated per core. Let alone the amount of memory available per core, which is a hardware metric.
I didn't quite get this. Do you mean the variables that my code handles?
Or something else? If so, any literature/links?

Quote:
Last time I was involved in purchasing a cluster, the vendor recommended HDDs over SSDs as compute-node OS drives, for "increased reliability". But then again, I did not fully trust their expertise. Anyway, the OS should probably be installed on a RAID1 of two drives to avoid downtime, so single-drive reliability is less important.
In addition to reliability, a performance improvement in terms of speed was also suggested. After what you mentioned, would it be a good idea to go for 2 x 480 GB SSDs in RAID1 for the OS and 5 x 4 TB SATA HDDs (7.2k rpm, 6 Gbps) for data storage?


Quote:
While we are on the topic of storage: for LES, you definitely want a fast interconnect between the nodes and the storage. Don't go below 10 Gigabit Ethernet, and use a storage solution that can saturate that bandwidth, which is around 1 Gigabyte/s. With only Gigabit Ethernet, frequent writes of result files could slow down the calculation significantly.
Thanks for this valuable suggestion.
And I just found that 10 Gbps hardware is not a big price difference from 1 Gbps.


Quote:
The cheapest possible solution, especially with only 2 compute nodes: no head node at all. Instead, one of the compute nodes doubles as a head node. Once your cluster grows with increased budget, you can still add a dedicated head node later on. Maybe one that can also handle stuff like meshing and post-processing.
OK. In the past I have always used systems with a dedicated head node.
How does it work? Can I submit a job and still use the node?
Can I use PBS, SLURM, etc. on one of the compute nodes?

September 28, 2020, 14:15 | #6
Super Moderator

Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
I didn't quite get this. Do you mean the variables that my code handles?
Or something else? If so, any literature/links?
Search "roofline model" if you want to learn more about it. It was just a comment about the 4GB per core remark.
https://moodle.rrze.uni-erlangen.de/...ew.php?id=7173

Quote:
would it be a good idea to go for 2 x 480 GB SSDs in RAID1 for the OS and 5 x 4 TB SATA HDDs (7.2k rpm, 6 Gbps) for data storage?
For storing the OS on a compute node, ~250GB should be plenty of space.
For central storage, you should really consult with an IT professional, who can also help you set it up. There are many things to consider, like data integrity, redundancy, backups, speed...
On the topic of speed: you won't get the sequential transfer speed needed to saturate 10Gigabit Ethernet out of 5 spinning hard drives. Unless you put them into RAID0. And I can not stress enough how bad of an idea that would be for the only central storage pool in a cluster.
Here is what I did: 6x8TB hard drives in RAID6 for mass storage with low performance. Plus 8TB of NVMe storage without redundancy if higher performance is needed. There are much better ways to do it, that's just all I could do with the restriction I had.
Again: ask an expert on storage solutions. Maybe your server vendor has some hints for you. He really should.

Quote:
OK. In the past I have always used systems with a dedicated head node.
How does it work? Can I submit a job and still use the node? Can I use PBS, SLURM, etc. on one of the compute nodes?
Since it's your cluster, you absolutely can. Restricting access to busy nodes is only necessary on large clusters with many users that can not be trusted.
You can use any queuing system you like. As far as I know, none of these systems has an inherent restriction against computing on the "head" node, i.e. having no dedicated head node at all. Maybe you are thinking of larger commercial and academic clusters, where the head node is excluded from the queuing system.

September 30, 2020, 01:46 | #7
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
Quote:
Search "roofline model" if you want to learn more about it. It was just a comment about the 4GB per core remark.
https://moodle.rrze.uni-erlangen.de/...ew.php?id=7173
Thanks for sharing the material.
I will go through it and see if I can understand it!

Quote:
For storing the OS on a compute node, ~250GB should be plenty of space.
For central storage, you should really consult with an IT professional, who can also help you set it up. There are many things to consider, like data integrity, redundancy, backups, speed...
I think FLUENT requires the OS plus the software to be installed, which is not needed for OpenFOAM. So the space has to account for the application as well.

Quote:
On the topic of speed: you won't get the sequential transfer speed needed to saturate 10Gigabit Ethernet out of 5 spinning hard drives. Unless you put them into RAID0. And I can not stress enough how bad of an idea that would be for the only central storage pool in a cluster.
Here is what I did: 6x8TB hard drives in RAID6 for mass storage with low performance. Plus 8TB of NVMe storage without redundancy if higher performance is needed. There are much better ways to do it, that's just all I could do with the restriction I had.
Many vendors are suggesting SSDs.
In addition to the SSD for the OS and applications, I had planned 5 x 4 TB disks in RAID5 for the central storage.
But I have to rethink and figure out what to do after these suggestions.
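
For quick reference, the usable capacities of the layouts that have come up so far (capacity only; this says nothing about speed, rebuild times, or backups):

Code:
# Usable capacity of the RAID layouts mentioned in this thread.

def usable_tb(n_drives: int, drive_tb: float, level: str) -> float:
    if level == "raid0":
        return n_drives * drive_tb        # striping, no redundancy at all
    if level == "raid1":
        return drive_tb                   # mirrored drives
    if level == "raid5":
        return (n_drives - 1) * drive_tb  # one drive's worth of parity
    if level == "raid6":
        return (n_drives - 2) * drive_tb  # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")

print(usable_tb(5, 4, "raid5"))  # 16 TB from 5 x 4 TB, survives one drive failure
print(usable_tb(6, 8, "raid6"))  # 32 TB from 6 x 8 TB, survives two drive failures
print(usable_tb(5, 4, "raid0"))  # 20 TB, but one failed drive loses the whole pool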


Quote:
You can use any queuing system you like. As far as I know, none of these systems has an inherent restriction against computing on the "head" node, i.e. having no dedicated head node at all. Maybe you are thinking of larger commercial and academic clusters, where the head node is excluded from the queuing system.
Yes, I have been using this type of system at universities.
I plan to use Ganglia / PBS / SLURM.
Also, is it worth paying for an enterprise OS?




Just found this from https://www.ozeninc.com/ansys-system...ents/#tab-id-5


A headless server is a specialized machine meant for the sole purpose of computation. The server form factor, as well as the removed need for graphics capability, allows for maximization of computational ability. This form factor generally requires workstations capable of pre and postprocessing models. While users can manually remote into the machine, copy files over and press solve, the setup and usage of Remote Solve Manager is highly recommended to automate this process. Network speed is an important consideration, especially for transferring large result files.

This looks like another option if I want to restrict myself to one node (i.e. 64 cores) at a time.

September 30, 2020, 12:02 | #8
Super Moderator

Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
A headless server is a specialized machine meant for the sole purpose of computation. The server form factor, as well as the removed need for graphics capability, allows for maximization of computational ability. This form factor generally requires workstations capable of pre and postprocessing models. While users can manually remote into the machine, copy files over and press solve, the setup and usage of Remote Solve Manager is highly recommended to automate this process. Network speed is an important consideration, especially for transferring large result files.

This looks like another option if I want to restrict myself to one node (i.e. 64 cores) at a time.
We are getting into semantics here. Any cluster node without a graphics card could be called a headless server.

Quote:
Also, is it worth paying for an enterprise OS?
That's up to you. What you pay for is support; the OS itself does not make much of a difference.

October 7, 2020, 14:52 | #9
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
I just found that the AMD EPYC 7551 is half the price of the 7542.


Both of them have 32 cores, but:

                     AMD Epyc 7542    AMD Epyc 7551
Frequency            2.35 GHz         2.00 GHz
Turbo (1 core)       3.40 GHz         3.00 GHz
Turbo (all cores)    3.20 GHz         2.55 GHz
Architecture         Rome (Zen 2)     Zen
Memory               DDR4-3200        DDR4-2666
Memory channels      8                8
ECC                  Yes              Yes
L3 cache             128 MB           64 MB
PCIe version         4.0              3.0




So the flip side of the 7551 is lower CPU clock speed, lower RAM speed, and half the L3 cache.


Q. How much of a difference will the L3 cache make?

Q. What is the all-core turbo speed? Do we get that when all cores are loaded heavily?

Finally, considering half the cost for the same number of cores, is it worth switching to the 7551? Or am I missing something?

October 7, 2020, 16:36 | #10
Super Moderator

Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
You can get 1st gen Epyc CPUs with 32 cores for $600 and less on eBay. Retail, not engineering samples.
But there is a good reason why 1st gen suddenly dropped in price when 2nd gen was launched, and prices are still falling: almost nobody wants them, because 2nd gen is just so much better.
1) Instructions per clock are significantly higher, meaning that at the same core frequency, 2nd gen will be faster. How much faster depends on the code.
2) Core clock speeds are higher.
3) Support for faster memory, which definitely impacts the codes you intend to run (see the quick estimate after this list).
4) Last but not least: a simpler NUMA topology.
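
To quantify point 3, a quick sketch (theoretical peaks only, assuming 8 channels per socket and 8 bytes per transfer):

Code:
# Peak memory bandwidth per socket: 1st gen (DDR4-2666) vs 2nd gen (DDR4-3200).

def peak_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000

gen1 = peak_gbs(8, 2666)  # ~171 GB/s
gen2 = peak_gbs(8, 3200)  # ~205 GB/s
print(f"1st gen: {gen1:.0f} GB/s, 2nd gen: {gen2:.0f} GB/s "
      f"(+{100 * (gen2 / gen1 - 1):.0f}% from memory speed alone)")

So even before the IPC and clock speed advantages, 2nd gen starts with roughly 20% more memory bandwidth per socket.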

The last point, the simpler NUMA topology, is the main reason why most people don't want 1st gen any more. For software that isn't NUMA-aware (or for operating systems with an over-zealous scheduler), this can have a huge performance impact.
For your applications, it is not a deal-breaker. In fact, configuring 2nd gen Epyc CPUs in NPS4 mode (which results in 4 NUMA nodes per CPU, just like 1st gen had natively) is recommended for software that uses MPI+DD. The performance increase vs NPS1 can be around 10%.
The downside: operations like e.g. mesh generation can run slower with this many NUMA nodes. I definitely see that on my workstation (2x Epyc 7551). As soon as my grid generator uses more memory than one NUMA node provides, performance drops by about 50%.
It's a tradeoff you need to make: longer mesh generation time for large meshes vs longer simulation run times.

Quote:
Q. How much of a difference will the L3 cache make?
Nearly impossible to tell without profiling, but it certainly contributes to the lead of 2nd gen.

Quote:
Q. What is the all-core turbo speed? Do we get that when all cores are loaded heavily?
In theory: yes.
In practice: the actual clock speed can be higher or lower than that, depending on the type of code being run, the motherboard, BIOS settings, cooling...
I have seen a few reports from people with 2nd gen Epyc CPUs that run faster than the advertised all-core turbo frequency. But don't count on that.
It's not that important anyway, because with 32 cores, memory bandwidth starts to become a limiting factor. So higher clock speeds do not translate 1:1 to higher performance.

You can take a look at the pinned thread in this sub-forum. For the OpenFOAM test case used there, 2nd gen beats 1st gen Epyc by about 25%. Maybe 30% with NPS4.
I know the price difference between the CPUs looks like a lot, but in my opinion, it is worth it when you are not on an ultra-tight budget. Look at it this way: The CPUs may cost 100% more, but the total price increase for a whole workstation or cluster node is much less.

Last edited by flotus1; October 9, 2020 at 14:15.

October 14, 2020, 14:06 | #11
Member
 
SM
Join Date: Dec 2010
Posts: 97
Rep Power: 15
canopus is on a distinguished road
For the 10G interconnect, what type of switch would be better: a managed or an unmanaged one?

What other factors should one consider when choosing a switch?

I assume that the number of ports should equal the number of master + compute nodes. Any suggestions for a budget 8-port switch?
