CFD Online Discussion Forums
HPC Cluster Configuration Question (https://www.cfd-online.com/Forums/hardware/206474-hpc-cluster-configuration-question.html)

chuckb September 7, 2018 11:04

HPC Cluster Configuration Question
 
We’re debating the hardware selection for a ~128-core CFD cluster running ANSYS Fluent. Due to Fluent’s licensing scheme we are core-count limited, so the focus is on per-thread performance. Another quirk of the program is that a single GPU is equivalent to a single CPU core from the licensing perspective, i.e. 120 CPU cores + 8 Tesla V100s would be the same license draw as 128 CPU cores. Our CFD problems range from combustion modeling (~6 million cells, 20 gas species, ~27 total equations) to thermal mixing models (~5 million cells, ~9 equations) to isothermal flow (higher cell counts but 6 equations). Looking through published benchmarks, they unfortunately seem heavily focused on CPU-versus-CPU comparisons, but for us a 32-core CPU is not equivalent to a 16-core CPU because it would draw twice as many licenses.
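To make the license math concrete, here is the arithmetic we’re working from (the one-license-per-GPU equivalence is how it was described to us, so treat it as an assumption to verify with ANSYS):

Code:

# Fluent HPC license draw for a mixed CPU/GPU job, assuming each GPU
# consumes exactly one HPC license, the same as one CPU core.
def license_draw(cpu_cores, gpus):
    return cpu_cores + gpus

# 120 CPU cores + 8 GPUs should draw the same as 128 CPU cores:
assert license_draw(120, 8) == license_draw(128, 0) == 128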

In summary our questions are:
  1. Excluding Xeon Platinums, which seem outside our price range, would a Xeon Gold 6144 (3.5 GHz, 8 cores) be the best per-core chip to get? Looking at availability, we may need to consider the 6134 (3.2 GHz, 8 cores) instead.
  2. Assuming we get a 6144 or 6134, what would be a good ratio of CPUs to GPUs for the type of CFD jobs we run? It’s very difficult to find benchmarks comparing the optimum ratio of those two products. The HPC vendors are suggesting cluster nodes with 2 CPUs per GPU, so with a 6144 or 6134 that would be 16 CPU cores per GPU. However, they don’t have any real benchmarking to support that ratio. There are cost limitations here too; we couldn’t fund 128 GPUs.
  3. Is AVX-512 a significant speed-up, i.e. should we focus on products with AVX-512 (which a Xeon Gold 6144 or 6134 would have)?
  4. Is there any benchmarking comparing the K40 to the P100 to the V100? Even at a 2-CPU-to-1-GPU ratio the GPUs quickly become the dominant cost of the entire cluster, and it’s not clear whether the V100 is worth the premium. Also, I have read about issues with RAM size; do we need to pick maximum RAM on the cards to prevent that from becoming a bottleneck?
  5. Is there any kind of benchmarking on Intel Omni-Path fabric? The vendors are generally not recommending Omni-Path for a cluster this small.
Thanks for any advice. Once we get this built, we’d be happy to post some benchmarks.

flotus1 September 8, 2018 09:26

I might not be qualified to answer all of your questions, especially the ones about GPUs, but I can share my opinion on some of them. You should definitely get in touch with ANSYS and your hardware vendor before buying, so you can hold their feet to the fire in case performance is not optimal.


Quote:

Excluding Xeon Platinums, which seem outside our price range, would a Xeon Gold 6144 (3.5 GHz, 8 cores) be the best per-core chip to get? Looking at availability, we may need to consider the 6134 (3.2 GHz, 8 cores) instead.
Xeon Platinum is only an option if you need more than 4 CPUs in a shared-memory system, so Xeon Gold 61xx is the way to go here since you probably want 2 CPUs per node.
The Gold 6144 is a compelling option thanks to its high all-core turbo (4.1 GHz) and high cache per core.
The Gold 6146 might also be an option; it only costs around $300 more. Since it has 12 cores, you could either use it to scale down the number of nodes or run simulations on 8 of its 12 cores for slightly better performance per core.
The 6134 could be used to save some money on CPUs, but given the total cost of the cluster and the licenses, that is probably not a top priority.
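As a side note, the license cap pins down the node count for each CPU choice. A trivial sketch (core counts from Intel's spec sheets; the dual-socket layout and the 128-license cap are taken from your post):

Code:

# Dual-socket nodes a 128-license budget buys, ignoring GPUs.
LICENSES = 128
CPUS_PER_NODE = 2

for cpu, cores in {"Gold 6144": 8, "Gold 6134": 8, "Gold 6146": 12}.items():
    per_node = CPUS_PER_NODE * cores
    nodes = LICENSES // per_node
    print(f"{cpu}: {per_node} cores/node -> {nodes} nodes, "
          f"{nodes * per_node}/{LICENSES} licenses used")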

Quote:

Assuming we get a 6144 or 6134, what would be a good ratio of CPUs to GPUs for the type of CFD jobs we run? It’s very difficult to find benchmarks comparing the optimum ratio of those two products. The HPC vendors are suggesting cluster nodes with 2 CPUs per GPU, so with a 6144 or 6134 that would be 16 CPU cores per GPU. However, they don’t have any real benchmarking to support that ratio. There are cost limitations here too; we couldn’t fund 128 GPUs.
We had some benchmarks posted here recently with up to 4 GPUs in a dual-socket node: https://www.cfd-online.com/Forums/ha...-fluent-3.html
They show no scaling at all for tiny cases, but considerable scaling with 4 GPUs for larger cases.
What you should definitely check first is whether your particular simulations benefit from GPU acceleration at all. Either try it with hardware you already have or contact ANSYS.

Quote:

Is AVX-512 a significant speed-up, i.e. should we focus on products with AVX-512 (which a Xeon Gold 6144 or 6134 would have)?
AVX-512 is not particularly beneficial for these bandwidth-limited calculations. But since all the relevant Skylake-SP CPUs have it, you don't really need to worry about it.
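To put a rough number on it, here is the usual back-of-the-envelope roofline argument. The sustained AVX-512 clock and the FLOP/byte intensity below are assumptions on my part, so substitute measured values for your own solver:

Code:

# Rough roofline estimate for one Gold 6144 socket; all inputs approximate.
mem_bw = 6 * 21.3              # GB/s: six DDR4-2666 channels per socket
cores, fma_units = 8, 2        # Gold 61xx: two AVX-512 FMA units per core
avx512_clock = 3.0             # GHz: assumed sustained AVX-512 clock
flops_per_cycle = fma_units * 8 * 2    # 8 doubles per FMA, 2 FLOPs each

peak = cores * flops_per_cycle * avx512_clock   # GFLOP/s
balance = peak / mem_bw                         # FLOP/byte needed to hit peak
cfd_intensity = 0.1            # FLOP/byte: assumed, typical of sparse solvers

print(f"peak ~{peak:.0f} GFLOP/s needs {balance:.1f} FLOP/byte")
print(f"at ~{cfd_intensity} FLOP/byte the memory bus caps you near "
      f"{mem_bw * cfd_intensity:.0f} GFLOP/s")

With numbers like these the solver is pinned to the memory bus long before the vector units matter, which is why wider SIMD barely moves the needle for implicit CFD.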

Quote:

Is there any benchmarking comparing the K40 to the P100 to the V100? Even at a 2-CPU-to-1-GPU ratio the GPUs quickly become the dominant cost of the entire cluster, and it’s not clear whether the V100 is worth the premium. Also, I have read about issues with RAM size; do we need to pick maximum RAM on the cards to prevent that from becoming a bottleneck?
Concerning the VRAM size: if your model doesn't fit into VRAM, you cannot run the case with GPU acceleration at all. It is not a bottleneck, but simply a no-go.
The K40 is seriously outdated; I don't think anyone still sells it in a new cluster.
Other than that, I have not come across any benchmark comparing the P100 to the V100 for ANSYS Fluent GPU acceleration. Sorry.
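For a first-order VRAM fit check before you buy, something like the sketch below is enough. The bytes-per-cell figure is entirely case-dependent (species count and the coupled solver push it up sharply), so measure it on a case you already run rather than trusting my placeholder:

Code:

# First-order check whether a partitioned case fits across the GPUs' VRAM.
def fits_in_vram(n_cells, bytes_per_cell, vram_gb, n_gpus=1):
    need_gb = n_cells * bytes_per_cell / 1e9
    return need_gb / n_gpus <= vram_gb, need_gb

# Example: 6M cells at an *assumed* 5 KB/cell vs a single 16 GB card.
ok, need = fits_in_vram(6e6, 5e3, vram_gb=16)
print(f"need ~{need:.0f} GB -> {'fits' if ok else 'does not fit'}")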

Quote:

Is there any kind of benchmarking on Intel Omni-Path fabric? The vendors are generally not recommending Omni-Path for a cluster this small.
Apart from Intel's own benchmarking, InfiniBand seems to perform better in every benchmark I have seen, so I would choose IB for the node interconnect. There is a reason it is the de facto standard for HPC clusters.
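If you want to quantify the fabric question instead of trusting vendor slides, the simple latency-plus-bandwidth (Hockney) model is enough for a first look. The numbers below are placeholders; replace them with values measured on loaner hardware, e.g. with the OSU MPI micro-benchmarks:

Code:

# Hockney model: time to move an n-byte message = latency + n / bandwidth.
def msg_time_us(n_bytes, latency_us, bw_gb_s):
    return latency_us + n_bytes / (bw_gb_s * 1e3)   # 1 GB/s = 1e3 bytes/us

# Placeholder fabric: 1.0 us latency, 10 GB/s; typical halo-exchange sizes.
for size in (1_024, 16_384, 262_144):
    print(f"{size:>7} B: {msg_time_us(size, 1.0, 10.0):6.2f} us")

The takeaway is that typical halo exchanges are small and therefore latency-dominated, which is why small-message latency matters more than peak bandwidth when you strong-scale a fixed-size CFD case across many nodes.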

