CFD Online Discussion Forums


hiep.nguyentrong October 14, 2022 03:08

Help for building a WS or mini server (256 physical cores or more)
 
Hi all,
I want to get a workstation or mini server to run OpenFOAM on 256-512 physical cores.
I don't have much experience running in parallel on a cluster (I tested it with a 1 Gbps Ethernet connection and it was about 10 times slower).
Therefore, I would like to build a workstation-like configuration and run it as a single computer. However, the latest CPUs have only 64 cores per socket and at most 2 sockets per motherboard.
Is there any solution to this problem?

If not, a mini server will be fine, but I will need support setting up the server.
My budget is $200k.

I found a configuration on the internet with 4 nodes in 1 rack, each node with 2 CPUs. Do I need to set up a cluster and an InfiniBand connection between the nodes?
Quote:

Gigabyte Servers Gigabyte H262-Z62 - 2U - AMD EPYC - 24x NVMe - 8 x 1GbE LAN ports - 2200W Redundant
Processor 8x AMD EPYC 7763 - 64 Cores 2.45GHz/3.5GHz, 256MB Cache (280Watt)
Memory 64x 32GB 3200MHz DDR4 ECC Registered DIMM Module
M.2 Internal Solid State Drives 4x 2000GB Samsung 970 EVO NVME M.2 SSD
Total Price: £86,728.46
Lease From: £1,834.31 per month over 5 years, or £2,834.29 over 3 years.

flotus1 October 14, 2022 05:12

The only reasonable way to run this many threads is with a cluster. Even if there were CPUs available with 128 or 256 cores, you would not get what you expect. Scaling on such high core counts within a single CPU is pretty flat. I.e. going from 32 cores to 64 cores, you might expect to double the performance. But all you get is maybe a 10-15% increase. Shared CPU resources, especially memory bandwidth, create a bottleneck.
See also: https://www.cfd-online.com/Forums/ha...dware-wip.html
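
To illustrate why the scaling flattens out, here is a crude saturation model (purely illustrative: the per-core bandwidth demand of 5.7 GB/s is an assumed number, not a measurement; 205 GB/s is roughly 8 channels of DDR4-3200):

Code:

def effective_cores(n_cores, bw_per_core_gbs=5.7, socket_bw_gbs=205.0):
    # Throughput of a bandwidth-bound solver saturates once the aggregate
    # demand of all cores exceeds what the socket can deliver.
    demand = n_cores * bw_per_core_gbs
    return min(demand, socket_bw_gbs) / bw_per_core_gbs

for n in (16, 32, 64):
    print(f"{n} cores -> ~{effective_cores(n):.1f} cores' worth of throughput")
# 16 -> 16.0, 32 -> 32.0, 64 -> ~36.0
# i.e. doubling 32 to 64 cores buys only ~12% with these assumed numbers,
# in the same ballpark as the 10-15% above.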

What you need is a cluster made from nodes with 2x32-core CPUs. Maybe 48-core CPUs if you want to increase density at the cost of lower maximum performance. And a shared file system that all nodes can access.
For 512 cores total, that's 8 compute nodes. 10 Gigabit Ethernet CAN work with this relatively low number of nodes, but I would rather opt for a proper HPC interconnect like Infiniband here.
There is just no way around this that would make sense.
The business you buy these computers from will be happy to assist you in choosing the right adapters and a switch for the node interconnect.
Setting that up to run OpenFOAM in parallel requires some research, but it's not rocket science. There are guides online, and you can certainly ask here if you run into problems. Or you can make some room in the $200k budget to hire someone to walk you through the setup.
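
For reference, a minimal sketch of what launching such a multi-node run can look like, assuming OpenMPI, a case directory on the shared file system, placeholder host names (node01..node08), and simpleFoam standing in for whatever solver you actually use:

Code:

# Sketch of launching a 512-rank OpenFOAM run across 8 nodes.
# Assumes the OpenFOAM environment is sourced on all nodes and that
# system/decomposeParDict already sets numberOfSubdomains to 512.
import subprocess
from pathlib import Path

nodes = [f"node{i:02d}" for i in range(1, 9)]   # placeholder host names
cores_per_node = 64                             # 2x 32-core CPUs per node
n_procs = len(nodes) * cores_per_node           # 512 MPI ranks

# OpenMPI host file: one line per node with its slot count
Path("machines").write_text(
    "".join(f"{host} slots={cores_per_node}\n" for host in nodes)
)

subprocess.run(["decomposePar"], check=True)    # split the mesh into 512 parts
subprocess.run(
    ["mpirun", "--hostfile", "machines", "-np", str(n_procs),
     "simpleFoam", "-parallel"],
    check=True,
)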

hiep.nguyentrong October 17, 2022 03:33

Quote:

Originally Posted by flotus1 (Post 837516)
The only reasonable way to run this many threads is with a cluster. [...]

Thank you for the reply,

Before hiring an HPC consultant, I would like a preliminary configuration that I will use, mainly in terms of CPU and memory (I prefer a workstation because it is simple and easy to use).

About the memory bandwidth bottleneck: we are currently using an HPC system with E5-2696 v4 CPUs (76.8 GB/s maximum memory bandwidth and 0.7 TFLOPS FP32 with 20 cores). That means we get about 110 GB/s of bandwidth per TFLOPS.
With the EPYC 7763 (208 GB/s bandwidth and 5 TFLOPS), we have 41.6 GB/s per TFLOPS.
Is this the cause? And what is the optimal amount of memory bandwidth per TFLOPS for CFD?
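
As a quick check of that arithmetic (using the peak figures quoted above, not measured values):

Code:

# Bandwidth per peak FP32 TFLOPS, using the numbers quoted above.
cpus = {
    "Xeon E5-2696 v4": {"mem_bw_gbs": 76.8, "peak_tflops": 0.7},
    "EPYC 7763":       {"mem_bw_gbs": 208.0, "peak_tflops": 5.0},
}
for name, c in cpus.items():
    print(f"{name}: {c['mem_bw_gbs'] / c['peak_tflops']:.1f} GB/s per TFLOPS")
# Xeon E5-2696 v4: 109.7 GB/s per TFLOPS
# EPYC 7763: 41.6 GB/s per TFLOPS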

I have another question: will 10 Gigabit Ethernet (1.25 GB/s data transfer) be the bottleneck when transferring data between nodes?

flotus1 October 17, 2022 05:25

Quote:

I prefer a workstation because it is simple and easy to use
I read that between the lines ;)
But again, it is just not feasible at the moment. Even if we had CPUs with this many cores.
The next generation of Epyc CPUs will have effectively twice the memory bandwidth, allowing two times as many cores to be used somewhat efficiently. But that's still nowhere near 512 cores in a shared memory system.

Quote:

I would like a preliminary configuration that I will use, mainly in terms of CPU and Memory
Configuration for each compute node: 2x AMD Epyc 7543, 16x16GB DDR4-3200 reg ECC
These are 32-core CPUs, so 64 cores per node. Which makes 8 of these nodes for 512 cores.

Quote:

With the EPYC 7763 (208 GB/s bandwidth and 5 TFLOPS), we have 41.6 GB/s per TFLOPS.
Is this the cause? And what is the optimal amount of memory bandwidth per TFLOPS for CFD?
More or less: yes.
Theoretical flops numbers can be tricky. Especially if they include AVX instructions. These can only be leveraged by highly optimized codes, which happen to be a good fit for vectorization. Real-world CFD codes will never get close to those numbers, even if we remove the memory bottleneck.
A much easier rule of thumb: 2-4 cores per memory channel. The Epyc 7543 sits right at the upper limit with 4 cores per memory channel, which is acceptable for open-source codes like OpenFOAM.
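
A quick sanity check of those numbers (core counts as above; 8 DDR4 channels per socket is the Epyc 7003-series spec):

Code:

# Derived numbers for the suggested node: 2x Epyc 7543, 16x 16GB DIMMs.
cores_per_socket = 32
channels_per_socket = 8      # DDR4 channels per Epyc 7003 socket
sockets_per_node = 2

cores_per_node = cores_per_socket * sockets_per_node        # 64
nodes_for_512 = 512 // cores_per_node                       # 8
cores_per_channel = cores_per_socket / channels_per_socket  # 4.0 -> upper end of the 2-4 rule
ram_per_node_gb = 16 * 16                                   # 256 GB, i.e. 4 GB per core

print(cores_per_node, nodes_for_512, cores_per_channel, ram_per_node_gb)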

Quote:

I have another question: will 10 Gigabit Ethernet (1.25 GB/s data transfer) be the bottleneck when transferring data between nodes?
Like I said, 10 Gigabit Ethernet can work for what you have in mind. But I would not recommend it. You have the budget for Infiniband.
What we talked about so far is "intra-node scaling". I.e. how much speedup you get from increasing the number of threads in a single node. And why there is a limit here.
The node interconnect determines "inter-node scaling". I.e. how much speedup you get by going from 1 to 2, 4, 8... compute nodes. Ethernet works OK for a low number of nodes that are relatively slow. The more nodes you connect, and the faster they are, the better the interconnect should be. If it's too slow, the same thing happens as within a node when running out of memory bandwidth: you get less-than-ideal inter-node scaling. Imagine you buy 8 of these nodes, but only get maximum performance equivalent to 6 of them, because the node interconnect is too slow. Not good.
With 8 of these fairly high-performance nodes, Infiniband is the way to go.
Node interconnects aren't just about sequential transfer rates. Latency is also important, and this is where Infiniband is way ahead of Ethernet.
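
Once the machines are in place, inter-node scaling is easy to quantify: run the same case on 1, 2, 4, 8 nodes and compare wall-clock times (the timings below are placeholders, not measurements):

Code:

def parallel_efficiency(t_one_node_s, t_n_nodes_s, n_nodes):
    # Speedup relative to a single node, divided by the node count.
    return (t_one_node_s / t_n_nodes_s) / n_nodes

# The "8 nodes performing like 6" example from above: an 8-node run that is
# only 6x faster than the single-node run has 75% inter-node efficiency.
print(parallel_efficiency(t_one_node_s=4800.0, t_n_nodes_s=800.0, n_nodes=8))  # 0.75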

hiep.nguyentrong October 18, 2022 05:05

Many thanks for the information.
It helped me understand a lot.

Recently, the new Fluent version can run natively on the GPU (no data transfer between CPU and GPU) for unstructured grids. Is it better to build a configuration that includes GPUs rather than just CPUs, and will OpenFOAM do the same as Fluent?

I found RapidCFD, which runs on the GPU similarly to Fluent, but it hasn't been updated in 3 years (I haven't tried it yet).

flotus1 October 18, 2022 07:10

You will have to ask about the plans for GPU acceleration in the OpenFOAM part of the forum. I assume you read the section on GPU acceleration in the thread I linked earlier: https://www.cfd-online.com/Forums/ha...dware-wip.html
My opinion on GPU acceleration hasn't changed much. Unless you are absolutely sure that the stuff you need OpenFOAM to do benefits enough from GPU acceleration NOW, don't invest into GPUs.

hiep.nguyentrong October 18, 2022 07:43

It's not GPU acceleration. The solver runs natively on the GPU, with no data transfer between CPU and GPU (which is the bottleneck of GPU acceleration).
That is in the new 2022 R2 release, with a limited set of solvers, but I cannot find any papers about native GPU solvers (except for LBM solvers).

flotus1 October 18, 2022 08:02

Terminology isn't the problem here.
It's that codes running on GPUs don't have feature parity with CPU codes. If you have all you need with the current GPU implementation, and it actually runs faster: great.
But that's something you will have to verify yourself.

I get that you want to avoid a cluster, because it all seems complicated.
But if you want to avoid distributed memory by going multi-GPU with OpenFOAM, you are jumping out of the frying pan right into the fire.

