Help with building a WS or mini server (256 physical cores or more)

October 14, 2022, 03:08   #1
Nguyen Trong Hiep (Member)
Join Date: Aug 2018
Posts: 48
Hi all,
I want to get a workstation or mini server to run OpenFOAM on 256-512 physical cores.
I don't have much experience with running in parallel on a cluster (I tested it over a 1 Gbps Ethernet cable and it was 10 times slower).
Therefore, I would like to build a workstation-like configuration and run it as a single computer. However, the latest CPUs have only 64 cores per socket and at most 2 sockets per motherboard.
Is there any solution to this problem?

If not, a mini server will be fine, but I will need support for setting up the server.
My budget is $200k.

I found a configuration on the internet; it has 4 nodes in one chassis, and each node has 2 CPUs. Do I need to set up a cluster and an InfiniBand connection between the nodes?
Quote:
Gigabyte Servers Gigabyte H262-Z62 - 2U - AMD EPYC - 24x NVMe - 8 x 1GbE LAN ports - 2200W Redundant
Processor 8x AMD EPYC 7763 - 64 Cores 2.45GHz/3.5GHz, 256MB Cache (280Watt)
Memory 64x 32GB 3200MHz DDR4 ECC Registered DIMM Module
M.2 Internal Solid State Drives 4x 2000GB Samsung 970 EVO NVME M.2 SSD
Total Price: £86,728.46
Lease From: £1,834.31 per month over 5 years, or £2,834.29 over 3 years.
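For reference, here is what that quoted configuration adds up to, sketched in Python using only the figures listed above:
Code:
# Rough totals for the quoted Gigabyte H262-Z62 configuration,
# using only the figures listed in the quote above.
cpus = 8             # 8x AMD EPYC 7763
cores_per_cpu = 64   # 64 cores each
dimms = 64           # 64x 32 GB DDR4-3200 registered ECC
gb_per_dimm = 32

total_cores = cpus * cores_per_cpu   # 512 physical cores across the 4 nodes
total_ram_gb = dimms * gb_per_dimm   # 2048 GB of RAM in total
print(total_cores, "cores,", total_ram_gb, "GB RAM")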

October 14, 2022, 05:12   #2
flotus1 (Alex, Super Moderator)
Join Date: Jun 2012
Location: Germany
Posts: 3,398
The only reasonable way to run this many threads is with a cluster. Even if there were CPUs available with 128 or 256 cores, you would not get what you expect. Scaling at such high core counts within a single CPU is pretty flat. I.e., going from 32 cores to 64 cores, you might expect to double the performance, but all you get is maybe a 10-15% increase. Shared CPU resources, especially memory bandwidth, create a bottleneck.
See also: General recommendations for CFD hardware [WIP]
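To illustrate the point, here is a toy model (placeholder numbers, not measurements) where throughput is capped by whichever runs out first: cores or memory bandwidth.
Code:
# Toy illustration of why per-CPU scaling flattens out: a memory-bound solver
# can only go as fast as the memory system feeds it, no matter how many cores
# are added. The numbers below are made-up placeholders, not benchmarks.
def relative_throughput(cores, per_core_rate=1.0, bandwidth_cap=36.0):
    # Compute-limited rate grows with the core count; the bandwidth cap does not.
    return min(cores * per_core_rate, bandwidth_cap)

for n in (16, 32, 64):
    print(n, "cores ->", relative_throughput(n))
# With this hypothetical cap, going from 32 to 64 cores gains only about 12%,
# which is the kind of flat scaling described above.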

What you need is a cluster made from nodes with 2x32-core CPUs. Maybe 48-core CPUs if you want to increase density at the cost of lower maximum performance. And a shared file system that all nodes can access.
For 512 cores total, that's 8 compute nodes. 10 Gigabit Ethernet CAN work with this relatively low number of nodes, but I would rather opt for a proper HPC interconnect like InfiniBand here.
There is just no way around this that would make sense.
The business you buy these computers from will be happy to assist you in choosing the right adapters and a switch for the node interconnect.
Setting that up to run OpenFOAM in parallel requires some research, but it's not rocket science. There are guides online, and you can certainly ask here if you run into problems. Or you can make some room in the 200k$ budget to hire someone to walk you through the setup.
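As a rough sketch of what the launch side of such a run eventually looks like (hostnames and the solver are placeholders; this assumes an Open MPI-style hostfile and the usual decomposePar/mpirun workflow):
Code:
# Minimal sketch for launching an 8-node x 64-core OpenFOAM run.
# Hostnames below are placeholders; use whatever your nodes are actually called.
nodes = [f"node{i:02d}" for i in range(1, 9)]   # node01 .. node08
cores_per_node = 64
total = len(nodes) * cores_per_node             # 512 ranks

# Open MPI-style hostfile: one line per node with its slot count.
with open("hostfile", "w") as f:
    for n in nodes:
        f.write(f"{n} slots={cores_per_node}\n")

# The case's system/decomposeParDict needs numberOfSubdomains set to 'total'.
# The workflow is then roughly:
#   decomposePar
#   mpirun --hostfile hostfile -np 512 simpleFoam -parallel
#   reconstructPar
print(f"hostfile written for {total} ranks")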

October 17, 2022, 03:33   #3
Nguyen Trong Hiep (Member)
Join Date: Aug 2018
Posts: 48
Thank you for the reply.

Before hiring an HPC consultant, I would like a preliminary configuration to work from, mainly in terms of CPU and memory (I prefer a workstation because it is simple and easy to use).

About the memory bandwidth bottleneck: we are currently using an HPC system with E5-2696 v4 CPUs (76.8 GB/s maximum memory bandwidth and 0.7 TFLOPS FP32 across 20 cores). That means we get about 110 GB/s of bandwidth per TFLOPS.
With the Epyc 7763 (208 GB/s bandwidth and 5 TFLOPS), we get 41.6 GB/s per TFLOPS.
Is this the cause? And what is the optimal amount of memory bandwidth per TFLOPS for CFD?
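To make the arithmetic behind that comparison explicit (using only the figures quoted above):
Code:
# Memory bandwidth per unit of FP32 peak, from the figures quoted above.
e5_2696_v4 = 76.8 / 0.7   # ~109.7 GB/s per TFLOPS
epyc_7763 = 208.0 / 5.0   # 41.6 GB/s per TFLOPS
print(f"E5-2696 v4: {e5_2696_v4:.1f} GB/s per TFLOPS")
print(f"EPYC 7763:  {epyc_7763:.1f} GB/s per TFLOPS")
# Per unit of peak compute, the older Xeon has roughly 2.6x more memory
# bandwidth, which is why core count and peak FLOPS alone are misleading.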

I have another question: will 10 Gigabit Ethernet (1.25 GB/s transfer rate) be a bottleneck when transferring data between nodes?

October 17, 2022, 05:25   #4
flotus1 (Alex, Super Moderator)
Join Date: Jun 2012
Location: Germany
Posts: 3,398
Quote:
I prefer a workstation because it is simple and easy to use
I read that between the lines.
But again, it is just not feasible at the moment, even if we had CPUs with this many cores.
The next generation of Epyc CPUs will have effectively twice the memory bandwidth, allowing two times as many cores to be used somewhat efficiently. But that's still nowhere near 512 cores in a shared memory system.

Quote:
I would like a preliminary configuration to work from, mainly in terms of CPU and memory
Configuration for each compute node: 2x AMD Epyc 7543, 16x16GB DDR4-3200 reg ECC
These are 32-core CPUs, so 64 cores per node. Which makes 8 of these nodes for 512 cores.

Quote:
With the Epyc 7763 (208 GB/s bandwidth and 5 TFLOPS), we get 41.6 GB/s per TFLOPS.
Is this the cause? And what is the optimal amount of memory bandwidth per TFLOPS for CFD?
More or less: yes.
Theoretical FLOPS numbers can be tricky, especially when they include AVX instructions. Those can only be leveraged by highly optimized codes that happen to be a good fit for vectorization. Real-world CFD codes will never get close to those numbers, even if we remove the memory bottleneck.
A much easier rule of thumb: 2-4 cores per memory channel. The Epyc 7543 sits right at the upper limit with 4 cores per memory channel, which is acceptable for open-source codes like OpenFOAM.
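Applying that rule of thumb to the CPUs mentioned in this thread (channel counts are the published per-socket figures: 4 DDR4 channels for the E5 v4 Xeons, 8 for these Epyc parts):
Code:
# Cores per memory channel for the CPUs discussed in this thread.
# Channel counts per socket: E5 v4 Xeons have 4 DDR4 channels,
# Epyc Milan (7543/7763) has 8 DDR4 channels.
cpus = {
    "E5-2696 v4 (20 cores as used above)": (20, 4),
    "EPYC 7543": (32, 8),
    "EPYC 7763": (64, 8),
}
for name, (cores, channels) in cpus.items():
    print(f"{name}: {cores / channels:.1f} cores per channel")
# The 7543 lands right at the upper end of the 2-4 cores/channel guideline;
# the 7763 is well past it for bandwidth-bound CFD workloads.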

Quote:
I have another question: will 10 Gigabit Ethernet (1.25 GB/s transfer rate) be a bottleneck when transferring data between nodes?
Like I said, 10 Gigabit Ethernet can work for what you have in mind. But I would not recommend it. You have the budget for InfiniBand.
What we talked about so far is "intra-node scaling", i.e. how much speedup you get from increasing the number of threads in a single node, and why there is a limit here.
The node interconnect determines "inter-node scaling", i.e. how much speedup you get by going from 1 to 2, 4, 8... compute nodes. Ethernet works OK for a low number of nodes that are relatively slow. The more nodes you connect, and the faster they are, the better the interconnect should be. If it is too slow, the same thing happens as within a node when running out of memory bandwidth: you get less-than-ideal inter-node scaling. Imagine you buy 8 of these nodes but only get the maximum performance equivalent of 6 of them, because the node interconnect is too slow. Not good.
With 8 of these fairly high-performance nodes, InfiniBand is the way to go.
Node interconnects aren't just about sequential transfer rates. Latency is also important, and this is where InfiniBand is way ahead of Ethernet.
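A crude way to picture the "paying for 8 nodes, getting 6" scenario (illustrative only; the communication overhead below is a made-up placeholder, not a measurement):
Code:
# Toy inter-node scaling model: each added node contributes compute, but a slow
# interconnect adds communication time that grows with the node count.
# The overhead fraction is a made-up placeholder, not a benchmark.
def effective_nodes(n_nodes, comm_overhead_per_node=0.04):
    # Parallel efficiency drops as communication takes a larger share of each step.
    efficiency = 1.0 / (1.0 + comm_overhead_per_node * (n_nodes - 1))
    return n_nodes * efficiency

for n in (2, 4, 8):
    print(f"{n} nodes -> performance of roughly {effective_nodes(n):.1f} nodes")
# With this placeholder overhead, 8 nodes behave like roughly 6 nodes -
# exactly the kind of loss a low-latency interconnect is meant to avoid.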

October 18, 2022, 05:05   #5
Nguyen Trong Hiep (Member)
Join Date: Aug 2018
Posts: 48
Many thanks for the information.
It helped me understand a lot.

Recently, the new Fluent version can run natively on the GPU (no data transfer between CPU and GPU) for unstructured grids. Is it better to build a configuration that includes GPUs rather than just CPUs, and will OpenFOAM do the same as Fluent?

I found RapidCFD, which runs on the GPU similarly to Fluent, but it has not been updated in 3 years (I haven't tried it yet).

October 18, 2022, 07:10   #6
flotus1 (Alex, Super Moderator)
Join Date: Jun 2012
Location: Germany
Posts: 3,398
You will have to ask about the plans for GPU acceleration in the OpenFOAM part of the forum. I assume you read the section on GPU acceleration in the thread I linked earlier: General recommendations for CFD hardware [WIP]
My opinion on GPU acceleration hasn't changed much. Unless you are absolutely sure that the stuff you need OpenFOAM to do benefits enough from GPU acceleration NOW, don't invest into GPUs.

October 18, 2022, 07:43   #7
Nguyen Trong Hiep (Member)
Join Date: Aug 2018
Posts: 48
It's not GPU acceleration. The solver runs natively on the GPU, with no data transfer between CPU and GPU (that transfer is the bottleneck of GPU acceleration).
It is in the new 2022 R2 release with a limited set of solvers, but I cannot find any paper about native GPU solvers (except for LBM solvers).

October 18, 2022, 08:02   #8
flotus1 (Alex, Super Moderator)
Join Date: Jun 2012
Location: Germany
Posts: 3,398
Terminology isn't the problem here.
It's that codes running on GPUs don't have feature parity with CPU codes. If you have all you need with the current GPU implementation, and it actually runs faster: great.
But that's something you will have to verify yourself.

I get that you want to avoid a cluster because it all seems complicated.
But if you want to avoid distributed memory by going multi-GPU with OpenFOAM, you are jumping out of the frying pan right into the fire.

Last edited by flotus1; October 19, 2022 at 03:52.
