Self-built InfiniBand cluster?

July 28, 2023, 05:28   #1
Freewill1 (New Member, Join Date: Aug 2014, Posts: 18)
Hi all,

We have five workstations at hand, each with
1. 1×32-core AMD EPYC 7532 CPU
2. 8×16GB DDR4 3200 RAM
3. 1×1TB Samsung 980 Pro SSD for storage

Each machine was configured to make the most of the EPYC CPU's 8-channel memory support by pairing a relatively low core count with all eight RAM slots populated. Not too outdated so far?

Now, we need to set up a small-scale cluster to run our own CFD code in parallel. The code is based on
1. a finite-volume discretization of the Navier-Stokes equations with a SIMPLE-like algorithm on a structured mesh;
2. a distributed-memory parallel strategy using the Message Passing Interface (MPI) to exchange the halo regions of the FV mesh after domain decomposition; each decomposed domain solves the equations independently on one or more CPU cores (a minimal sketch of this exchange pattern follows below).
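
To make the exchange pattern concrete, here is a minimal 1-D halo-exchange sketch in C with MPI. It is only an illustration of the idea described above: the file name, array sizes, and the 1-D decomposition are simplifications of my own, not our actual code.

Code:
/* halo.c -- minimal 1-D halo-exchange sketch (illustrative only, not the
 * actual solver): each rank owns NLOC interior cells of a scalar field plus
 * one ghost cell on each side, and swaps ghost layers with its neighbours.
 * Build with e.g.:  mpicc halo.c -o halo                                   */
#include <mpi.h>
#include <stdio.h>

enum { NLOC = 8 };                       /* interior cells per rank (toy size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double phi[NLOC + 2];                /* phi[0] and phi[NLOC+1] are ghosts  */
    for (int i = 0; i < NLOC + 2; ++i)
        phi[i] = -1.0;                   /* mark ghosts as "not yet received"  */
    for (int i = 1; i <= NLOC; ++i)
        phi[i] = (double)rank;           /* fill interior with the rank id     */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Swap halos: send my first interior cell to the left neighbour while
     * receiving my right ghost from the right neighbour, then the reverse.
     * MPI_Sendrecv avoids deadlock without manual send/recv ordering.       */
    MPI_Sendrecv(&phi[1],        1, MPI_DOUBLE, left,  0,
                 &phi[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&phi[NLOC],     1, MPI_DOUBLE, right, 1,
                 &phi[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n",
           rank, phi[0], phi[NLOC + 1]);

    MPI_Finalize();
    return 0;
}

Something like "mpirun -np 4 ./halo" runs it on a single workstation now; later the same binary could be launched with a hostfile listing the five nodes once the IB fabric is up (the exact launch options depend on the MPI distribution).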

So far, all programming and testing of the code's MPI parallelism have been confined to a single machine of the kind above, to mimic a cluster with intra-node data exchange only.

We intend to move beyond this single-node limitation in the near future and expect good (near-linear) speedup scaling with core count up to ~100 CPU cores.

To achieve this goal, I understand that the rate of inter-node data exchange would be the bottleneck (much as memory bandwidth can be within a node), so the InfiniBand (IB) network should be as fast as possible (i.e., as low in latency and as high in bandwidth as possible).

Therefore, what we want is a cluster built from the machines above (each serving as a node) connected by a small IB network.

The budget shouldn't be too high (<$800 per node).

A 100 or 200Gbps IB network is perhaps a good choice.

Theoretically, there are several simple ways to set up the network:
Case 1: a three-node cluster with a ring topology,
Case 2: a three-node cluster with a star topology,
Case 3: a five-node cluster with a ring topology,
Case 4: a five-node cluster with a star topology.

See my illustrative image here for clarity:
https://www.cfd-online.com/Forums/at...1&d=1690535873

Note that
Cases 1 and 3, with a ring topology, avoid the use of an expensive 100/200G IB switch;
Case 1 features direct node-to-node connections;
Cases 2 and 4, with a star topology, require the IB switch and can thus be ruled out.

I have little idea of how an IB network works or how to build an efficient one on a limited budget.

Here are my questions:
1. Can Cases 1 and 3 work without an IB switch, which is too expensive for us?

2. Can a 100 or 200Gbps IB network, with or without a switch, achieve the goal of near-linear scaling to ~100 cores (assuming the parallel algorithm of the code is efficient enough)?

3. If inter-node latency matters more than bandwidth, would an older low-latency 40 or 56Gbps network be sufficient?
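
(For reference, once any of these fabrics is up, the latency and bandwidth the code actually sees are easy to measure. Below is a minimal MPI ping-pong sketch I would use for that; the file name, message sizes, and repetition count are my own choices, and established tools such as the OSU micro-benchmarks or ib_send_lat/ib_write_lat from the perftest package do the same job far more rigorously.)

Code:
/* pingpong.c -- minimal MPI ping-pong sketch (illustrative): ranks 0 and 1
 * bounce messages of increasing size and report one-way latency and
 * bandwidth.  Run with at least two ranks, placed on different nodes to
 * exercise the interconnect rather than shared memory.                     */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps    = 1000;
    const int sizes[] = { 8, 1024, 65536, 1048576 };   /* message sizes, bytes */
    const int nsizes  = sizeof sizes / sizeof sizes[0];

    for (int s = 0; s < nsizes; ++s) {
        int n = sizes[s];
        char *buf = malloc(n);
        memset(buf, 0, n);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;   /* total time for `reps` round trips */

        if (rank == 0)
            printf("%9d bytes: %10.2f us one-way, %10.2f MB/s\n",
                   n,
                   dt / (2.0 * reps) * 1e6,
                   (double)n * 2.0 * reps / dt / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}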

Here are some Ansys demonstrations using AMD EPYC CPUs and 100 or 200Gbps IB networks, which show good scaling:
https://www.cfd-online.com/Forums/at...1&d=1690535832

https://www.cfd-online.com/Forums/at...1&d=1690535908

Thanks!

-----------------------------------------------
Prices of IB hardware for reference (US$, from eBay)

40G QDR:
Switch: Mellanox IS5023 18-port, US$100
NIC: Mellanox ConnectX-3 dual port, US$25

56G FDR:
Switch: Mellanox SX6036 36-port: US$100
NIC: Mellanox Connect-IB dual port: US$50

100G EDR, PCIe 4.0x16/3.0x16:
Switch: Mellanox SB7800 36-port: US$1,700
NIC: Mellanox ConnectX-5 dual port: US$300

200G HDR, PCIe 4.0x16/3.0x16:
Switch: NVidia Mellanox QM8700, US$5,000 each
NIC: NVidia Mellanox ConnectX-6 dual port, US$600

400G NDR, PCIe 5.0x16/4.0x16:
Switch: NVidia Mellanox QM9700, US$19,000
NIC: NVidia Mellanox ConnectX-7 dual port, US$800

*NIC: Network Interface Card
Attached Images: Benchmark Eypc7001.png, IB network.jpg, Benchmark Eypc7002.png


July 28, 2023, 13:10   #2
Will Kernkamp (Senior Member, Join Date: Jun 2014, Posts: 316)
Quote:
Originally Posted by Freewill1
Hi all,

We have five workstations at hand, each with
1. 1×32-core AMD EPYC 7532 CPU
2. 8×16GB DDR4 3200 RAM
3. 1×1TB Samsung 980 Pro SSD for storage

Here are my questions:
1. Can Cases 1 and 3 work without an IB switch, which is too expensive for us?

2. Can a 100 or 200Gbps IB network, with or without a switch, achieve the goal of near-linear scaling to ~100 cores (assuming the parallel algorithm of the code is efficient enough)?

3. If inter-node latency matters more than bandwidth, would an older low-latency 40 or 56Gbps network be sufficient?

Thanks!
Q1: Yes, an InfiniBand ring configuration works without a switch. I have done it.
Q2: Yes, a 100 or 200Gbps IB network can achieve near-linear scaling, because your cluster is very small.
Q3: Yes, the older 40 or 56Gbps networks will be fine. In fact, you will probably approach your performance goal even with 1Gbps Ethernet. This is because, normally, the network only has to exchange the boundary vectors between nodes, so the network bandwidth requirement is much lower than the local core-to-DRAM bandwidth needed for a solution iteration.
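
As a rough illustration (my own back-of-envelope estimate with assumed numbers, not a measurement): suppose each rank owns a cubic block of $N^3$ cells with $n_v$ double-precision variables and exchanges one halo layer per face per iteration. Then

$$\frac{\text{halo bytes per exchange}}{\text{field bytes per rank}} \approx \frac{6\,N^{2}\,n_v \cdot 8\ \text{B}}{N^{3}\,n_v \cdot 8\ \text{B}} = \frac{6}{N},$$

so with $N \approx 100$ (about a million cells per rank) the halo traffic is only a few percent of the data each node already streams through its own memory at least once per iteration.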



If you misconfigure the cluster so that each node has to re-read grid info from a single node at every iteration, the traffic would be much larger. Normally, repeated reads of the same info from a disk or shared volume end up cached in memory, so the data has to be read only once, not at every iteration.


Tags
cfd, cluster, infiniband, scaling, self-built





