CFD Online Discussion Forums - Would you suggest Infiniband?

Would you suggest Infiniband?

Hello,

We currently have a Cisco cluster made up of 4 nodes. Each node has 2x Intel E5-2630 v3 CPUs, with 8 cores each @ 2.4GHz. There's 64GB of RAM in each node (8x8GB). The nodes are in a Cisco rack which I believe has a 2x10GB/s connection between the nodes.
One of the nodes is also used as the headnode.

I am looking to upgrade the cluster and have around £20,000 to spend, and I'd like to get the best performance for the money.

The cluster is used mostly for a mix of Star-CCM+ and Converge CFD in-cylinder simulations with a detailed chemistry solver.

I was thinking of purchasing 4 more nodes with 2x E5-2630 v4 CPUs (10 cores @ 2.4GHz) in each node, giving an additional 80 cores, along with 8x8GB of RAM per node. I believe that the additional nodes can be installed in the same Cisco rack and use the 2x10GB/s connection.

Do you think that my plan is a sensible way of proceeding? Would we benefit from spending some of the money on an Infiniband connection between the nodes?
We are limited to purchasing Cisco hardware.

Thanks for your time.

Hello,

For the price range you have in mind and the specs you have mentioned, I would recommend getting in touch with Totalsim, www.totalsim.co.uk. They are based in Brackley.

My company bought a similar cluster configuration from them, which has worked for two years without a single glitch.

Best regards,
Borian

Jonny,
You always want to remove bottlenecks where they exist and, in my experience, network latency and bandwidth are common bottlenecks.
If you want to test just how much, run a job in STAR-CCM+ on 16 cores, specifically filling one node completely. I suggest a single-phase simulation with relatively simple physics, as these scale well, and if you keep it relatively light (say 2 million cells) you're not pushing the RAM too hard either.

Then rerun the job on 16 cores, but this time with 8 cores on one node and 8 cores on another. If you can, fully reserve both nodes so no other jobs run on them. (You could submit a job with 32 cores, i.e. both nodes full, but specify a machine file for STAR-CCM+ so it only runs on 8 cores on each node.)
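For the split run, the machine file can be generated rather than typed by hand. This is only a rough sketch: the hostnames `node01` and `node02` are made up, and the `host:cores` line format is an assumption on my part, so check the STAR-CCM+ documentation for your version and MPI before relying on it.

```python
def machinefile_lines(hosts, cores_per_host):
    """Return one 'host:cores' line per node (format is an assumption)."""
    if cores_per_host < 1:
        raise ValueError("cores_per_host must be at least 1")
    return [f"{host}:{cores_per_host}" for host in hosts]


# Hypothetical hostnames; replace with the names of your two reserved nodes.
hosts = ["node01", "node02"]

# 8 cores on each of the two nodes gives the same 16 cores as the single-node run.
lines = machinefile_lines(hosts, 8)
machinefile_text = "\n".join(lines) + "\n"
print(machinefile_text, end="")
```

Write `machinefile_text` to a file and point the split-node job at it, so the core count stays at 16 and only the placement changes.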

Time each run. STAR-CCM+ now includes reports that give the simulation or iteration elapsed time, but ideally you'd write a simple Java macro that:
- gets the system time
- runs the simulation
- gets the system time again
- computes the run time as (end - start).
The macro approach reduces overhead and the chance of additional bandwidth being used across the network.
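The four timing steps follow a generic pattern, sketched here in plain Python rather than as an actual STAR-CCM+ macro. `fake_solver` is a hypothetical stand-in for the solver run; inside STAR-CCM+ the same start/run/stop/subtract logic would live in the Java macro.

```python
import time


def timed_run(run_simulation):
    """Time a single call to run_simulation and return the elapsed seconds."""
    start = time.perf_counter()   # get the time before the run
    run_simulation()              # run the simulation
    end = time.perf_counter()     # get the time after the run
    return end - start            # run time = end - start


def fake_solver():
    """Hypothetical stand-in for the solver run; it just sleeps briefly."""
    time.sleep(0.05)


elapsed = timed_run(fake_solver)
print(f"run time: {elapsed:.3f} s")
```

Timing only the solver call, and not case loading or saving, keeps file I/O out of the comparison.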

Do this test a few times in each configuration.

As the only difference is running across the network versus running 'locally'*, this should give you an idea of how much of a bottleneck the network is. I'd be surprised if it's none at all. Bear in mind that more complex cases with combustion, where there will be lots of inter-CPU communication, will exacerbate this bottleneck significantly.
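Once you have a few timings for each configuration, the comparison is just a ratio of averages. A minimal sketch, where the numbers are placeholders to show the arithmetic and not measurements from any real cluster:

```python
from statistics import mean


def network_penalty(single_node_times, split_node_times):
    """Return (mean local time, mean split time, slowdown in percent)."""
    t_local = mean(single_node_times)
    t_split = mean(split_node_times)
    slowdown_pct = (t_split - t_local) / t_local * 100.0
    return t_local, t_split, slowdown_pct


# Placeholder wall-clock times in seconds (NOT real results).
single = [100.0, 102.0, 98.0]   # 16 cores on one node
split = [110.0, 112.0, 108.0]   # 8 cores on each of two nodes

t_local, t_split, penalty = network_penalty(single, split)
print(f"one node: {t_local:.1f} s, two nodes: {t_split:.1f} s, "
      f"slowdown: {penalty:.1f} %")
```

If the spread between repeats in one configuration is as large as the difference between the two configurations, run more repeats before drawing any conclusion.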

In my opinion, if you can get Infiniband, then do so. But as with most things, this is always going to be case/system dependent!

*using the consideration that two CPUs in one node is equivalent to one local run - that's not really the case, but I'm trying to keep this test simple!