CFD Online Discussion Forums


Joe July 10, 2006 12:18

CFX distributed computing: Cluster design Q
4 single socket systems vs. 2 dual socket systems ...

I'm interested in others' experiences with distributed MPICH solves over multiple single socket (i.e. 1 CPU) vs. multiple dual socket (i.e. 2 CPU) systems.

Let's assume the following idealised situation: 4 identical CPUs, say Opteron 285s. Identical amounts of RAM per CPU, say 1 GB. Identical motherboards, say one of the Tyan dual socket Opteron motherboards.

"Single socket used" configuration: 4 separate boxes with: 1 CPU per motherboard (odd I know ... humour me), 1 GB RAM per motherboard. Boxes linked via GigE with a switch with decent backplane bandwidth.

"Dual sockets used" configuration: 2 separate boxes with: 2 CPUs per motherboard, 2 GB RAM per motherboard. Boxes linked via GigE with a switch with decent backplane bandwidth.

Let's say I would use the distributed MPICH CFX solver option.

Let's assume I'm modelling a long pipe whose mesh is partitioned into 4 partitions longitudinally.

Q1: Would there be any difference in solving time between these two cluster configurations?

Q2: Is CFX coded such that physically adjacent mesh partitions are placed together on unified memory machines (i.e. 2 CPUs on a 2 socket motherboard) i.e. would the first two mesh partitions be placed together on the first 2 CPU box with the last two mesh partitions placed together on the second 2 CPU box, or are the partitions randomly assigned to CPUs in a distributed cluster?

I would have guessed that the introduction of an Ethernet link anywhere in the system would negate the theoretical advantages of using two dual CPU boxes?
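To make Q2 concrete, here is a minimal Python sketch that counts how many partition interfaces have to cross the Ethernet link for the long-pipe case. It assumes the four partitions form a simple chain (1-2-3-4), which follows from the longitudinal split described above; the function name and the box names are hypothetical, and nothing here reflects how CFX actually places partitions.

```python
def cross_host_interfaces(partition_hosts):
    """Count interfaces between neighbouring partitions of a 1-D chain
    whose two sides live on different hosts (i.e. must use the network)."""
    return sum(
        1
        for left, right in zip(partition_hosts, partition_hosts[1:])
        if left != right
    )


# Host of each partition, listed in pipe order (partition 1 first).
four_single = ["box1", "box2", "box3", "box4"]
two_dual_adjacent = ["box1", "box1", "box2", "box2"]
two_dual_interleaved = ["box1", "box2", "box1", "box2"]

print(cross_host_interfaces(four_single))           # 3
print(cross_host_interfaces(two_dual_adjacent))     # 1
print(cross_host_interfaces(two_dual_interleaved))  # 3
```

Under these assumptions the dual socket boxes only reduce network traffic if adjacent partitions end up on the same box; with an interleaved placement they cross the network as often as four single socket boxes would.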

I ask this because typically single socket CPUs and motherboards are a good deal cheaper than dual socket motherboards / CPUs.

Any thoughts / experiences?

Charles July 10, 2006 15:15

Re: CFX distributed computing: Cluster design Q
I get the impression that CFX is not that badly affected by the "slow" Ethernet interconnect. The pressure-based coupled solver does a lot of work per iteration, so computing time lost while exchanging information between grid partitions does not seem to be too much of an issue. By contrast, Fluent is very sensitive to the latency of the interconnect used, and I must assume that this is due to a much smaller amount of work being done per iteration with the segregated solver, and hence more frequent exchange of information. To put it differently, on a cluster of 21 dual-CPU machines with Gigabit Ethernet interconnect, I've found that it is easy to get, say, 95% average usage per CPU for CFX, but difficult to get more than 70% usage for Fluent.
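The argument about work per iteration can be illustrated with a very rough model: if each iteration is some compute time followed by a fixed wait for partition exchange, CPU usage is compute / (compute + wait). The sketch below is only that toy model; the timings are made-up numbers chosen to reproduce the 95% and 70% figures, not measurements, and the model ignores any overlap of communication with computation.

```python
def cpu_utilisation(compute_s, comm_wait_s):
    """Fraction of wall-clock time a CPU spends computing, assuming each
    iteration is compute followed by a non-overlapped communication wait."""
    return compute_s / (compute_s + comm_wait_s)


# Illustrative (hypothetical) timings per iteration, in seconds.
comm_wait = 0.3          # same interconnect cost for both solvers
heavy_iteration = 5.7    # lots of work per iteration (coupled solver)
light_iteration = 0.7    # little work per iteration (segregated solver)

print(f"{cpu_utilisation(heavy_iteration, comm_wait):.0%}")  # 95%
print(f"{cpu_utilisation(light_iteration, comm_wait):.0%}")  # 70%
```

With the same interconnect cost, the solver that does less work between exchanges shows the lower CPU usage, which is the pattern described above.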

Joe July 10, 2006 16:22

Re: CFX distributed computing: Cluster design Q
Fascinating. So on a 42 CPU cluster you're getting virtually linear scale-up? Damn, that's not to be sneezed at :) Are you using commodity GigE switches?

PS: I'm amazed that Fluent is still on the SIMPLE solver.

Joe July 10, 2006 16:41

Re: CFX distributed computing: Cluster design Q
Two more Qs if I may:

What OS are you using on the cluster? 64-bit Linux? Which distributed CFX solver type do you use?

Charles July 10, 2006 17:35

Re: CFX distributed computing: Cluster design Q
No, not quite linear scaling, but still pretty useful CPU usage. The point is that the Fluent CPU usage would drop off much quicker, but it is strongly problem size dependent. You can fiddle with clusters forever without doing any real work, or you can just get on with it. So, instead of putting time into doing scaling graphs for various model sizes, I've found it useful to monitor CPU usage; it seems to be a good way of checking that you're not wasting resources.

As I understand it Fluent now have a pressure-based coupled solver which is claimed to need far fewer iterations, so a guess would be that it would scale better with commodity hardware.

Mike July 11, 2006 08:39

Re: CFX distributed computing: Cluster design Q
I agree with Charles that an Ethernet connection doesn't really slow things down much. An exception to this is if you start getting up to 4 sockets per machine; then there's just not enough bandwidth for 4 partitions per connection. Two dual socket machines versus four single socket machines should be very close in speed. I expect there's a small theoretical difference somewhere (reduced memory latency, HyperTransport, ... not sure to be honest), but any difference should be hardly noticeable.

I believe the partitions are assigned in the order that the hosts are given when starting the run. So if you did:

cfx5solve -def file.def -par-dist 'host1*2,host2*2'

then partitions 1 and 2 would go to host 1 and partitions 3 and 4 would go to host 2. I don't think there's any guarantee that partition 1 is adjacent to partition 2, but if you look at "Real Partition Number" in CFX-Post for a few cases then I think this does tend to be the case.

Mike
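The ordering rule Mike describes can be sketched in a few lines of Python. This is only an illustration of his belief that partitions are handed out in host-list order; the function name is hypothetical, the 'host*count' format is taken from the command in his post, and treating a bare host name as one partition is an assumption. It does not call or reproduce anything from CFX itself.

```python
def assign_partitions(host_list):
    """Map partition numbers (starting at 1) to hosts, filling each host
    in the order given. Entries look like 'host1*2'; a bare 'host1' is
    assumed here to mean a single partition."""
    assignment = {}
    partition = 1
    for entry in host_list.split(","):
        host, _, count = entry.strip().partition("*")
        for _ in range(int(count) if count else 1):
            assignment[partition] = host
            partition += 1
    return assignment


print(assign_partitions("host1*2,host2*2"))
# {1: 'host1', 2: 'host1', 3: 'host2', 4: 'host2'}
```

Whether partitions 1 and 2 are actually neighbours in the mesh is a separate question, which is why checking "Real Partition Number" in CFX-Post is still worthwhile.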

Joe July 11, 2006 08:53

Re: CFX distributed computing: Cluster design Q
Thanks for the partition info... very interesting. I'll test that once the cluster is up and running.

What OS are you running your clusters on? Which CFX solver type?

Mike July 11, 2006 10:21

Re: CFX distributed computing: Cluster design Q
Running on Red Hat Linux 9 with the 32-bit solver.

Mike
