CFX-5.7.1(Linux) Parallel - 4 CPU Machine

May 25, 2005, 14:33

Please could someone tell me if I'm setting this up right. I'm testing out CFX-5.7.1 on a Sun Fire V40z Opteron machine with 4 CPUs (local). The system is running Red Hat 9.

The parallel CFX installation instructions go on about installing the rsh service etc. Is this necessary if you are using a single 4-CPU (local) machine rather than separate machines? I haven't yet set up the rsh service, but I have modified the number of CPUs specified in the host.ccl file and set it to 4.

I've tried a couple of benchmark test cases and the speed-up is pretty poor using both the PVM and MPICH (local) solvers: I'm getting a 2x speed-up using 4 CPUs. My problem consists of 200,000 hex elements, i.e. 50,000 hex per partition. I'm going to try a much bigger problem tomorrow, since the manual does state that good speed-ups are usually only found for hex meshes where each partition has more than 75,000 nodes.

Have I missed something with regard to the set up of the parallel solve capability on this Linux system?

I have noticed that each PVM or MPICH process is running at 25%. Does this mean each CPU is really running at 100%, i.e. each process is using 25% of the total system resource?

I'm a tad confused; the Windows parallel installation seems much more straightforward! Any tips would be really helpful, along with anyone else's observations with regard to parallel speed up on similar machines.

James
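
For reference, the two figures quoted in the post above can be checked with a little arithmetic. This is only an illustrative sketch: the function names are made up for this note, and it assumes the process monitor reports each process as a share of the whole machine rather than of a single CPU.

```python
def parallel_efficiency(speedup, n_procs):
    """Fraction of ideal (linear) speed-up actually achieved."""
    return speedup / n_procs


def share_of_one_cpu(share_of_machine_percent, n_cpus):
    """Convert a whole-machine CPU share into a share of one CPU.

    Assumes the monitor normalises by the number of CPUs.
    """
    return share_of_machine_percent * n_cpus


# James's numbers: 2x speed-up on 4 CPUs, each solver process shown at 25%.
print(parallel_efficiency(2.0, 4))   # 0.5
print(share_of_one_cpu(25.0, 4))     # 100.0
```

Under that assumption, 25% per process on a 4-CPU box means each process is keeping one CPU fully busy, and a 2x speed-up on 4 CPUs corresponds to 50% parallel efficiency.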

May 25, 2005, 18:29

Hi James,

Do you have multiple domains or interfaces? They reduce parallel efficiency quite a bit.

Regards, Glenn Horrocks

May 25, 2005, 18:48

Glenn,

My model contains one domain with no interfaces, and is a single-phase incompressible flow. The domain is made up of pure hex cells.

I always thought that local parallel solves were always better than distributed!

James

May 26, 2005, 18:31

Hi,

No, distributed parallel is better for Intel x86 and the AMD machines I have seen so far. These machines share, to some extent, a single path from the CPUs to memory, and that creates a bottleneck. This bottleneck is more significant than the inter-process bottleneck caused by the network connection in distributed parallel. Local parallel on the AMD Opteron is better than on Intel x86, but distributed parallel is better still for both chips.

Regards, Glenn Horrocks

May 29, 2005, 09:48

Local parallel on a quad Opteron is not a bad call. Each CPU has its own bank of memory, so bandwidth shouldn't be an issue. The Opteron architecture supports Non-Uniform Memory Access (NUMA), whereby in principle you can restrict each CPU to use primarily its own bank of memory, which should remove the memory bottleneck. In practice this is not supported on all operating systems, so there may be some crossing over of memory access, especially if you don't have process affinity set to restrict each process to one CPU. Nevertheless, I have benchmarked a quad barebones Opteron, running a realistic case (1.2 million cells) under Linux (no NUMA, no process affinity set), and got the following speed-up ratios:

1 process     1.00
2 processes   1.99
3 processes   2.89
4 processes   3.49

I suspect that with NUMA and process affinity the speed-up would have been a bit better, but I didn't check that. In my experience a quad Opteron with about 8 GB of RAM is a really, really useful piece of equipment, provided you can keep it somewhere where noise is not an issue!

Charles
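
One way to read the speed-up ratios in the post above is to convert them into parallel efficiency and an estimated serial fraction. The sketch below uses the Karp-Flatt metric for the latter; that metric is brought in here purely for illustration and is not something the poster used. The function names are likewise made up for this note.

```python
# Speed-up ratios quoted in the post above (quad Opteron, 1.2 million cells).
speedups = {1: 1.00, 2: 1.99, 3: 2.89, 4: 3.49}


def efficiency(speedup, n_procs):
    """Achieved speed-up as a fraction of the ideal linear speed-up."""
    return speedup / n_procs


def karp_flatt(speedup, n_procs):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if n_procs < 2:
        raise ValueError("needs at least 2 processes")
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


for n, s in sorted(speedups.items()):
    line = f"{n} processes: efficiency {efficiency(s, n):.3f}"
    if n > 1:
        line += f", serial fraction {karp_flatt(s, n):.3f}"
    print(line)
```

A serial fraction that grows with the process count usually points to rising overhead (communication or memory contention) rather than a fixed serial part of the solver, though four data points are too few to say much.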

May 31, 2005, 21:49

Note in Charles's case that the speed-up starts to fall away from linear after 2 processes (i.e. 3 procs only give 2.89x and 4 procs only 3.49x). Less than 500K nodes per process is sub-optimal, and your performance really drops off. Below that size, the processes spend more time communicating than they do solving their own piece of the problem (too many cooks in the kitchen).

James, it's not surprising that your smallish problem shows a sub-optimal speed-up. Try a 2 million node problem and you'll get 4X.

As an aside, I know that the Intel machines are coming with something called "Advanced multi-threading". It effectively lets multiple (four or more) threads run on a two-CPU box. The default machine state (as delivered) has AMT turned on with four threads, even if the machine only has two CPUs in it! All diagnostic tools from Windows or Linux show that the machine has 4 CPUs when it really only has two! This means each CFX process only runs on half a CPU, even a single serial process. Luckily it can readily be turned off in the CMOS setup. I don't know if your chipset has this feature, but don't be fooled into thinking you've got 4 CPUs if there are only two in the box.

Jeff
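
The partition-size rules of thumb mentioned in this thread can be turned into a quick check. This is a rough sketch only: the helper names are made up for this note, the two thresholds are simply the figures quoted by James (from the manual) and by Jeff, and James's hex element count is treated as if it were a node count.

```python
def nodes_per_partition(total_nodes, n_partitions):
    """Average partition size for an even split."""
    return total_nodes / n_partitions


def max_partitions(total_nodes, min_nodes_per_partition):
    """Largest partition count that keeps every partition at or above the threshold."""
    return max(1, total_nodes // min_nodes_per_partition)


MANUAL_THRESHOLD = 75_000   # figure James quotes from the manual
JEFF_THRESHOLD = 500_000    # rule of thumb from the post above

print(nodes_per_partition(200_000, 4))            # 50000.0
print(max_partitions(200_000, MANUAL_THRESHOLD))  # 2
print(max_partitions(2_000_000, JEFF_THRESHOLD))  # 4
```

The two thresholds differ by a large factor and the thread does not settle which is closer to the truth, so the result is best treated as a starting point for benchmarking rather than a hard limit.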

June 14, 2005, 18:03

Hi Jeff,
Regarding the "Advanced multi-threading" (AMT) you mentioned: is it related to (or a successor of) the "Hyper-Threading" technology (HT)?

I found this discussion quite interesting, actually. The place I work has one cluster with several nodes. Each node has 2 CPUs. Since HT is on by default, my Linux box (CentOS 3 installation) sees 4 CPUs per node.

I remember one test I made. I ran a test case with 4 million nodes (tetra/prism), steady state, k-eps, very simple set-up, and solved it as follows:

a) 4 CPUs, 2 processes per node - HT on
b) 8 CPUs, 4 processes per node - HT on
c) 4 CPUs, 2 processes per node - HT off

The machines were the same and I started all test cases from the same node. I got a 10% speed-up for b) over a), and no big difference between a) and c).

Anyway, in my test case I did not observe the memory bottleneck issue you mentioned in one of your previous messages. Of course I should test more cases, but these tests convinced me to leave this hardware feature on, which is the default anyway.

The only thing that worries me a bit is that I would have to convince my bosses to pay a bit more for parallel feature licenses if they want to get this 10% speed-up in the simulations.

Best wishes, S.W.
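
To tell logical CPUs (as seen with Hyper-Threading on) from physical packages on a Linux node, one option is to compare the `processor` and `physical id` entries in `/proc/cpuinfo`. The sketch below is a hedged illustration: `count_cpus` is a made-up helper, the sample text is hypothetical, and older kernels may not report `physical id` at all, in which case the package count comes back as 0.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_packages) from /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value.strip())
    return logical, len(packages)


# Hypothetical excerpt for a dual-CPU node with Hyper-Threading enabled.
SAMPLE = """\
processor : 0
physical id : 0
processor : 1
physical id : 0
processor : 2
physical id : 1
processor : 3
physical id : 1
"""

logical, packages = count_cpus(SAMPLE)
print(logical, packages)  # 4 2
```

On a real node the text would come from reading `/proc/cpuinfo`; four logical CPUs on two packages matches the "4 CPUs per node" that the operating system reports in the post above.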

