CFX-5.7.1(Linux) Parallel - 4 CPU Machine

May 25, 2005, 14:33
CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#1
James Date
Please could someone tell me if I'm setting this up right? I'm testing CFX-5.7.1 on a Sun Fire V40z Opteron machine with 4 CPUs (local). The system is running Red Hat 9.

The parallel CFX installation instructions go on about installing the rsh service etc. Is this necessary if you are using a single 4-CPU (local) machine rather than separate machines? I haven't yet set up the rsh service, but I have modified the number of CPUs specified in the host.ccl file and set it to 4.

I've tried a couple of benchmark test cases and the speedup is pretty poor with both the PVM and MPICH (local) solvers: I'm getting a 2x speedup on 4 CPUs. My problem consists of 200,000 hex elements, i.e. 50,000 per partition. I'm going to try a much bigger problem tomorrow, since the manual does state that good speedups are usually only found for hex meshes where each partition has more than 75,000 nodes.

Have I missed something with regard to the set up of the parallel solve capability on this Linux system?

I have noticed that each PVM or MPICH process is running at 25%. Does this mean each CPU is really running at 100%, i.e. each process is using 25% of the total system resource?
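
Whether 25% means a saturated CPU depends on how the monitoring tool normalizes its figures. As a rough sketch, assuming the tool reports each process as a share of the whole machine rather than of one CPU (the function name is illustrative and not part of CFX or any monitoring tool), the conversion is:

```python
def per_cpu_utilization(system_share_percent: float, n_cpus: int) -> float:
    """Convert a share of total machine capacity into a share of one CPU."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return system_share_percent * n_cpus


# Four solver processes, each shown at 25% of a 4-CPU machine
print(per_cpu_utilization(25.0, 4))  # 100.0
```

If the tool instead reports per-CPU percentages, 25% would mean each process is mostly idle, so it is worth checking which convention the tool uses.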

I'm a tad confused; the Windows parallel installation seems much more straightforward! Any tips would be really helpful, along with anyone else's observations with regard to parallel speed up on similar machines.

James

May 25, 2005, 18:29
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#2
Glenn Horrocks
Hi James,

Do you have multiple domains or interfaces? They reduce parallel efficiency quite a bit.

Regards, Glenn Horrocks

May 25, 2005, 18:48
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#3
James Date
Glenn,

My model contains one domain with no interfaces, and is a single-phase incompressible flow. The domain is made up of pure hex cells.

I always thought that local parallel solves were always better than distributed!

James

May 26, 2005, 18:31
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#4
Glenn Horrocks
Hi,

No, distributed parallel is better for Intel x86 and the AMD machines I have seen so far. These machines share, to some extent, a single path from the CPUs to memory, and that creates a bottleneck. This bottleneck is more significant than the inter-process bottleneck due to the network connection in distributed parallel. Local parallel on the AMD Opteron is better than on Intel x86, but distributed parallel is better still for both chips.

Regards, Glenn Horrocks

May 29, 2005, 09:48
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#5
Charles
Local parallel on a quad Opteron is not a bad call. Each CPU has its own bank of memory, so bandwidth shouldn't be an issue. The Opteron architecture supports Non-Uniform Memory Access (NUMA), whereby in principle you can restrict each CPU to use primarily its own bank of memory, which should remove the memory bottleneck. In practice this is not supported on all operating systems, so there may be some crossing over of memory access, especially if you don't have process affinity set to restrict each process to one CPU. Nevertheless, I have benchmarked a barebones quad Opteron running a realistic case (1.2 million cells) under Linux (no NUMA, no process affinity set), and got the following speed-up ratios:

1 process    1.00
2 processes  1.99
3 processes  2.89
4 processes  3.49
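
To put the speed-up ratios quoted above in perspective, here is a minimal sketch that converts them to parallel efficiency (speedup divided by process count) and to the Karp-Flatt experimentally determined serial fraction. The function names are illustrative; only the four ratios come from the benchmark above.

```python
# Speed-up ratios quoted above: number of processes -> measured speedup
speedups = {1: 1.00, 2: 1.99, 3: 2.89, 4: 3.49}


def efficiency(speedup: float, n_procs: int) -> float:
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / n_procs


def serial_fraction(speedup: float, n_procs: int) -> float:
    """Karp-Flatt metric: experimentally determined serial fraction."""
    if n_procs < 2:
        raise ValueError("needs at least 2 processes")
    return (1.0 / speedup - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


for n, s in speedups.items():
    line = f"n={n}: efficiency {efficiency(s, n):.1%}"
    if n > 1:
        line += f", serial fraction {serial_fraction(s, n):.3f}"
    print(line)
```

With these numbers the efficiency at 4 processes is about 87%. A serial fraction that grows with the process count usually points to rising overhead (communication or memory contention) rather than a fixed serial portion, though four data points are not enough to say which.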

I suspect that with NUMA and process affinity the speedup would have been a bit better, but I didn't check that. In my experience a quad Opteron with about 8 GB of RAM is a really, really useful piece of equipment, provided you can keep it somewhere where noise is not an issue!
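
For readers unfamiliar with process affinity, the sketch below shows the operating-system mechanism on Linux using Python's `os.sched_setaffinity`. It is only an illustration under stated assumptions: it pins the calling Python process, not a CFX solver process, and `pin_to_cpu` is a hypothetical helper name. On current Linux systems, command-line tools such as `taskset` or `numactl` are the more usual way to apply the same restriction to an existing program.

```python
import os


def pin_to_cpu(cpu_index: int, pid: int = 0) -> set:
    """Restrict a process (0 = the calling process) to a single CPU.

    Uses os.sched_setaffinity, which is only available on Linux.
    """
    allowed = os.sched_getaffinity(pid)
    if cpu_index not in allowed:
        raise ValueError(f"CPU {cpu_index} not in allowed set {sorted(allowed)}")
    os.sched_setaffinity(pid, {cpu_index})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin this process to the lowest-numbered CPU it may currently use
    first = min(os.sched_getaffinity(0))
    print(pin_to_cpu(first))
```

Affinity alone only controls where a process runs; keeping its memory on the local bank additionally depends on NUMA support in the kernel.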

May 31, 2005, 21:49
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#6
Jeff
Note in Charles's case that the speedup starts to fall away from linear after 2 processes (3 processes give only 2.89x and 4 processes only 3.49x). Fewer than 500K nodes per process is sub-optimal, and performance really drops off: below that limit, the processes spend more time communicating than they do solving each piece of the problem (too many cooks in the kitchen).

James, it's not surprising that your smallish problem shows a sub-optimal speed-up. Try a 2 million node problem and you'll get 4X.
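
The rule of thumb can be written as a one-line calculation. This is a sketch only: the two thresholds are the ones quoted in this thread (75,000 nodes from the manual, 500K nodes from this post), neither is verified here, and James's 200,000 hex elements are treated as roughly 200,000 nodes. The function name is illustrative.

```python
def max_useful_processes(total_nodes: int, min_nodes_per_partition: int) -> int:
    """Largest process count that keeps every partition at or above the threshold."""
    if total_nodes < 1 or min_nodes_per_partition < 1:
        raise ValueError("both arguments must be positive")
    return max(1, total_nodes // min_nodes_per_partition)


# Thresholds quoted in this thread; neither is verified here
print(max_useful_processes(200_000, 75_000))     # 2
print(max_useful_processes(2_000_000, 500_000))  # 4
```

Under the manual's threshold, the 200,000-node case supports about 2 processes, which is at least consistent with the 2x speedup observed on 4 CPUs.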

As an aside, I know that the Intel machines are coming with something called "Advanced multi-threading" (AMT). It effectively lets multiple (four or more) threads run on a two-CPU box. The default machine state (as delivered) has AMT turned on with four threads, even if the machine only has two CPUs in it! All diagnostic tools from Windows or Linux show that the machine has 4 CPUs when it really only has two! This means each CFX process only runs on half a CPU, even a single serial process. Luckily it can readily be turned off in the CMOS. I don't know if your chipset has this feature, but don't be fooled into thinking you've got 4 CPUs if there are only two in the box.
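
One way to check how many physical cores sit behind the CPU count the system reports is to compare the `processor` entries in `/proc/cpuinfo` with the distinct (`physical id`, `core id`) pairs. The sketch below assumes a Linux x86 system whose kernel exposes those fields (older kernels may not, in which case it falls back to the logical count); `count_cpus` and the sample text are illustrative.

```python
def count_cpus(cpuinfo_text: str) -> tuple:
    """Return (logical, physical_cores) parsed from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one processor record
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    # Fall back to the logical count when topology fields are absent
    return logical, (len(cores) or logical)


# Made-up sample: two single-core packages, each presenting two logical CPUs
sample = """processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0

processor : 2
physical id : 3
core id : 0

processor : 3
physical id : 3
core id : 0
"""
print(count_cpus(sample))  # (4, 2)
```

On a real machine the text would come from `open("/proc/cpuinfo").read()`.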

Jeff

June 14, 2005, 18:03
Re: CFX-5.7.1(Linux) Parallel - 4 CPU Machine
#7
Stevie Wonder
Hi Jeff,
This "Advanced multi-threading" (AMT) you mentioned: is it related to (or a successor of) the "Hyper-Threading" technology (HT)?


I found this discussion quite interesting, actually. The place I work has one cluster with several nodes. Each node has 2 CPUs. Since HT is on by default, my Linux box (CentOS 3 installation) sees 4 CPUs per node.

I remember one test I made. I ran a test case with 4 million nodes (tetra/prism), SS, k-eps, very simple set-up, and solved it as follows:


a) 4 CPUs, 2 processes per node - HT on
b) 8 CPUs, 4 processes per node - HT on
c) 4 CPUs, 2 processes per node - HT off

The machines were the same and I started all test cases from the same node. I got a 10% speed-up of b) over a). There were no big differences between a) and c).

Anyway, in my test case I did not observe the memory bottleneck issue you mentioned in one of your previous messages. Of course I should test more cases, but the tests I ran convinced me to leave this hardware feature on, which is the default anyway.

The only thing that worried me a bit was that I would have to convince my bosses to pay a bit more for parallel feature licenses if they want to get this 10% speed-up in the simulations.

Best wishes, S.W.
