48 Core Cluster - GigE Network

September 8, 2011, 12:22
#1
Member
 
Ronald A. Lau
Join Date: Jul 2009
Location: Chicago
Posts: 30
I have a 48 core cluster made up of 4 servers, each with dual 6 core CPUs (Intel) on a GigE network. OS is Windows HPC 2008 R2, CFD software is Fluent v13.

When I use 24 cores on a parallel job, everything is great. CPU usage and network usage are both very high, and 100 iterations take 20 minutes.

When I use 36 cores, both CPU and network usage drop to near nothing, and 100 iterations take 6 hours.
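To put the two timings above on a common footing, here is a quick back-of-the-envelope calculation in Python. It uses only the figures quoted in this post (100 iterations in 20 minutes on 24 cores, 100 iterations in 6 hours on 36 cores); the function name is purely illustrative.

```python
def seconds_per_iteration(total_minutes, iterations):
    """Average wall-clock seconds per solver iteration."""
    return total_minutes * 60.0 / iterations

# Figures quoted in the post above.
t24 = seconds_per_iteration(20, 100)      # 24-core run: 100 iterations in 20 minutes
t36 = seconds_per_iteration(6 * 60, 100)  # 36-core run: 100 iterations in 6 hours

slowdown = t36 / t24

print(f"24 cores: {t24:.0f} s per iteration")
print(f"36 cores: {t36:.0f} s per iteration")
print(f"The 36-core run is {slowdown:.0f}x slower per iteration")
```

So the larger run is not merely scaling poorly: each iteration takes roughly 18 times longer than on 24 cores, which suggests the job is stalling rather than hitting a gradual communication limit.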

We have fixed all configuration issues, and each server is now identical in drivers and config.

Every benchmark I find published on the web for GigE stops at 24 cores. Is GigE just not capable of handling MPI traffic between more than 24 cores?

September 9, 2011, 14:24
#2
Senior Member
 
Robert
Join Date: Jun 2010
Posts: 117
We used to have a 64-core (16*4) system with Gig-E, and it worked OK when running on all cores, if not exactly linearly (we were using STAR-CD at the time). I forget the exact numbers, but say 75% parallel efficiency.
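For readers unfamiliar with the term, here is a small sketch of what a figure like "75% parallel efficiency" implies, assuming the usual definition (speedup over a single core, divided by the number of cores). The function names are illustrative, and the 75% value is the rough recollection quoted above, not a measured result.

```python
def parallel_efficiency(t_serial, t_parallel, n_cores):
    """Efficiency = speedup / cores, where speedup = t_serial / t_parallel."""
    speedup = t_serial / t_parallel
    return speedup / n_cores

def speedup_from_efficiency(efficiency, n_cores):
    """Invert the definition: the speedup implied by a given efficiency."""
    return efficiency * n_cores

# The rough figure quoted above: about 75% efficiency on 64 cores.
implied_speedup = speedup_from_efficiency(0.75, 64)
print(f"Implied speedup on 64 cores: {implied_speedup:.0f}x")
```

Under that definition, 75% efficiency on 64 cores means the job ran about as fast as it would on 48 perfectly scaling cores.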

We now have a bigger cluster with InfiniBand, and that does scale better.

Does it matter how you distribute the 36 cores among the 48 available? It seems strange that the CPU and network usage go to zero. Could you have some hardware or cabling issues? Does it matter which 24 cores you pick, or which machines they are on?

Do you run bonded Gig-E, which would double your nominal throughput?

Do you get a choice as to which MPI you run? On the STAR series of codes, HP-MPI seems to work best and is the most controllable.

Are you running hyperthreading?

Just some thoughts. I too hate these types of problems.

September 9, 2011, 14:50
#3
Member
 
Ronald A. Lau
Join Date: Jul 2009
Location: Chicago
Posts: 30
I previously ran a series of tests along the lines you describe: using different servers, checking server config, etc. We have 4 identical servers, each configured identically, from network mappings to hardware drivers. One server needed hyperthreading turned off, but that was corrected before I ran the tests.

Each server is dual socket, with a 6-core Intel Xeon in each socket. Any combination of 24 cores is OK, but any combination of 32 cores is really bad.

The MPI on Windows HPC Server 2008 R2 is "msmpi", so I don't doubt that it could be the issue.

The network is capable of passing enough data. With 24 cores, each server is passing ~5Mb/sec over the network, but with 32 cores that drops to <500kb/sec. From what I have read, the problem has to do with latency, not bandwidth: GigE latency is on the order of 50 microseconds, while InfiniBand latency is on the order of 1 microsecond.
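The latency-versus-bandwidth point above can be illustrated with a very rough cost model, in which sending one message costs a fixed latency plus the message size divided by the bandwidth. This is only a sketch: the two latency figures are the order-of-magnitude values quoted above, while the message sizes and the choice to give both links the same 1 Gbit/s bandwidth (to isolate the effect of latency) are assumptions made for illustration. It ignores contention, protocol overhead, and the MPI software stack.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Simplified cost of one message: fixed latency plus size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

GIGE_LATENCY_S = 50e-6          # ~50 microseconds, as quoted above for GigE
LOW_LATENCY_S = 1e-6            # ~1 microsecond, as quoted above for InfiniBand
BANDWIDTH_BYTES_PER_S = 125e6   # assumption: 1 Gbit/s on both links

for size in (1_024, 65_536, 1_048_576):  # illustrative message sizes in bytes
    t_gige = transfer_time(size, GIGE_LATENCY_S, BANDWIDTH_BYTES_PER_S)
    t_low = transfer_time(size, LOW_LATENCY_S, BANDWIDTH_BYTES_PER_S)
    latency_share = GIGE_LATENCY_S / t_gige
    print(f"{size:>9} bytes: 50 us link {t_gige * 1e6:8.1f} us, "
          f"1 us link {t_low * 1e6:8.1f} us, "
          f"latency share at 50 us: {latency_share:.0%}")
```

In this model, small messages are dominated by latency on the 50-microsecond link, while large messages are dominated by bandwidth. A solver that exchanges many small messages per iteration would therefore be far more sensitive to latency than the raw throughput figures suggest.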

I guess I'll just have to wait for Ansys Tech Support to let me know what performance they get.