CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > OpenFOAM Running, Solving & CFD

Parallel performance

Register Blogs Members List Search Today's Posts Mark Forums Read

Reply
 
LinkBack Thread Tools Display Modes
Old   August 29, 2006, 22:02
Default I have a case of about 1M cell
  #1
liu
Senior Member
 
Xiaofeng Liu
Join Date: Mar 2009
Location: State College, PA, USA
Posts: 118
Rep Power: 8
liu is on a distinguished road
I have a case of about 1M cells. I run it parallel on 32 partitions (16nodes X 2cores = 32).
I want to simulate a process which takes about 1800s in reality.
I run the case on the supercomputer for 12hours. It only simulates about 600s.

One thing I noticed is the time information in the log file:
at time step n:
ExecutionTime = 25185.6 s ClockTime = 43755 s
at time step n+1:
ExecutionTime = 25194.1 s ClockTime = 43769 s

The clocktime is almost twice that of execution time. Does execution time means CPU time and clocktime means CPU time plus communication time between nodes? Does it mean the program spent a lot of time just on waiting for data transfer?
__________________
Xiaofeng Liu, Ph.D., P.E.,
Assistant Professor
Department of Civil and Environmental Engineering
Penn State University
223B Sackett Building
University Park, PA 16802


Web: http://water.engr.psu.edu/liu/
liu is offline   Reply With Quote

Old   August 29, 2006, 22:07
Default For time step n+1: ExecutionT
  #2
liu
Senior Member
 
Xiaofeng Liu
Join Date: Mar 2009
Location: State College, PA, USA
Posts: 118
Rep Power: 8
liu is on a distinguished road
For time step n+1:
ExecutionTime = 8.5s ClockTime=14s
So about 5.5s is spent on communication?
__________________
Xiaofeng Liu, Ph.D., P.E.,
Assistant Professor
Department of Civil and Environmental Engineering
Penn State University
223B Sackett Building
University Park, PA 16802


Web: http://water.engr.psu.edu/liu/
liu is offline   Reply With Quote

Old   August 30, 2006, 03:16
Default The 'missing' time is probably
  #3
Super Moderator
 
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,416
Rep Power: 16
mattijs is on a distinguished road
The 'missing' time is probably spent waiting for communication. This is due to imperfect balancing and just purely the communication time and latency. What interconnect do you have?
mattijs is offline   Reply With Quote

Old   August 30, 2006, 07:00
Default It's a common problem: You nee
  #4
Member
 
Ola Widlund
Join Date: Mar 2009
Location: Sweden
Posts: 87
Rep Power: 8
olwi is on a distinguished road
It's a common problem: You need a lot of cells in each partition for the cpu:s to spend more time iterating than waiting... From my experience using Fluent on a 16 node (32 cpu) cluster, you should have 100.000 to 200.000 cells in each partition to get decent parallel efficiency. I'm rather surprised it went so "well" for you! (We have a ordinary Gigabit ethernet; With a high-speed interconnect it would be better, but you still loose a lot.)

/Ola
olwi is offline   Reply With Quote

Old   October 13, 2006, 14:34
Default Hi, Xiaofeng, What computer
  #5
Senior Member
 
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 9
hsieh is on a distinguished road
Hi, Xiaofeng,

What computer you were running the parallel case?
I recently tested my cluster (2 dual CPU workstations + 2 dual core workstation). When I used all CPUs/cores, that is a totoal of 8, I got 45% - 50% efficiency (executionTime/ClockTime). When I used on 1 CPU (or 1 core) from each workstation, I got 65% - 70% efficiency. However, executionTime in the 4 CPU run is longer than the 8 CPU/core case. So, in real time, the 8 CPU run is still "slightly" faster than the 4 CPU run.

I was told that even a 70% efficiency is not good. Each workstation has 1 gigabit NIC connected to a Linksys gigabit switch (SD2008) which support non-blocking/Jumo Frames. I mgiht want to try out GAMMA. But, is there a way to improve efficiency without GAMMA? What is typical parallel efficiency people get? Any suggestion?

pei
hsieh is offline   Reply With Quote

Old   October 14, 2006, 07:07
Default Hi, I was looking at the be
  #6
Senior Member
 
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 9
hsieh is on a distinguished road
Hi,

I was looking at the benchmark results posted on the OpenFOAM wiki. I noticed that for the interFoam case (case #4) when ran on the Waltons cluster, the 3-CPU run and the 4-CPU run actually were 50% slower than the serial run (1 CPU). Is this real?

pei
hsieh is offline   Reply With Quote

Old   October 15, 2006, 16:06
Default Hi Pei! (about case #4 on the
  #7
Assistant Moderator
 
Bernhard Gschaider
Join Date: Mar 2009
Posts: 3,915
Rep Power: 40
gschaider will become famous soon enoughgschaider will become famous soon enough
Hi Pei!
(about case #4 on the Wiki)
Yep. I'm afraid so. The case is just too small (18MB according to the table on the top, don't know how many cells right now). If you look on the other small cases on that machine: they don't scale that good either. (partly the network on that machine can be blamed but not totally)
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request
gschaider is offline   Reply With Quote

Old   October 16, 2006, 15:36
Default Hi, Bernard, How is memory
  #8
Senior Member
 
Pei-Ying Hsieh
Join Date: Mar 2009
Posts: 317
Rep Power: 9
hsieh is on a distinguished road
Hi, Bernard,

How is memory determined? The case I am testing is about 1,158,000 hex cells. I am trying to find out what could be the cause(s) of the low executionTime/ClockTime ratio.

I ran a case on a dual core AMD workstation, the ratio between executionTime/ClockTime is about 1. However, the speed up is only about 1.3. This could be due to both cores accessing the same memory bus. I am hoping to improve the efficiency of the executionTime/ClockTime ratio. Any suggestion?

Pei
hsieh is offline   Reply With Quote

Old   October 17, 2006, 10:04
Default Hi Pi! @memory usage: For t
  #9
Assistant Moderator
 
Bernhard Gschaider
Join Date: Mar 2009
Posts: 3,915
Rep Power: 40
gschaider will become famous soon enoughgschaider will become famous soon enough
Hi Pi!

@memory usage: For the benchmark cases the memory usage was "measured" by getting the amount of residential memory every 5 seconds from the operating system and reporting the maximum value that occured durnig the benchmark.

In general I think the rule of thumb is approx 800bytes/cell (double precision). More if you use additional models


Bernhard
__________________
Note: I don't use "Friend"-feature on this forum out of principle. Ah. And by the way: I'm not on Facebook either. So don't be offended if I don't accept your invitation/friend request
gschaider is offline   Reply With Quote

Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
parallel performance ivandipia CFX 6 January 29, 2009 16:26
Performance of interFoam running in parallel hsieh OpenFOAM Running, Solving & CFD 8 September 14, 2006 09:15
ANSYS CFX 10.0 Parallel Performance for Windows XP Saturn CFX 4 August 13, 2006 12:27
Parallel Performance of Fluent Soheyl FLUENT 2 October 30, 2005 07:11
Parallel performance hsing OpenFOAM Running, Solving & CFD 16 August 30, 2005 14:38


All times are GMT -4. The time now is 06:33.