CFD Online Discussion Forums - Parallel performance

liu	August 29, 2006 22:02

I have a case of about 1M cells, which I run in parallel on 32 partitions (16 nodes x 2 cores = 32).
I want to simulate a process that takes about 1800 s in reality.
After 12 hours on the supercomputer, the run has only covered about 600 s of simulated time.

One thing I noticed is the time information in the log file:
at time step n:
ExecutionTime = 25185.6 s ClockTime = 43755 s
at time step n+1:
ExecutionTime = 25194.1 s ClockTime = 43769 s

The ClockTime is almost twice the ExecutionTime. Does ExecutionTime mean CPU time, and ClockTime mean CPU time plus the communication time between nodes? Does that mean the program spends a lot of its time just waiting for data transfer?

liu	August 29, 2006 22:07

For time step n+1:
ExecutionTime = 8.5 s, ClockTime = 14 s
So about 5.5 s is spent on communication?
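
A minimal sketch of the per-step arithmetic above, assuming the log lines have exactly the `ExecutionTime = ... s ClockTime = ... s` layout quoted in the first post. The function names (`parse_times`, `step_deltas`) are made up for illustration and are not part of OpenFOAM.

```python
import re

# Matches lines such as: "ExecutionTime = 25185.6 s ClockTime = 43755 s"
# (layout assumed from the log excerpt quoted in the post).
TIME_RE = re.compile(
    r"ExecutionTime\s*=\s*([\d.]+)\s*s\s+ClockTime\s*=\s*([\d.]+)\s*s"
)


def parse_times(log_text):
    """Return a list of (execution_time, clock_time) pairs found in the log."""
    return [(float(e), float(c)) for e, c in TIME_RE.findall(log_text)]


def step_deltas(times):
    """For consecutive log entries, return (d_exec, d_clock, d_exec / d_clock)."""
    deltas = []
    for (e0, c0), (e1, c1) in zip(times, times[1:]):
        d_exec = e1 - e0
        d_clock = c1 - c0
        ratio = d_exec / d_clock if d_clock > 0 else float("nan")
        deltas.append((d_exec, d_clock, ratio))
    return deltas


sample_log = """\
ExecutionTime = 25185.6 s ClockTime = 43755 s
ExecutionTime = 25194.1 s ClockTime = 43769 s
"""

times = parse_times(sample_log)
deltas = step_deltas(times)
for d_exec, d_clock, ratio in deltas:
    print(f"step: exec {d_exec:.1f} s, clock {d_clock:.1f} s, "
          f"exec/clock {ratio:.2f}, not computing {d_clock - d_exec:.1f} s")
```

For the two quoted lines this gives 8.5 s of execution time in 14 s of wall-clock time, a ratio of about 0.61. The remaining 5.5 s is time not spent computing; the log alone does not show how much of it is communication and how much is other waiting.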

mattijs	August 30, 2006 03:16

The 'missing' time is probably spent waiting for communication. This comes from imperfect load balancing plus the raw communication time and latency. What interconnect do you have?

olwi	August 30, 2006 07:00

It's a common problem: you need a lot of cells in each partition for the CPUs to spend more time iterating than waiting. From my experience using Fluent on a 16-node (32-CPU) cluster, you should have 100,000 to 200,000 cells in each partition to get decent parallel efficiency. I'm rather surprised it went so "well" for you! (We have ordinary Gigabit Ethernet. With a high-speed interconnect it would be better, but you still lose a lot.)

/Ola
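
To put numbers on that rule of thumb for the 1M-cell, 32-partition case from the first post, here is a small sketch. The 100,000 to 200,000 cells-per-partition range is Ola's experience with Fluent on Gigabit Ethernet, not an OpenFOAM guideline, and the function names are hypothetical.

```python
def cells_per_partition(total_cells, partitions):
    """Average number of cells each partition holds."""
    return total_cells / partitions


def suggested_partitions(total_cells, min_cells, max_cells):
    """Partition counts that keep the per-partition load within [min_cells, max_cells].

    Returns (fewest, most): the fewest partitions put max_cells on each,
    the most partitions put min_cells on each.
    """
    fewest = max(1, total_cells // max_cells)
    most = max(1, total_cells // min_cells)
    return fewest, most


total_cells = 1_000_000   # "about 1M cells" from the first post
partitions = 32           # 16 nodes x 2 cores

load = cells_per_partition(total_cells, partitions)
fewest, most = suggested_partitions(total_cells, 100_000, 200_000)

print(f"{load:.0f} cells per partition on {partitions} partitions")
print(f"rule of thumb suggests {fewest} to {most} partitions")
```

With these inputs each partition holds 31,250 cells, well below the quoted range, and the rule of thumb would point to roughly 5 to 10 partitions on that kind of network.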

hsieh	October 13, 2006 14:34

Hi, Xiaofeng,

What computer were you running the parallel case on?
I recently tested my cluster (2 dual-CPU workstations + 2 dual-core workstations). When I used all CPUs/cores, a total of 8, I got 45% - 50% efficiency (executionTime/ClockTime). When I used only 1 CPU (or 1 core) from each workstation, I got 65% - 70% efficiency. However, the executionTime in the 4-CPU run is longer than in the 8-CPU/core case, so in real time the 8-CPU run is still "slightly" faster than the 4-CPU run.

I was told that even 70% efficiency is not good. Each workstation has one gigabit NIC connected to a Linksys gigabit switch (SD2008), which supports non-blocking operation and jumbo frames. I might want to try out GAMMA. But is there a way to improve efficiency without GAMMA? What parallel efficiency do people typically get? Any suggestions?

pei

hsieh	October 14, 2006 07:07

Hi,

I was looking at the benchmark results posted on the OpenFOAM wiki. I noticed that for the interFoam case (case #4) run on the Waltons cluster, the 3-CPU and 4-CPU runs were actually 50% slower than the serial (1-CPU) run. Is this real?

pei

gschaider	October 15, 2006 16:06

Hi Pei!
(about case #4 on the Wiki)
Yep, I'm afraid so. The case is just too small (18 MB according to the table at the top; I don't know how many cells right now). If you look at the other small cases on that machine, they don't scale that well either. (The network on that machine is partly to blame, but not entirely.)

hsieh	October 16, 2006 15:36

Hi, Bernhard,

How is the memory figure determined? The case I am testing has about 1,158,000 hex cells. I am trying to find out what could be causing the low executionTime/ClockTime ratio.

I ran a case on a dual-core AMD workstation, and the executionTime/ClockTime ratio is about 1. However, the speedup is only about 1.3. This could be due to both cores sharing the same memory bus. I am hoping to improve the executionTime/ClockTime ratio. Any suggestions?

Pei
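
The thread uses two different measures: the executionTime/ClockTime ratio within one run, and the speedup of a parallel run over a serial one. A rough sketch of the second, using the usual definitions (speedup = serial wall time / parallel wall time, parallel efficiency = speedup / number of cores). Only the 1.3 speedup on 2 cores comes from the post; the function names are illustrative.

```python
def speedup(serial_wall_time, parallel_wall_time):
    """Classical speedup: how many times faster the parallel run finishes."""
    return serial_wall_time / parallel_wall_time


def parallel_efficiency(speedup_value, n_cores):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup_value / n_cores


# Figures from the post: speedup of about 1.3 on a dual-core workstation,
# even though executionTime/ClockTime is about 1 in that run.
dual_core_efficiency = parallel_efficiency(1.3, 2)
print(f"parallel efficiency on 2 cores: {dual_core_efficiency:.0%}")
```

This shows why a ratio near 1 does not guarantee good scaling: time spent stalled on a shared memory bus still counts as execution time, so it lowers the speedup without lowering executionTime/ClockTime.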

gschaider	October 17, 2006 10:04

Hi Pei!

@memory usage: For the benchmark cases the memory usage was "measured" by getting the resident memory size every 5 seconds from the operating system and reporting the maximum value that occurred during the benchmark.

In general I think the rule of thumb is approx. 800 bytes/cell (double precision), and more if you use additional models.

Bernhard
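
Applying that rule of thumb to the 1,158,000-cell case mentioned earlier in the thread gives a rough lower bound. This is a sketch only: the 800 bytes/cell figure is the approximate value quoted above, additional models would raise it, and the function name is hypothetical.

```python
BYTES_PER_CELL = 800  # rule-of-thumb value from the post (double precision)


def estimate_memory_bytes(n_cells, bytes_per_cell=BYTES_PER_CELL):
    """Rough total memory estimate for a case with n_cells cells."""
    return n_cells * bytes_per_cell


n_cells = 1_158_000  # hex-cell count quoted earlier in the thread
total_bytes = estimate_memory_bytes(n_cells)
total_mib = total_bytes / 2**20

print(f"about {total_bytes / 1e6:.0f} MB ({total_mib:.0f} MiB) in total")
print(f"about {total_bytes / 8 / 1e6:.0f} MB per process if split evenly over 8")
```

The even split over 8 processes ignores the extra storage each partition needs for its processor boundaries, so the per-process figure is an underestimate.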
