I have a case of about 1M cells. I run it in parallel on 32 partitions (16 nodes x 2 cores = 32).
I want to simulate a process that takes about 1800 s in reality. I ran the case on the supercomputer for 12 hours, and it only simulated about 600 s. One thing I noticed is the timing information in the log file:
at time step n: ExecutionTime = 25185.6 s ClockTime = 43755 s
at time step n+1: ExecutionTime = 25194.1 s ClockTime = 43769 s
The clock time is almost twice the execution time. Does execution time mean CPU time, and clock time mean CPU time plus communication time between nodes? Does that mean the program spent a lot of time just waiting for data transfer? |
For the step from n to n+1, the increments are:
ExecutionTime = 8.5 s, ClockTime = 14 s. So about 5.5 s is spent on communication? |
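The per-step numbers above can be reproduced from the log with a short script. This is only a sketch: it assumes the log lines look exactly like the two quoted in the first post, and the function names are made up for illustration. The ratio it prints is the execution/clock ratio for one step, not a full measure of parallel efficiency.

```python
import re

# Log excerpt quoted in the first post (two consecutive time steps).
LOG = """\
ExecutionTime = 25185.6 s  ClockTime = 43755 s
ExecutionTime = 25194.1 s  ClockTime = 43769 s
"""

# Assumed line layout: "ExecutionTime = <x> s  ClockTime = <y> s"
PATTERN = re.compile(
    r"ExecutionTime\s*=\s*([\d.]+)\s*s\s+ClockTime\s*=\s*([\d.]+)\s*s"
)


def parse_times(text):
    """Return a list of (execution_time, clock_time) pairs in seconds."""
    return [(float(e), float(c)) for e, c in PATTERN.findall(text)]


def step_ratios(samples):
    """For each pair of consecutive samples, return
    (delta_exec, delta_clock, delta_exec / delta_clock)."""
    out = []
    for (e0, c0), (e1, c1) in zip(samples, samples[1:]):
        d_exec = e1 - e0
        d_clock = c1 - c0
        if d_clock <= 0:
            continue  # skip steps where the clock did not advance
        out.append((d_exec, d_clock, d_exec / d_clock))
    return out


samples = parse_times(LOG)
ratios = step_ratios(samples)
for d_exec, d_clock, ratio in ratios:
    waiting = d_clock - d_exec
    print(f"exec {d_exec:.1f} s, clock {d_clock:.1f} s, "
          f"not computing {waiting:.1f} s, ratio {ratio:.2f}")
```

Averaging the ratio over many steps rather than one would give a more reliable picture, since a single step can be distorted by file output or other activity on the nodes.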
The 'missing' time is probably spent waiting for communication. It comes from imperfect load balancing between the partitions plus the raw communication time and latency. What interconnect do you have?
|
It's a common problem: you need a lot of cells in each partition for the CPUs to spend more time iterating than waiting. From my experience using Fluent on a 16-node (32 CPU) cluster, you should have 100,000 to 200,000 cells in each partition to get decent parallel efficiency. I'm rather surprised it went so "well" for you! (We have ordinary Gigabit Ethernet. With a high-speed interconnect it would be better, but you still lose a lot.)
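To put numbers on that rule of thumb for the case in the first post, here is a small sketch. The 100,000 to 200,000 cells-per-partition range is Ola's experience with Fluent on Gigabit Ethernet, not a general law, and the helper names are hypothetical.

```python
import math


def cells_per_partition(total_cells, partitions):
    """Average number of cells each partition gets."""
    return total_cells / partitions


def partition_range(total_cells, min_cells=100_000, max_cells=200_000):
    """Return (fewest, most) partitions that keep the average load
    between min_cells and max_cells per partition.
    The default bounds are the rule of thumb quoted above (an assumption)."""
    fewest = max(1, math.ceil(total_cells / max_cells))
    most = max(1, total_cells // min_cells)
    return fewest, most


total = 1_000_000  # "about 1M cells" from the first post
print(cells_per_partition(total, 32))  # average load on 32 partitions
print(partition_range(total))          # partition counts matching the rule
```

For about 1M cells on 32 partitions the average is 31,250 cells per partition, well below the quoted range, which by this rule would suggest something like 5 to 10 partitions on that kind of network.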
/Ola |
Hi, Xiaofeng,
What computer were you running the parallel case on? I recently tested my cluster (2 dual-CPU workstations + 2 dual-core workstations). When I used all CPUs/cores, that is a total of 8, I got 45% - 50% efficiency (executionTime/ClockTime). When I used only 1 CPU (or 1 core) from each workstation, I got 65% - 70% efficiency. However, executionTime in the 4 CPU run is longer than in the 8 CPU/core case, so in real time the 8 CPU run is still "slightly" faster than the 4 CPU run. I was told that even 70% efficiency is not good. Each workstation has one gigabit NIC connected to a Linksys gigabit switch (SD2008), which supports non-blocking operation and jumbo frames. I might want to try out GAMMA. But is there a way to improve efficiency without GAMMA? What parallel efficiency do people typically get? Any suggestions? pei |
Hi,
I was looking at the benchmark results posted on the OpenFOAM wiki. I noticed that for the interFoam case (case #4), when run on the Waltons cluster, the 3-CPU and 4-CPU runs were actually 50% slower than the serial run (1 CPU). Is this real? pei |
Hi Pei!
(about case #4 on the Wiki) Yep, I'm afraid so. The case is just too small (18MB according to the table at the top; I don't know how many cells right now). If you look at the other small cases on that machine, they don't scale that well either. (The network on that machine can be blamed in part, but not entirely.) |
Hi, Bernhard,
How is memory determined? The case I am testing has about 1,158,000 hex cells. I am trying to find out what could be causing the low executionTime/ClockTime ratio. I ran a case on a dual-core AMD workstation, and the executionTime/ClockTime ratio is about 1. However, the speedup is only about 1.3. This could be due to both cores accessing the same memory bus. I am hoping to improve the executionTime/ClockTime ratio. Any suggestions? Pei |
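The dual-core result shows that executionTime/ClockTime and parallel efficiency are different quantities: the first says how much of the wall-clock time a process was busy, the second compares against the serial run. A minimal sketch using the usual definitions (speedup = serial time / parallel time, efficiency = speedup / number of cores); the function names are hypothetical and only the speedup of 1.3 on 2 cores comes from the post.

```python
def speedup(serial_time, parallel_time):
    """Speedup as serial wall-clock time divided by parallel wall-clock time."""
    return serial_time / parallel_time


def parallel_efficiency(speedup_value, n_cores):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup_value / n_cores


# Speedup of about 1.3 on 2 cores, as reported in the post above.
eff = parallel_efficiency(1.3, 2)
print(f"parallel efficiency: {eff:.0%}")
```

So a ratio of about 1 with a speedup of 1.3 means the cores were busy nearly all the time but each did its share of the work more slowly than in the serial run, which fits the shared memory bus explanation.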
Hi Pei!
@memory usage: For the benchmark cases the memory usage was "measured" by getting the amount of resident memory from the operating system every 5 seconds and reporting the maximum value that occurred during the benchmark. In general I think the rule of thumb is approx. 800 bytes/cell (double precision), more if you use additional models. Bernhard |
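Applied to Pei's mesh, the rule of thumb works out as follows. This is only a rough sketch: the 800 bytes/cell figure is the approximate double-precision value quoted above, additional models would raise it, and the function name is made up for illustration.

```python
BYTES_PER_CELL = 800  # rule-of-thumb value quoted above (double precision)


def estimate_memory_bytes(n_cells, bytes_per_cell=BYTES_PER_CELL):
    """Rough total memory estimate for a mesh of n_cells cells."""
    return n_cells * bytes_per_cell


cells = 1_158_000  # Pei's hex mesh
total_bytes = estimate_memory_bytes(cells)
print(f"total: {total_bytes / 1e6:.1f} MB")

# Average share per process if the mesh is split evenly over 8 cores;
# duplicated boundary data and per-process overhead are ignored here.
print(f"per process on 8 cores: {total_bytes / 8 / 1e6:.1f} MB")
```

That gives roughly 0.93 GB for the whole case, small enough that memory capacity is unlikely to be the limit on these workstations.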