CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   Large test case for running OpenFoam in parallel (https://www.cfd-online.com/Forums/openfoam-solving/59488-large-test-case-running-openfoam-parallel.html)

fhy August 16, 2007 14:44

Hi,

I am testing the parallel feature of OpenFOAM. The test case I used is icoFoam/cavity. However, I did not observe any speedup. The execution times are:

sequential: 0.27 s
4 CPUs: 0.63 s
8 CPUs: 0.7 s

It might be because it is a small case.

I am wondering whether there are any large cases I can try. I see there are quite a few cases in the tutorials directory. Can someone suggest one that is large enough to test the parallel feature?

Or are there any publicly available OpenFOAM test cases for benchmarking purposes?

Thanks,
Huiyu

msrinath80 August 16, 2007 16:03

Just change the node density in constant/polyMesh/blockMeshDict. Increase the number of control volumes!
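For the icoFoam/cavity case that means editing the blocks entry in constant/polyMesh/blockMeshDict; a minimal sketch, assuming the stock single-block cavity mesh from the tutorial (the (100 100 1) resolution is just an example, not a recommended value):

Code:

// constant/polyMesh/blockMeshDict (cavity tutorial)
blocks
(
    // stock tutorial resolution is (20 20 1); raise it to add control volumes
    hex (0 1 2 3 4 5 6 7) (100 100 1) simpleGrading (1 1 1)
);

Remember to rerun blockMesh afterwards, and to shrink deltaT so the Courant number stays below 1.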

jens_klostermann August 17, 2007 03:18

Hi

We did some benchmarks based on the pyFoamBench cases from the wiki (thanks to Bernhard); see http://openfoamwiki.net/index.php/Be...ks_standard_v1

With OF-1.3, depending on the case, we got speedup up to 128 cores.

Jens

fhy August 17, 2007 14:29

Hi Srinath and Jens,

Thanks for your reply and suggestions!

I increased the node density from (20, 20, 1) to (100, 100, 1) and decreased deltaT to 0.001 to satisfy the CFL condition (for the icoFoam/cavity case). With the longer runtime, I did observe speedup. However, parallel efficiency dropped below 50% with 8+ processors.
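For reference, that deltaT choice follows from the Courant condition Co = U*deltaT/dx; a rough check, assuming the stock cavity geometry (0.1 m domain, 1 m/s lid velocity):

Code:

# stock mesh:   dx = 0.1/20  = 0.005 m, deltaT = 0.005 s  ->  Co = 1*0.005/0.005 = 1
# refined mesh: dx = 0.1/100 = 0.001 m, deltaT = 0.001 s  ->  Co = 1*0.001/0.001 = 1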

I checked the benchmark wiki page. Most parallel results only go up to 4 CPUs.

Jens, can you point me to some cases that scale to a large number of processors? I am looking for test cases that can scale to 64 CPUs.

I understand that scaling depends on the application as well as the system. That's why I am looking for cases that already show good scaling on other systems; I want to make sure the poor scaling on my local system is not due to the application.

Thanks,
Huiyu

paka August 17, 2007 14:50

In my case it is the rasInterFoam solver. The cluster I'm using allows me to use 8 nodes, each with 8 processors.

Frankly speaking, the more processors I use, the better the efficiency I obtain. For example, on my Mac G5 (single processor) the computation takes a couple of days. On the cluster, 4 nodes bring it down to about 1.5 hours, and 8 nodes to a bit more than 10 minutes.

Try that case.

Krystian

connclark August 17, 2007 17:13

When I was in school, a grad student was doing cluster performance experiments on VHDL chip simulations. He found that going from 1 to 2 to 4 nodes, the total compute time was reduced; he attributed it to more cache hits due to the smaller per-node data sets.

fhy August 20, 2007 18:05

Hi Krystian,

What test case did you run with rasInterFoam solver? Is it the default damBreak case in the tutorials? Did you change anything in the controlDict and blockMeshDict?

Did you use the damBreak case for interFoam in standardBench_v1.cfg ?

Thanks,
Huiyu

fhy August 20, 2007 18:22

Hi Jens and Bernhard,

I downloaded PyFoam-0.4.0, but did not find benchFoam.py under examples/. Where can I find it? Thanks.

I did find standardBench_v1.cfg under examples/data.
Based on it, I modified the interFoam/damBreak case. However, the sequential run only takes about 180 s to finish, which is significantly less than the baseline (1605.82 s) in standardBench_v1.cfg. Something must be wrong here; I just want to make sure I am running the correct benchmark.

The following is what I modified based on the configuration file; please let me know which steps are wrong.

Step 1: modify blockMeshDict, blocks section

blocks
(
hex (0 1 5 4 12 13 17 16) (46 16 1) simpleGrading (1 1 1)
hex (2 3 7 6 14 15 19 18) (38 16 1) simpleGrading (1 1 1)
hex (4 5 9 8 16 17 21 20) (46 84 1) simpleGrading (1 1 1)
hex (5 6 10 9 17 18 22 21) (8 84 1) simpleGrading (1 1 1)
hex (6 7 11 10 18 19 23 22) (38 84 1) simpleGrading (1 1 1)
);

Step 2: modify controlDict

endTime 0.5;

deltaT 0.0005;

writeControl adjustableRunTime;

writeInterval 0.1;

Step 3: generate the mesh
blockMesh . damBreak

Step 4: reset gamma
setFields . damBreak

Step 5: run it
interFoam . damBreak
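For the parallel runs this thread is about, the usual extra steps (using the same 1.4-era root/case command syntax as above) would look roughly like this; the subdomain count is only an example and needs a matching system/decomposeParDict:

Code:

decomposePar . damBreak                      # split the mesh into subdomains
mpirun -np 4 interFoam . damBreak -parallel  # run the solver in parallel
reconstructPar . damBreak                    # merge the processor results afterwards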

I am running OpenFOAM-1.4 on an AMD Opteron(tm) Processor 285, 2.6 GHz, 8 GB RAM, SLES 10.

Thanks,
Huiyu

jens_klostermann August 21, 2007 00:59

Hi Huiyu,

1. benchFoam.py is now pyFoamBench.py.
2. Suggested cases: oodles/pitzDaily and interFoam/damBreak should have good efficiency.

Jens

fhy August 21, 2007 14:54

Hi Jens,

Thanks for your suggestion.

On the benchmark wiki page, I saw you submitted a result on an Opti250 with OpenFOAM 1.2 standard. I am wondering whether you have tried the benchmark with OpenFOAM 1.4.

My problem is that I tried OpenFOAM 1.4 compiled with gcc 4.1.0 on a SLES 10, AMD Opteron(tm) Processor 285, 2.6 GHz, 8 GB RAM machine. The sequential interFoam/damBreak case (as in the benchmark v1 configuration) finishes in around 180 s.
Your submission is 588.91 s on an Opteron 250, 2.4 GHz, 4 GB RAM system. The big runtime difference cannot be explained by the hardware difference alone.

So I am wondering whether the different OpenFOAM versions contribute to the difference. However, it is still hard to believe that this accounts for all of the remaining difference. Is there a way to verify the result of the benchmark?

Huiyu

gschaider August 21, 2007 16:57

Hi Huiyu!

As far as I know, interFoam was rewritten in a major way from 1.3 to 1.4 (completely new algorithm). Probably this is the cause of the big difference.

Bernhard

fhy August 21, 2007 18:00

Hi Bernhard,

Thanks for the info!

I wonder how many solvers got rewritten from 1.2 to 1.4.

I just ran pitzDaily with oodles using the benchmark_v1 configuration and got 151.9 s (wall clock), while Jens's submission on an Opti252 dual-Opteron with OF 1.2 is 232.47 s.

Has anyone run the benchmark suite using OF1.4?

The reason I am so concerned about the runtime is that if it is too short, it won't be a good case for parallel runs. Although I can modify the case to make it run longer, there would then be no data on the benchmark wiki page to compare against.

Huiyu

jens_klostermann August 22, 2007 02:42

Hi Huiyu,

I just started benchmark_v1 again for the sequential interFoam/damBreak case (as in the benchmark v1 configuration) and got 168.6 s on the same machine I mentioned on the wiki. So this is quite a speedup!!

I will publish some more results on the wiki later this week.

If there is interest in the community in benchmarking, I am willing to share my experience!
Maybe we should collaborate and form some kind of benchmark group? I think pyFoamBench is a good starting point.

Jens

fhy August 22, 2007 13:48

Hi Jens,

Thanks a lot for testing!

I think it is a great idea to form a benchmark group! I am very interested in benchmarking the parallel features of OpenFOAM. Reading through some posts in this forum, I found that people are interested in benchmarking for different reasons, such as InfiniBand vs. GigE comparisons, procurement references, and so on.

I do appreciate the benchmark wiki page and PyFoamBench, and will update benchmarking results when I finish some tests.

The current benchmark suite is a good start. It may, however, be too small for parallel runs given the significant speedup from version updates.

Huiyu

lakeat August 27, 2009 03:18

Hi, I hope I am not too late to join the discussion.
It seems there was no follow-up on this after OpenFOAM-1.4.
Why?

lakeat August 27, 2009 04:45

My personal computer:
Case 9 (please correct me if I am wrong),
http://openfoamwiki.net/index.php/Be...ks_standard_v1
which means I have modified the case according to the standardBench_v2.cfg file.

Version: OpenFOAM-1.6
Case: tutorials/incompressible/pisoFoam/pitzDaily
Application: pisoFoam
MPI: openmpi
I use the precompiled one from the official release.

SUSE LINUX
Release 11.2
Gnome 2.26.2

Memory 3.9G
Processor 0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

That is 4 cores.

Code:

NP    Time (s)    Speedup    Speedup vs. baseline (880 s)
1     105          1.000       8.381
2      55          1.909      16.000
4      34          3.088      25.882


Any ideas?

madad2005 August 27, 2009 04:57

That seems normal to me.

You'll never get 4 times speedup by using 4 cores, especially with quad-cores. All four cores are on one processor (or two dual-cores stuck together, as the Kentsfield quad-cores are), and that single processor accesses its RAM via a single front-side bus (FSB). The FSB allows the passage of information between the CPU cache and the RAM itself.

Now, if four separate cores are trying to access the RAM, you are going to be limited by your FSB speed (not sure what the baseline value is). If you had a dual-CPU motherboard with two dual-cores and the same core-to-RAM ratio, then you would see improved performance, as the FSB limit would be less of a bottleneck.

I have the same CPU at home, and one way to improve these numbers (without spending any money) is overclocking, but this can be hazardous if not done properly. Nevertheless, you would then be able to increase the FSB speed (while keeping the same CPU speed) and should see an increase in performance.

lakeat August 27, 2009 06:09

No, that's my only CPU; you are not going to persuade me to overclock. I will be burnt out if it gets burnt out.

Quote:

You'll never get 4 times speedup by using 4 cores,
Google the following paper and you'll see the superlinear behaviour.
And I want that!

Super-linear speed-up of a parallel multigrid Navier-Stokes solver on Flosolver

lakeat August 30, 2009 22:06

Has anyone experienced very good scalability using OpenFOAM, like superlinear speedup?

What is the best report of scalability using OpenFOAM so far?

According to Amdahl's law, the maximum speedup is restricted by the fraction of the code that cannot be parallelized, so I would very much like to know what this fraction is for OpenFOAM. Thank you!
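For reference, Amdahl's law written out, with a purely illustrative serial fraction (not a measured OpenFOAM figure):

Code:

# Amdahl's law: S(N) = 1 / ((1 - p) + p/N), p = parallelizable fraction, N = cores
# example: p = 0.95, N = 64  ->  S = 1 / (0.05 + 0.95/64) ~ 15.4
# i.e. even with 95% of the work parallelized, 64 cores give at most ~15x speedup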

carsten September 22, 2009 04:13

Hi lakeat!

Speedup is not very easy to define. It is not only a property of the program code (as you assume) but also a property of the machine you're using and of the test case.

Amdahl's law is not really applicable here. OpenFOAM is based on the idea of domain partitioning and should not (or only to a negligible extent) suffer from Amdahl's law.

For single-core clusters, scaling is mainly limited by the speed of the network between the nodes (latency being the culprit). The smaller the partitioned domains, the worse the impact.

For multicore systems, scaling is additionally hindered by the transfer of data from the cores to main memory. Multiple cores on a single CPU all have to share a bus to the memory, and this can hurt execution speed badly.

Furthermore, the size of the problem is relevant. Small meshes only run well on a small number of CPUs, bigger meshes on larger numbers. There is usually a "sweet spot" (cells/core) where a code performs best. This depends on the machine you're using (interconnects, cache sizes, cores per CPU, bus system, ...).

For our machine (an HP Xeon cluster with Gigabit Ethernet), the code performs best at about 50,000 cells/core. Setting 4 cores as the reference (speedup 4), we see a superlinear speedup of 42 on 32 cores for a test case with 1.6 million cells. This is mainly due to cache effects, i.e. a lot of the data fits into the CPU caches at this number of cores. For even larger numbers of cores the speedup is poorer (56 for 64 cores, 55!!! for 128 cores). This is due to the poor interconnects.
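A quick back-of-the-envelope check of those numbers (plain shell arithmetic, assuming the cells divide evenly):

Code:

echo $(( 1600000 / 32 ))     # 50000 cells/core -> right at the quoted sweet spot
echo $(( 1600000 / 128 ))    # 12500 cells/core -> too little work per core, communication dominates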

Hope this helps.

Bye,

Carsten Thorenz

lakeat September 22, 2009 04:28

Yes, it's helpful, thank you, Carsten!
I found that CPU caches are of great importance for speed.
Can you share your CPU info, Carsten? Thanks a lot.

lakeat September 22, 2009 05:18

Quote:

Originally Posted by stidulsis (Post 230093)
I am glad to find your site - now I know what a good one looks like.
Very good topic to share with us. Great info.:o
** spam link removed **

SPAM?????:confused::eek:

7islands September 22, 2009 11:13

You can report the spam by following this guide (click the Report Post icon at the left of each post, which I have already done this time).

T

puneet336 April 6, 2019 09:55

Quote:

Originally Posted by fhy (Post 198900)
I increased the node density from (20, 20, 1) to (100, 100, 1) and decreased deltaT to 0.001 to satisfy the CFL condition (for the icoFoam/cavity case). With the longer runtime, I did observe speedup. However, parallel efficiency dropped below 50% with 8+ processors. [...]


Hi,
I tried your setting with (100, 100, 1) and deltaT 0.001, and it worked.
However, I am looking for a longer benchmark, (1000, 1000, 1), in which the increasing Courant number seems to be causing a floating point exception.

Could you please advise here?
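For what it's worth, keeping the Courant number at the earlier level when refining to (1000, 1000, 1) would mean scaling deltaT down with the cell size; a minimal controlDict sketch, assuming the stock 0.1 m cavity with a 1 m/s lid velocity:

Code:

// dx = 0.1/1000 = 1e-4 m; Co = U*deltaT/dx <= 1 with U = 1 m/s requires
deltaT          0.0001;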

