Hi,
I am testing the parallel feature of OpenFOAM. The test case I used is icoFoam/cavity. However, I did not observe any speedup. The execution times are:
4 CPUs: 0.63 s
8 CPUs: 0.7 s
It might be because it is a small case.
I am wondering whether there are any large cases I can try. I see there are quite a few cases in the tutorials directory. Can someone suggest one that is large enough to test the parallel feature?
Or are there any publicly available OpenFOAM test cases for benchmarking purposes?
Just change the node density in constant/polyMesh/blockMeshDict. Increase the number of control volumes!
Hi,
We did some benchmarks based on the pyFoamBench cases from the wiki (thanks to Bernhard); see http://openfoamwiki.net/index.php/Be...ks_standard_v1
With OF-1.3, depending on the case, we saw the speedup continue up to 128 cores.
Hi Srinath and Jens,
Thanks for your reply and suggestions!
I increased the node density from (20 20 1) to (100 100 1) and decreased deltaT to 0.001 to satisfy the CFL condition (for the icoFoam/cavity case). With the longer runtime, I did observe speedup. However, parallel efficiency dropped below 50% with 8 or more processors.
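A minimal Python sketch of the CFL reasoning behind that deltaT change. It assumes a uniform mesh on a 0.1 m square cavity with a 1 m/s lid velocity and an original deltaT of 0.005; these values are believed to match the stock cavity tutorial but are not stated in this thread, and the function names are made up for illustration.

```python
# Courant number estimate for a uniform mesh: Co = U * deltaT / dx.
# ASSUMPTIONS (not from the thread): 0.1 m cavity, 1 m/s lid velocity,
# original deltaT of 0.005 s.

LID_VELOCITY = 1.0    # m/s, assumed
DOMAIN_LENGTH = 0.1   # m, assumed


def cell_size(n_cells, length=DOMAIN_LENGTH):
    """Edge length of one cell when `length` is split into n_cells equal cells."""
    return length / n_cells


def courant_number(velocity, delta_t, dx):
    """Courant number for a given velocity, time step and cell size."""
    return velocity * delta_t / dx


if __name__ == "__main__":
    # (cells per direction, deltaT): original mesh, refined mesh with the
    # old time step, refined mesh with the reduced time step.
    for n, dt in [(20, 0.005), (100, 0.005), (100, 0.001)]:
        co = courant_number(LID_VELOCITY, dt, cell_size(n))
        print(f"{n} x {n} cells, deltaT = {dt}: Co = {co:.2f}")
```

Under these assumptions, refining the mesh five times in each direction without touching deltaT raises the Courant number fivefold, which is why the time step had to drop by the same factor.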
I checked the benchmark wiki page. Most parallel results only go up to 4 CPUs.
Jens, can you point me to some cases that scale to a large number of processors? I am looking for test cases that can scale to 64 CPUs.
I understand that scaling depends on the application as well as the system. That's why I am looking for cases that have already shown good scaling on other systems. I want to make sure the bad scaling on my local system is not due to the application.
In my case it is the rasInterFoam solver. The cluster I'm using allows me to use 8 nodes, each with 8 processors.
Frankly speaking, the more processors I use, the better the efficiency I obtain. For example, on my Mac G5 with a single processor, the computation takes about a couple of days. On the cluster, 4 nodes take 1.5 h, and 8 nodes take a bit more than 10 minutes.
Try that case.
When I was in school, a grad student was doing cluster performance experiments on VHDL chip simulations. He found that going from 1 to 2 to 4 nodes reduced the total compute time. He attributed it to more cache hits due to smaller data sets.
Hi Krystian,
What test case did you run with rasInterFoam solver? Is it the default damBreak case in the tutorials? Did you change anything in the controlDict and blockMeshDict?
Did you use the damBreak case for interFoam in standardBench_v1.cfg?
Hi Jens and Bernhard,
I downloaded PyFoam-0.4.0, but did not find benchFoam.py under examples/. Where can I find it? Thanks.
I did find standardBench_v1.cfg under examples/data.
Based on it, I modified the interFoam/damBreak case. However, the sequential run only takes about 180s to finish, which is significantly less than the baseline (1605.82s) in standardBench_v1.cfg. Something must be wrong here. I just want to make sure I am running the correct benchmark.
Below are the modifications I made based on the configuration file; please let me know which steps are wrong.
Step 1: Modify blockMeshDict, blocks section
hex (0 1 5 4 12 13 17 16) (46 16 1) simpleGrading (1 1 1)
hex (2 3 7 6 14 15 19 18) (38 16 1) simpleGrading (1 1 1)
hex (4 5 9 8 16 17 21 20) (46 84 1) simpleGrading (1 1 1)
hex (5 6 10 9 17 18 22 21) (8 84 1) simpleGrading (1 1 1)
hex (6 7 11 10 18 19 23 22) (38 84 1) simpleGrading (1 1 1)
Step 2: Modify controlDict
Step 3: Generate the mesh
blockMesh . damBreak
Step 4: Reset gamma
setFields . damBreak
Step 5: Run it
interFoam . damBreak
I am running OpenFOAM-1.4 on an AMD Opteron(tm) Processor 285, 2.6 GHz, 8 GB RAM, SLES 10.
Hi Huiyu,
1. benchFoam.py is now pyFoamBench.py
2. Suggestions for cases: oodles/pitzDaily and interFoam/damBreak should scale with good efficiency
Hi Jens,
Thanks for your suggestion.
On the benchmark wiki page, I saw that you submitted a result on Opti250 with standard OpenFOAM 1.2. I am wondering whether you have tried the benchmark with OpenFOAM 1.4.
My problem is that I tried OpenFOAM 1.4 compiled with gcc 4.1.0 on a SLES 10 machine with an AMD Opteron(tm) Processor 285, 2.6 GHz, 8 GB RAM. The sequential interFoam/damBreak run (as in the benchmark v1 configuration) finishes in around 180s.
Your submission is 588.91s on an Opteron 250, 2.4 GHz, 4 GB RAM system. Such a big runtime difference cannot be explained by the difference between the systems.
So I am wondering whether the different OpenFOAM versions contribute to the difference. However, it is still hard to believe that this accounts for all of the remaining difference. Is there a way to verify the result of the benchmark?
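To put rough numbers on that comparison, here is a small Python sketch. It makes the crude assumption that runtime scales inversely with clock frequency and ignores memory, compiler and OS differences; the function and constant names are illustrative only.

```python
# How much of the runtime gap could clock speed alone explain?
# ASSUMPTION: runtime is inversely proportional to clock frequency.

WIKI_RUNTIME_S = 588.91   # Opteron 250 submission quoted above
WIKI_CLOCK_GHZ = 2.4
LOCAL_RUNTIME_S = 180.0   # "around 180s" on the Opteron 285
LOCAL_CLOCK_GHZ = 2.6


def expected_runtime(reference_runtime_s, reference_clock_ghz, target_clock_ghz):
    """Runtime predicted on the target CPU if only the clock speed differed."""
    return reference_runtime_s * reference_clock_ghz / target_clock_ghz


if __name__ == "__main__":
    predicted = expected_runtime(WIKI_RUNTIME_S, WIKI_CLOCK_GHZ, LOCAL_CLOCK_GHZ)
    print(f"Predicted from clock speed alone: {predicted:.1f} s")
    print(f"Measured: {LOCAL_RUNTIME_S:.1f} s")
    print(f"Factor left unexplained: {predicted / LOCAL_RUNTIME_S:.2f}")
```

Under that assumption the clock difference accounts for well under 10% of the gap, so most of the difference has to come from somewhere else.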
Hi Huiyu!
As far as I know, interFoam was rewritten in a major way from 1.3 to 1.4 (completely new algorithm). This is probably the cause of the big difference.
Hi Bernhard,
Thanks for the info!
I wonder how many solvers got rewritten from 1.2 to 1.4.
I just ran pitzDaily with oodles using the benchmark_v1 configuration and got 151.9s (wall clock), while Jens's submission on the Opti252 dual Opteron with OF1.2 is 232.47s.
Has anyone run the benchmark suite using OF1.4?
The reason I am so concerned about the runtime is that if it is too short, the case won't be a good one for parallel runs. Although I can modify the case to make it run longer, there would then be no data on the benchmark wiki page to compare against.
Hi Huiyu,
I just ran benchmark_v1 again for the sequential interFoam/damBreak case (as in the benchmark v1 configuration) and got 168.6 s on the same machine I mentioned on the wiki. So this is quite a speedup!
I will publish some more results on the wiki later this week.
If there is interest in the community in benchmarking, I am willing to share my experience!
Maybe we should collaborate and form some kind of benchmark group? I think pyFoamBench is a good starting point.
Hi Jens,
Thanks a lot for testing!
I think it is a great idea to form a benchmark group! I am very interested in benchmarking the parallel features of OpenFOAM. Reading through some posts in this forum, I found that people are interested in benchmarking for different reasons, such as InfiniBand vs GigE comparison, procurement reference and so on.
I do appreciate the benchmark wiki page and pyFoamBench, and will add benchmarking results when I finish some tests.
The current benchmark suite is a good start. However, given the significant speedup from the version updates, it may now be too small for parallel runs.
Hi, I hope I am not too late to join the discussion.
It seems there has been no follow-up on this topic since OpenFOAM-1.4.
My personal computer:
Case 9 (please correct me if I am wrong),
which means I modified the case according to the standardBench_v2.cfg file.
I use the precompiled version from the official release.
Processor 0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 1: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 2: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Processor 3: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
That is 4 cores.
That seems normal to me.
You'll never get 4 times speedup by using 4 cores, especially with quad-cores. All four cores are on one processor (or two dual-cores stuck together like the Kentsfield quad-cores are) and that single processor has access to its RAM via a single Front-Side Bus. The FSB allows the passage of information between the cpu cache and the RAM itself.
Now, if four separate cores are trying to access the RAM, you are going to be limited by your FSB speed (not sure what the baseline value is). If you had a dual-CPU motherboard with two dual cores and the same core-to-RAM ratio, then you would see improved performance, as the FSB limit would be less restrictive.
I have the same CPU at home, and one way to improve these numbers (without spending any money) is overclocking, but this can be hazardous if not done properly. Nevertheless, you would then be able to increase the FSB speed (while keeping the same CPU speed) and you should see an increase in performance.
No, that's my only CPU; you are not going to persuade me to overclock it. I will be burnt out if it burns out.
And I want that!
Super-linear speed-up of a parallel multigrid Navier-Stokes solver on Flosolver
Has anyone experienced very good scalability with OpenFOAM, such as superlinear speedup?
What is the best report of scalability using OpenFOAM so far?
According to Amdahl's law, the maximum speedup is limited by the fraction of the code that cannot be parallelized, so I am eager to know what this fraction is for OpenFOAM. Thank you!
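For reference, Amdahl's law as mentioned above can be sketched in a few lines of Python. The serial fraction for OpenFOAM is not known from this thread; the 5% used below is purely hypothetical, and the function names are illustrative.

```python
# Amdahl's law: speedup on n processors when a fixed fraction of the
# runtime cannot be parallelized.
#   S(n) = 1 / (s + (1 - s) / n),   S(n) -> 1 / s as n grows.


def amdahl_speedup(serial_fraction, n_procs):
    """Ideal speedup on n_procs processors for a given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def amdahl_limit(serial_fraction):
    """Upper bound on the speedup as the processor count goes to infinity."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    s = 0.05  # HYPOTHETICAL serial fraction, not a measured OpenFOAM value
    for n in (4, 8, 64, 128):
        print(f"{n:4d} processors: speedup {amdahl_speedup(s, n):.2f}")
    print(f"limit: {amdahl_limit(s):.1f}")
```

Note that the formula only models the serial part of the code; it says nothing about communication latency, memory bandwidth or cache effects.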
Speedup is not very easy to define. It is not only a property of the program code (as you assume) but also a property of the machine you're using and of the test case.
Amdahl's law is not really applicable here. OpenFOAM is based on the idea of domain partitioning and should not suffer from Amdahl's law (or only to a negligible extent).
For single-core clusters, scaling is mainly limited by the speed of the network between the nodes (latency being the culprit). The smaller the partitioned domains, the worse the impact.
For multicore systems, scaling is additionally hindered by the transfer of data between the cores and main memory. Multiple cores on a single CPU all have to share a bus to the memory, and this can hurt execution speed badly.
Furthermore, the size of the problem is relevant. Small meshes only run well on a small number of CPUs, bigger meshes on larger numbers. There is usually a "sweet spot" (cells/core) where a code performs best. This depends on the machine you're using (interconnects, cache sizes, cores per CPU, bus system, ...).
For our machine (HP Xeon cluster with Gigabit Ethernet) the code performs best with 50000 cells/core. Setting 4 cores as the reference (speedup 4), we see a superlinear speedup of 42 on 32 cores for a test case with 1.6 million cells. This is mainly due to cache effects, i.e. a lot of the data fits into the CPU caches for this number of cores. For even larger numbers of cores, the speedup is poorer (56 for 64 cores, 55!!! for 128 cores). This is due to the poor interconnects.
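The figures above can be turned into parallel efficiency and cells per core with a short Python sketch. It uses only the numbers quoted in this post and assumes the mesh is split evenly across cores; the names are illustrative.

```python
# Parallel efficiency (speedup / cores) and cells per core for the
# quoted 1.6 million cell case, with 4 cores as the reference.
# ASSUMPTION: the mesh is decomposed evenly across cores.

TOTAL_CELLS = 1_600_000
REPORTED_SPEEDUP = {4: 4.0, 32: 42.0, 64: 56.0, 128: 55.0}  # cores -> speedup


def efficiency(speedup, cores):
    """Parallel efficiency; values above 1.0 indicate superlinear speedup."""
    return speedup / cores


def cells_per_core(total_cells, cores):
    """Average number of cells handled by each core."""
    return total_cells / cores


if __name__ == "__main__":
    for cores, speedup in REPORTED_SPEEDUP.items():
        eff = efficiency(speedup, cores)
        cpc = cells_per_core(TOTAL_CELLS, cores)
        print(f"{cores:4d} cores: {cpc:9.0f} cells/core, efficiency {eff:.2f}")
```

With these numbers, the 32-core run sits exactly at the quoted 50000 cells/core, and efficiency falls below 1 once the per-core share drops under that.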
Hope this helps.