CFD Online Discussion Forums - How can I run OpenFOAM to benchmark/compare two environment performance

- - How can I run OpenFOAM to benchmark/compare two environment performance (https://www.cfd-online.com/Forums/openfoam/180692-how-can-i-run-openfoam-benchmark-compare-two-environment-performance.html)

ZhouYoung

November 28, 2016 08:44

How can I run OpenFOAM to benchmark/compare two environment performance

Hello everyone, I'm a newcomer here. Thank you for your attention.

I was asked to test the performance of OpenFOAM on a cloud cluster, but I have no experience with this. As I understand it, OpenFOAM is a simulation platform, somewhat like Matlab. So should I run a typical case such as pitzDaily or cavity on the cloud cluster and then compare performance by run time?

I have run into two questions:
1. If I want to run a case (modified from one under tutorials), I need to increase the computational work so that I can put it on a cluster with hundreds of CPUs. Is there a big case under tutorials? Or how can I quickly modify one to be big enough?

2. Currently I can run interFoam with foamJob, but I get an error with "/usr/lib64/openmpi/bin/mpirun --hostfile ~/hostfil -np 6 simpleFoam -case /root/OpenFOAM/root-3.0.0/run/pitzDaily -parallel":
--> FOAM FATAL ERROR:
Cannot read "/root/OpenFOAM/root-3.0.0/run/pitzDaily/system/decomposeParDict"
Does anyone have an idea about this? Thank you!

floquation

November 28, 2016 09:41

Quote:

Originally Posted by ZhouYoung (Post 627208)

I was asked to test the performance of OpenFOAM on a cloud cluster, but I have no experience with this. As I understand it, OpenFOAM is a simulation platform, somewhat like Matlab. So should I run a typical case such as pitzDaily or cavity on the cloud cluster and then compare performance by run time?

For benchmarking you'll be interested in the execution time of a given simulation. Accuracy shouldn't matter, as the simulation result should be independent of the hardware, and I reckon you want to compare different hardware?

Quote:

Originally Posted by ZhouYoung (Post 627208)

I have run into two questions:
1. If I want to run a case (modified from one under tutorials), I need to increase the computational work so that I can put it on a cluster with hundreds of CPUs. Is there a big case under tutorials? Or how can I quickly modify one to be big enough?

Most tutorials are designed to be small so that they run quickly. If you want to increase the run time, the easiest method is either to force the solver to perform more iterations (lower the value of "tolerance" in system/fvSolution, or lower the value of "maxDeltaT" in system/controlDict so as to limit the timestep), or to use a finer mesh (constant/polyMesh/blockMeshDict). Naturally, 3D simulations will require considerably more time than 2D simulations.
The most effective method will be to use a finer mesh.

With such a 'sufficiently fine' mesh, any tutorial will do. For example, the "damBreak" case of the "interFoam" application.

Quote:

Originally Posted by ZhouYoung (Post 627208)

2. Currently I can run interFoam with foamJob, but I get an error with "/usr/lib64/openmpi/bin/mpirun --hostfile ~/hostfil -np 6 simpleFoam -case /root/OpenFOAM/root-3.0.0/run/pitzDaily -parallel":
--> FOAM FATAL ERROR:
Cannot read "/root/OpenFOAM/root-3.0.0/run/pitzDaily/system/decomposeParDict"
Does anyone have an idea about this? Thank you!

You have to decompose the mesh before running a simulation on multiple processors. Have a look at the following tutorial (especially Sec. 2.3.11 "running in parallel"):
http://cfd.direct/openfoam/user-guid...x7-630002.3.11
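
The missing file named in the error is the dictionary that decomposePar reads. As a minimal sketch (not taken from the thread), the Python helper below writes out the text of a system/decomposeParDict using the "simple" method. The function name decompose_par_dict and the delta value are assumptions for illustration; the one hard constraint it encodes is that numberOfSubdomains equals the product of the three split counts, which in turn must match the -np value passed to mpirun.

```python
def decompose_par_dict(nx, ny, nz=1):
    """Return the text of a system/decomposeParDict for the 'simple' method.

    nx, ny, nz are the number of pieces the domain is cut into along x, y, z.
    (Hypothetical helper, for illustration only.)
    """
    if min(nx, ny, nz) < 1:
        raise ValueError("each split count must be >= 1")
    n_sub = nx * ny * nz  # must equal the -np value given to mpirun
    return f"""FoamFile
{{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}}

numberOfSubdomains {n_sub};

method          simple;

simpleCoeffs
{{
    n           ({nx} {ny} {nz});
    delta       0.001;
}}
"""


# Example: 6 subdomains, matching "mpirun -np 6" from the original question.
print(decompose_par_dict(3, 2, 1))
```

After saving the output as system/decomposeParDict in the case directory, the usual sequence is to run decomposePar on the case and only then start the solver with mpirun and -parallel.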

elvis

November 28, 2016 15:00

http://openfoamwiki.net/index.php/Sig_HPC

What about taking test cases from the SIG Turbomachinery?
http://openfoamwiki.net/index.php/Sig_Turbomachinery

http://openfoamwiki.net/index.php/Si...ion_test_cases

ZhouYoung

November 29, 2016 03:13

Quote:

Originally Posted by floquation (Post 627220)

Yes, that is exactly what I mean! I need a big OpenFOAM job so that I can easily compare the performance of two environments by their completion time! :D

Quote:

Originally Posted by floquation (Post 627220)

Yes, I just tried this (using the damBreak case, changing tolerance to 1e-15 and maxDeltaT to 0.00001), and the run time of interFoam on 4 CPUs increased to 18 min. Quite useful and easy for me!

But regarding the finer mesh you mentioned: I found there is no constant/polyMesh/blockMeshDict under damBreak. Is that the same as system/blockMeshDict?
Do you mean I should add more elements to the blockMeshDict? Or is there some parameter that I can change easily?

Quote:

Originally Posted by floquation (Post 627220)

OK, thank you very much! Currently I can run foamJob on two machines, so I think I can read the tutorial and the foamJob script to find out what I am doing wrong!
Thank you very much again!

ZhouYoung

November 29, 2016 03:15

Quote:

Originally Posted by elvis (Post 627270)

These seem a little old. I'm using OpenFOAM 3.0.0 and I don't know whether they are compatible :confused:

floquation

November 29, 2016 03:30

Quote:

Originally Posted by ZhouYoung (Post 627344)

But regarding the finer mesh you mentioned: I found there is no constant/polyMesh/blockMeshDict under damBreak. Is that the same as system/blockMeshDict?
Do you mean I should add more elements to the blockMeshDict? Or is there some parameter that I can change easily?

It appears that the location of that file was changed in a recent version, so yes, that is the same file.

Find lines like these:

Code:

blocks
(
    hex (0 1 5 4 12 13 17 16) (23 8 1) simpleGrading (1 1 1)
    hex (2 3 7 6 14 15 19 18) (19 8 1) simpleGrading (1 1 1)
    hex (4 5 9 8 16 17 21 20) (23 42 1) simpleGrading (1 1 1)
    hex (5 6 10 9 17 18 22 21) (4 42 1) simpleGrading (1 1 1)
    hex (6 7 11 10 18 19 23 22) (19 42 1) simpleGrading (1 1 1)
);

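And increase the cell counts (the second group of three numbers on each line, e.g. (23 8 1), shown in bold in the original post) to use more cells, i.e. a finer mesh.
(Note that these numbers are linked: 19+4=23 and 42=42, etc. So if you change something, simply multiply all of the in-plane numbers by 2, for example, to retain those properties; the third number stays at 1 for this 2D case.)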
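
Multiplying every block by the same factors is what keeps the linked counts consistent. A small Python sketch of that edit is given below; the helper name scale_blocks and the regular expression are assumptions made for illustration, not part of OpenFOAM. It multiplies the cell counts of each hex entry and leaves the vertex labels and the simpleGrading group untouched.

```python
import re

# One blockMeshDict block entry: "hex (8 vertex labels) (nx ny nz)".
# The simpleGrading group is not matched because it is not preceded by "hex (...)".
_BLOCK = re.compile(r"(hex\s*\([^)]*\)\s*)\((\d+)\s+(\d+)\s+(\d+)\)")


def scale_blocks(text, fx=2, fy=2, fz=1):
    """Multiply the cell counts of every hex block by (fx, fy, fz).

    fz defaults to 1 so that a 2D case stays one cell thick.
    (Hypothetical helper, for illustration only.)
    """
    def repl(match):
        nx, ny, nz = (int(match.group(i)) for i in (2, 3, 4))
        return f"{match.group(1)}({nx * fx} {ny * fy} {nz * fz})"

    return _BLOCK.sub(repl, text)


line = "    hex (0 1 5 4 12 13 17 16) (23 8 1) simpleGrading (1 1 1)"
print(scale_blocks(line))
```

Because the same factor is applied to every block, relations such as 19+4=23 carry over to the scaled counts automatically.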

ZhouYoung

December 4, 2016 07:59

Strange test results

Quote:

Originally Posted by floquation (Post 627348)

It appears that the location of that file was changed in a recent version, so yes, that is the same file.

Find lines like these:

Code:

blocks
(
    hex (0 1 5 4 12 13 17 16) (23 8 1) simpleGrading (1 1 1)
    hex (2 3 7 6 14 15 19 18) (19 8 1) simpleGrading (1 1 1)
    hex (4 5 9 8 16 17 21 20) (23 42 1) simpleGrading (1 1 1)
    hex (5 6 10 9 17 18 22 21) (4 42 1) simpleGrading (1 1 1)
    hex (6 7 11 10 18 19 23 22) (19 42 1) simpleGrading (1 1 1)
);

OK, I changed the blocks as in the user guide and used a lower tolerance, and now it runs much longer! Thank you for your patient guidance.

But now I have further questions about the run time of the modified case:
1. I changed the tolerance to 1e-15 and maxDeltaT to 0.00001. First I ran with the default setting (mpirun -np 4, and simpleCoeffs 2 2 1); on two VMs (2 CPUs per VM) the run sometimes takes 17 min and sometimes 25 min (a large spread :confused:).
Then I changed it to 12 processes (2*6*1) and it ran even longer (about 28 min), while 9 processes (3*3*1) also took 17 min!
Quite strange, with no clear pattern! So what is the benefit of multiple cores?

PS: my test environment is connected by Mellanox IB NICs, so I don't think it is a network problem!

2. I changed the blocks setting following the user guide, and then set tolerance to 1e-14.
This case runs longer than the one above, and the result is the same. :rolleyes:
default 4 = 2*2 takes 28 min
9 = 3*3 takes 38 min
16 = 4*4 takes more than 680 min => I terminated it!

3. On the other hand, I found a 3D case called motorBike (incompressible/pisoFoam/les/motorBike). Its directory is not quite the same as cavity or damBreak: it provides an "Allrun" script and runs with 6 processes by default. From the Allrun script I found that it uses the runParallel function defined in bin/tools/RunFunctions. So could I simply change the definition of runParallel so that I can run the motorBike case on an MPI cluster?
I added "--hostfile hostfile" to the mpirun line of the runParallel function :p

floquation

December 5, 2016 04:13

Quote:

Originally Posted by ZhouYoung (Post 628211)

1. I changed the tolerance to 1e-15 and maxDeltaT to 0.00001. First I ran with the default setting (mpirun -np 4, and simpleCoeffs 2 2 1); on two VMs (2 CPUs per VM) the run sometimes takes 17 min and sometimes 25 min (a large spread :confused:).
Then I changed it to 12 processes (2*6*1) and it ran even longer (about 28 min), while 9 processes (3*3*1) also took 17 min!
Quite strange, with no clear pattern! So what is the benefit of multiple cores?

PS: my test environment is connected by Mellanox IB NICs, so I don't think it is a network problem!

I don't know much about hardware.
In terms of software: you are limiting your maximum $\Delta t$ to a very small value. Whether this value is really small or large depends on the physics, but for most cases it will be small.
As a consequence, I think that virtually nothing changes during a single timestep. Therefore, the processors finish calculating almost instantly and then have to communicate. If this is the case, adding more processors will simply lead to more communication without any added benefit, making the simulation slower.
In other words, you must give the processors more work to do to outweigh the time spent on communication.

How to fix:
Set the following in system/controlDict:

Code:

runTimeModifiable yes;
adjustTimeStep  yes;
maxCo           0.5;
maxAlphaCo      0.5;
maxDeltaT       1;

Then OpenFOAM will automatically choose the highest possible $\Delta t$. This will yield more work per timestep and thereby decrease the relative time spent on communication.

Quote:

Originally Posted by ZhouYoung (Post 628211)

3. On the other hand, I found a 3D case called motorBike (incompressible/pisoFoam/les/motorBike). Its directory is not quite the same as cavity or damBreak: it provides an "Allrun" script and runs with 6 processes by default. From the Allrun script I found that it uses the runParallel function defined in bin/tools/RunFunctions. So could I simply change the definition of runParallel so that I can run the motorBike case on an MPI cluster?
I added "--hostfile hostfile" to the mpirun line of the runParallel function :p

Those are the tutorial run functions. I personally never use them for my own cases - I write my own scripts - but I reckon they could be of use to you.

ZhouYoung

December 5, 2016 07:14

Still the same, results get worse with more cores

Quote:

Originally Posted by floquation (Post 628345)

So the problem comes back to the case being too small again?
The amount of data to compute is not large, so it would be better to just run it locally.

Quote:

Originally Posted by floquation (Post 628345)

How to fix:
Set the following in system/controlDict:

Code:

runTimeModifiable yes;
adjustTimeStep  yes;
maxCo           0.5;
maxAlphaCo      0.5;
maxDeltaT       1;

Then OpenFOAM will automatically choose the highest possible $\Delta t$. This will yield more work per timestep and thereby decrease the relative time spent on communication.

OK, I've tried this on a public cloud:
fvSolution: tolerance changed to 1e-14
controlDict: changed as above
blockMesh: changed as in the user guide (a little bigger)

Then I ran it on a 32-core VM:
mpirun -np 16 (decomposePar set to 4*4*1) takes only 32 s!
mpirun -np 32 (decomposePar set to 4*8*1) takes 44 s!
Then I ran it on a cluster of two 32-core VMs with -np 64 (8*8*1): it takes 11 min, more than 20 times as long as on 16 cores!

So changing $\Delta t$ does not cure this behaviour :(
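
The three timings above can be put on a common footing by expressing them as speedup and parallel efficiency relative to the 16-core run. The sketch below does only that arithmetic; the function name relative_scaling is hypothetical, and the 11 min figure is taken as 660 s, which is an assumption about rounding.

```python
def relative_scaling(timings, base_cores):
    """Return {cores: (speedup, efficiency)} relative to the base_cores run.

    speedup    = t_base / t
    efficiency = speedup * base_cores / cores   (1.0 means ideal scaling)
    """
    t_base = timings[base_cores]
    result = {}
    for cores, seconds in sorted(timings.items()):
        speedup = t_base / seconds
        result[cores] = (speedup, speedup * base_cores / cores)
    return result


# Wall-clock times in seconds as reported in this post (11 min assumed to be 660 s).
reported = {16: 32.0, 32: 44.0, 64: 660.0}

for cores, (speedup, efficiency) in relative_scaling(reported, 16).items():
    print(f"{cores:3d} cores: speedup {speedup:.3f}, efficiency {efficiency:.3f}")
```

A speedup below 1 means the run with more cores was slower than the baseline, which is the case for both the 32-core and the 64-core runs here.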

Quote:

Originally Posted by floquation (Post 628345)

Those are the tutorial run functions. I personally never use them for my own cases - I write my own scripts - but I reckon they could be of use to you.

OK, I will test that case to check whether the behaviour is still there!

floquation

December 5, 2016 08:36

Quote:

Originally Posted by ZhouYoung (Post 628363)

So the problem comes back to the case being too small again?

I didn't say that.
I said the workload per timestep is too small, meaning too much communication is required for a parallel computation to be worth it.

Quote:

Originally Posted by ZhouYoung (Post 628363)

Then I ran it on a 32-core VM:
mpirun -np 16 (decomposePar set to 4*4*1) takes only 32 s!
mpirun -np 32 (decomposePar set to 4*8*1) takes 44 s!
Then I ran it on a cluster of two 32-core VMs with -np 64 (8*8*1): it takes 11 min, more than 20 times as long as on 16 cores!

So changing $\Delta t$ does not cure this behaviour :(

It does if you'd make the case sufficiently big. Your case runs within a minute - that's way too little of a workload.

Increase the number of cells in each direction by a large factor, say 10, while using the controlDict settings for $\Delta t$ I mentioned in my last post. Then the total workload should increase by a factor of 1000 for a 2D simulation, which consists of a factor of ~10 more timesteps and a factor of ~100 more work per timestep. As you can see, this increases the work that each processor has to do by a greater factor (x100) than the number of times they must communicate (x10).
If that doesn't work, use a yet bigger factor.
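
The factor-of-1000 estimate can be written out as a short scaling argument. This is a sketch under assumptions that go beyond the post: the timestep is set by the Courant limit (adjustTimeStep yes), the velocity scale $U$ is unchanged by the refinement, and the cost of one timestep grows linearly with the cell count. The symbols $r$ (refinement factor per direction), $d$ (number of refined directions), $N$ (cell count), $n_{\text{steps}}$ (number of timesteps) and $W$ (total work) are introduced here purely for illustration.

$$
\mathrm{Co} = \frac{U\,\Delta t}{\Delta x} \le \mathrm{Co}_{\max}
\quad\Longrightarrow\quad
\Delta t_{\max} \propto \Delta x
$$

$$
\Delta x \to \frac{\Delta x}{r}:
\qquad
N \to r^{d} N,
\qquad
n_{\text{steps}} \to r\, n_{\text{steps}},
\qquad
W \propto N\, n_{\text{steps}} \to r^{d+1}\, W
$$

With $r = 10$ this gives $10^{2}$ times more work per timestep and $10^{3}$ times more work in total for $d = 2$, and $10^{3}$ times more work per timestep for $d = 3$. In practice the linear solvers may need more iterations on a finer mesh, so the real cost could grow faster than this estimate.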

In a 3D case this is even more effective, as refining your mesh by a factor of 10 means 1000 times more work per communication step. If your case requires that much work, it becomes beneficial to use more processors, as the time spent communicating becomes (in a relative sense) smaller and smaller.
In your case, I reckon some processors are doing work on as few as 10 cells, instead of 1,000,000 cells.
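
To get a feel for how few cells each process receives, the cell counts in the blocks list can simply be summed and divided by the number of processes. The sketch below does this for the default damBreak blocks quoted earlier in the thread (the mesh actually used in the timed runs was "a little bigger", so these figures are only indicative). The helper name total_cells and the regular expression are assumptions for illustration.

```python
import re

# Captures the (nx ny nz) cell counts that follow each "hex (...)" vertex list.
_DIVISIONS = re.compile(r"hex\s*\([^)]*\)\s*\((\d+)\s+(\d+)\s+(\d+)\)")


def total_cells(blocks_text):
    """Sum nx*ny*nz over all hex blocks found in the given text."""
    return sum(int(nx) * int(ny) * int(nz)
               for nx, ny, nz in _DIVISIONS.findall(blocks_text))


DEFAULT_BLOCKS = """
hex (0 1 5 4 12 13 17 16) (23 8 1) simpleGrading (1 1 1)
hex (2 3 7 6 14 15 19 18) (19 8 1) simpleGrading (1 1 1)
hex (4 5 9 8 16 17 21 20) (23 42 1) simpleGrading (1 1 1)
hex (5 6 10 9 17 18 22 21) (4 42 1) simpleGrading (1 1 1)
hex (6 7 11 10 18 19 23 22) (19 42 1) simpleGrading (1 1 1)
"""

cells = total_cells(DEFAULT_BLOCKS)
for ranks in (4, 16, 64):
    print(f"{ranks:3d} processes: about {cells // ranks} cells each")
```

The default mesh has 2268 cells in total, so even an even split over 64 processes leaves only a few dozen cells per process.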

ZhouYoung

December 7, 2016 10:01

Quote:

Originally Posted by floquation (Post 628374)

I didn't say that.
I said the workload per timestep is too small, meaning too much communication is required for a parallel computation to be worth it.

OK, you mean that in my case I only changed the tolerance, but the computing task and its scale are small, so that communication even costs more CPU time than the computation itself!

Quote:

Originally Posted by floquation (Post 628374)

Take the default blockMeshDict as an example:

Code:

blocks
(
    hex (0 1 5 4 12 13 17 16) (23 8 1) simpleGrading (1 1 1)
    hex (2 3 7 6 14 15 19 18) (19 8 1) simpleGrading (1 1 1)
    hex (4 5 9 8 16 17 21 20) (23 42 1) simpleGrading (1 1 1)
    hex (5 6 10 9 17 18 22 21) (4 42 1) simpleGrading (1 1 1)
    hex (6 7 11 10 18 19 23 22) (19 42 1) simpleGrading (1 1 1)
);

So, should I add more blocks? If I modify the cell counts (bold in the original post) as below, does that mean the number of cells increases by 10*10 = 100 times, so that I get a bigger factor?

Code:

blocks
(
    hex (0 1 5 4 12 13 17 16) (230 80 1) simpleGrading (1 1 1)
    hex (2 3 7 6 14 15 19 18) (190 80 1) simpleGrading (1 1 1)
    hex (4 5 9 8 16 17 21 20) (230 420 1) simpleGrading (1 1 1)
    hex (5 6 10 9 17 18 22 21) (40 420 1) simpleGrading (1 1 1)
    hex (6 7 11 10 18 19 23 22) (190 420 1) simpleGrading (1 1 1)
);

Quote:

Originally Posted by floquation (Post 628374)

In a 3D case this is even more effective, as refining your mesh by a factor of 10 means 1000 times more work per communication step. If your case requires that much work, it becomes beneficial to use more processors, as the time spent communicating becomes (in a relative sense) smaller and smaller.
In your case, I reckon some processors are doing work on as few as 10 cells, instead of 1,000,000 cells.

I am still working on the tutorials/incompressible/simpleFoam/motorBike case! I also want to get a larger factor with the motorBike case. Is the change above OK?
After I increased the values in blocks by a factor of 10 (ux, uy, uz), the motorBike run time increased from 6 min to 70 min on a 4U8G VM.

All times are GMT -4.