CFD Online (www.cfd-online.com), forum: OpenFOAM Running, Solving & CFD

Large case parallel efficiency

Post #21 by lakeat (Daniel WEI (老魏)), March 16, 2011, 10:52
One more question concerning nCellsInCoarsestLevel: how should it be set?

sqrt(total number of cells), 10~30, or something else?
Is SAMG much better than GAMG? Does anyone have experience with that?
When is DICGaussSeidel superior to GaussSeidel? Or are there even better options?

(Suppose this is an external incompressible flow solved with pisoFoam on a hex mesh with 10 million grid points.)

Thanks
__________________
~
Daniel WEI
-------------
Boeing Research & Technology - China
Beijing, China
Email

Post #22 by olivierG (Olivier), March 17, 2011, 10:40
Hello,

You can have a look at this interesting thread.
The advice there is to keep this value small (10-20).

Olivier

Post #23 by lakeat (Daniel WEI (老魏)), March 17, 2011, 10:48
I have compared different nCellsInCoarsestLevel values for both a 5 million grid point case and a 10 million grid point case, but I could not draw any conclusion. For the 5 million case, 100 seems better than 10~20; for the 10 million case, I even found it good to set it to around 3000.
That's why I feel kind of confused.

Last edited by lakeat; March 17, 2011 at 11:11.

Post #24 by arjun (Arjun), March 18, 2011, 08:53
Quote (originally posted by lakeat):
> I have compared different nCellsInCoarsestLevel values for both a 5 million grid point case and a 10 million grid point case, but I could not draw any conclusion. For the 5 million case, 100 seems better than 10~20; for the 10 million case, I even found it good to set it to around 3000.
> That's why I feel kind of confused.

A large number is usually (almost always) better.

Anyway, one good thing to come out of that thread is that I designed a new multigrid scheme that is very efficient. It exploits the fact that the direct solver size affects the solution rate.

Previously, bi-conjugate gradient preconditioned by AMG was the best combination I had tried. Now the timings with the new method are at least two times faster than the previous best. I will be testing it against W and F cycles in the coming days.

:-D

Post #25 by lakeat (Daniel WEI (老魏)), March 18, 2011, 09:39
Quote (originally posted by arjun):
> A large number is usually (almost always) better.

In my tests it is "not always" better: too large a number actually slows the solver down. So I feel confused, not knowing what the optimal number is.

Quote:
> Anyway, one good thing to come out of that thread is that I designed a new multigrid scheme that is very efficient. It exploits the fact that the direct solver size affects the solution rate.

Which one?

Quote:
> Previously, bi-conjugate gradient preconditioned by AMG was the best combination I had tried.

You mean GAMG?

Quote:
> Now the timings with the new method are at least two times faster than the previous best. I will be testing it against W and F cycles in the coming days.

I'd like to hear about that. I have heard of this technique for quite a while, but I am not sure how fast it is compared with GAMG. Could you please elaborate on this? Any experience? Much appreciated.

Post #26 by lakeat (Daniel WEI (老魏)), March 18, 2011, 09:49
Also, it seems that scotch sometimes behaves better than metis. And I don't know how to tune the hierarchical method to find an optimal decomposition, so I gave up.

Post #27 by arjun (Arjun), March 18, 2011, 10:34
Quote (originally posted by lakeat):
> In my tests it is "not always" better: too large a number actually slows the solver down. So I feel confused, not knowing what the optimal number is.

Too large a direct solver size takes more time. You said you tried 3000; for me 3000 takes a lot of time, but "a lot of time" is relative. For a 10 million cell case it may be negligible, while for a case with 10000 cells it may be too much.

Quote (originally posted by lakeat):
> Which one?

Difficult to answer, since it is a new one.

Quote (originally posted by lakeat):
> You mean GAMG?
>
> I'd like to hear about that. I have heard of this technique for quite a while, but I am not sure how fast it is compared with GAMG. Could you please elaborate on this? Any experience? Much appreciated.

In 2007, there was a paper by Jasak where he tried BiCG with AMG as the preconditioner. (Or was it CG with AMG?)

Anyway, the crux was that the solver in which AMG used a W cycle was the fastest. Compared with a V cycle, a W cycle tries to solve the problem on the coarser levels to better convergence. With a direct solver, you do not iterate; you solve the coarse problem to machine precision.
In both cases you are essentially trying to do the same thing: solve the coarse problem to as high a convergence level as possible without taking much time.
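
To make the V-versus-W comparison concrete, here is a minimal sketch that counts how often each grid level is visited during one multigrid cycle. It is not taken from any solver mentioned in this thread: the function name `visits_per_level`, the five-level hierarchy and the use of a cycle index `gamma` (1 for a V cycle, 2 for a W cycle) are assumptions made for illustration only.

```python
def visits_per_level(levels: int, gamma: int) -> list[int]:
    """Count how often each grid level is visited in one multigrid cycle.

    gamma is the cycle index: 1 gives a V cycle, 2 gives a W cycle.
    Level 0 is the finest grid, level `levels - 1` is the coarsest.
    """
    counts = [0] * levels

    def cycle(level: int) -> None:
        counts[level] += 1
        if level == levels - 1:
            return  # coarsest level: the problem is solved here, no recursion
        for _ in range(gamma):
            cycle(level + 1)  # a W cycle recurses twice into the coarser level

    cycle(0)
    return counts


# Hypothetical five-level hierarchy, for illustration only
v_cycle = visits_per_level(levels=5, gamma=1)
w_cycle = visits_per_level(levels=5, gamma=2)
print("V cycle visits per level:", v_cycle)
print("W cycle visits per level:", w_cycle)
```

The count on the coarsest level grows as gamma to the power of the level index, which is one way to see why a W cycle spends much more effort on the coarse problem than a V cycle does.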

What I improved is to replace the W cycle with a scheme that is more robust and converges better (on the coarser levels), and that takes similar or less time (but gives much higher convergence).
So far I am seeing a speed-up, but I have yet to make serious tests and comparisons, especially with large sizes and W cycles.


PS: I am using my own C++ library, but the same scheme could be applied to any AMG scheme.

Post #28 by arjun (Arjun), March 18, 2011, 10:38
Quote (originally posted by lakeat):
> Also, it seems that scotch sometimes behaves better than metis. And I don't know how to tune the hierarchical method to find an optimal decomposition, so I gave up.

Not related to OpenFOAM, but I wrote a small parallel data exchange code that I am using. The good thing about it is that, no matter what partitioning you use, its parallel efficiency does not go down. So far, at least in my code, partitioning is not an issue. It is roughly 5 times faster than Fluent :-D (needless to say, I am happy and relaxing :-D).

Post #29 by lakeat (Daniel WEI (老魏)), March 18, 2011, 10:48
Quote:
> Too large a direct solver size takes more time. You said you tried 3000; for me 3000 takes a lot of time, but "a lot of time" is relative. For a 10 million cell case it may be negligible, while for a case with 10000 cells it may be too much.

Agreed, this is exactly what I have observed. I just hope there is a better way to find this number quickly, so that we do not need to tune it over and over. I think it depends on CPU capability, available RAM, simulation complexity and, just as you said, whether the cost is relatively negligible. That is what confuses me; I hope someone will propose a better and faster method for choosing this number. Please correct me if I am wrong.

Quote:
> What I improved is to replace the W cycle with a scheme that is more robust and converges better (on the coarser levels), and that takes similar or less time (but gives much higher convergence).
> So far I am seeing a speed-up, but I have yet to make serious tests and comparisons, especially with large sizes and W cycles.

Interesting. When will you share your work with the public?
I also hope that the improvement is not just channel-flow friendly or box-turbulence friendly.
Thanks

Post #30 by lakeat (Daniel WEI (老魏)), March 18, 2011, 10:58
Quote (originally posted by flavio_galeazzo):
> I normally allocate the nodes for simulation aiming for 1 second per time step, which is a good value in the system I work with.

Hmm, a serious question, because I have had no luck reaching this "1 second per time step" after trying different combinations. My question is this: theoretically, what is the best efficiency I can expect from parallel computing, given a single type of cluster architecture?

For my case, I am not fully sure, but having tried different CPUs, I found that 4 s per time step is the best I can get.

Is there any "formula"?

Post #31 by lakeat (Daniel WEI (老魏)), March 18, 2011, 10:59
Quote (originally posted by arjun):
> Not related to OpenFOAM, but I wrote a small parallel data exchange code that I am using. The good thing about it is that, no matter what partitioning you use, its parallel efficiency does not go down. So far, at least in my code, partitioning is not an issue. It is roughly 5 times faster than Fluent :-D (needless to say, I am happy and relaxing :-D).

Wow, would you mind elaborating on this? lol

Post #32 by arjun (Arjun), March 18, 2011, 16:34
Quote (originally posted by lakeat):
> Interesting. When will you share your work with the public?
> I also hope that the improvement is not just channel-flow friendly or box-turbulence friendly.
> Thanks

Nothing special: I replaced the W cycle with another BiCG solver that again uses V-cycle AMG as its preconditioner. It solves to very high convergence levels (1e6 times). So now it is as if the direct solver size were, say, 1 million cells or less. This is done on coarse enough levels that there is no penalty from the additional BiCG routine. In a way it is a poly-BiCG-AMG solver. It is now roughly 2 times faster than the normal BiCG-AMG.

Post #33 by arjun (Arjun), March 18, 2011, 16:42
Quote (originally posted by lakeat):
> Wow, would you mind elaborating on this? lol

I was writing an immersed boundary code (SIMPLE algorithm), so it is of the local refinement type. I do not partition the mesh with metis or any other tool. I have written the grid generator so that it writes a mesh for each partition. This way one can handle meshes from, say, 100 million to 2-3 billion cells.
(I added a few things to the Fluent mesh format, mainly information related to parallel interfaces. If others follow that too, we could have a universal mesh format for parallel calculations.)

Anyway, the trick to efficiency is that when data is exchanged, there should be no collisions. This means that if a process is sending data to another process, no other process should be sending data to that same process at the same time. The program takes care of this. This is why data exchange remains highly efficient no matter how I partition.
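
One common way to get an exchange pattern with this property is a round-robin ("circle method") schedule, in which every process talks to exactly one partner per round. The sketch below is only an illustration of that general idea and is not the code described in the post: the function name `exchange_rounds`, the use of `-1` as a dummy partner and the six-process example are all assumptions.

```python
def exchange_rounds(n_procs: int) -> list[list[tuple[int, int]]]:
    """Return rounds of disjoint process pairs (round-robin circle method).

    Within one round every rank appears in at most one pair, so no rank
    receives data from two senders at the same time.
    """
    ranks = list(range(n_procs))
    if n_procs % 2 == 1:
        ranks.append(-1)  # dummy partner: the rank paired with it idles
    n = len(ranks)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = ranks[i], ranks[n - 1 - i]
            if a != -1 and b != -1:
                pairs.append((min(a, b), max(a, b)))
        rounds.append(pairs)
        # keep the first rank fixed and rotate the others by one position
        ranks = [ranks[0]] + [ranks[-1]] + ranks[1:-1]
    return rounds


# Hypothetical six-process run, for illustration only
schedule = exchange_rounds(6)
for r, pairs in enumerate(schedule):
    print(f"round {r}: {pairs}")
```

In a real solver only the pairs that actually share a partition interface would exchange data in each round; the schedule above simply covers all possible pairs.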

Post #34 by eugene (Eugene de Villiers), April 7, 2011, 05:51
Quote (originally posted by lakeat):
> Hello all,
>
> 2. I am still not clear how you make hierarchical a better option than metis. Are you aware of any general rules, or do you have any experience showing that it is far better than metis? If not, I am going to stay with metis.
>
> Thanks

Daniel,

We have run some additional tests and found that metis/scotch is in general better than hierarchical, which differs from what we thought previously. Hierarchical can be better under very specific circumstances where the cell distribution is favourable, but for cases with highly non-homogeneous cell density, metis/scotch has shown up to 20% better performance in some instances. The only problem for us at the moment is that our version of snappyHexMesh does not generally work with parMetis under 1.6; we have not tried to debug this yet. I hope the parallel scotch in 1.8 will rectify this.

Post #35 by lakeat (Daniel WEI (老魏)), April 7, 2011, 09:24
Haha, I tried hierarchical for many cases and got the impression that it is too tricky to use, so I gave up.

By the way, just curious: are you all able to make every time step run within one physical second, even when you are doing time-averaging operations?

Post #36 by flavio_galeazzo (Flavio Galeazzo), April 8, 2011, 03:14
Quote (originally posted by lakeat):
> By the way, just curious: are you all able to make every time step run within one physical second, even when you are doing time-averaging operations?

I like to use the "1 second rule" for my simulations, but I can't follow it every time because of the available resources. All my simulations have scaled well up to now, even with time averaging turned on, and I am confident that all of them could run within one physical second per time step. Of course, the system architecture has to be adequate. With a machine using InfiniBand, I got good speed-up (say, more than 80%) for up to 256 processors in 64 nodes.
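
For readers who want to put a number on "good speed-up", here is a minimal sketch of the usual definition of parallel efficiency relative to a reference run. The function name `parallel_efficiency` and the timings are hypothetical and chosen only for illustration; they are not measurements from this thread.

```python
def parallel_efficiency(t_ref: float, n_ref: int, t_n: float, n: int) -> float:
    """Efficiency of a run on n cores relative to a reference run on n_ref cores.

    t_ref and t_n are wall-clock times (e.g. seconds per time step).
    """
    speedup = t_ref / t_n  # measured speed-up relative to the reference run
    ideal = n / n_ref      # ideal (linear) speed-up
    return speedup / ideal


# Hypothetical timings in seconds per time step (illustrative values only)
eff = parallel_efficiency(t_ref=32.0, n_ref=8, t_n=1.2, n=256)
print(f"efficiency on 256 cores relative to 8 cores: {eff:.1%}")
```

Using a multi-core reference run instead of a serial one is often more practical for large cases, since a 10 million cell mesh may not fit or may run too slowly on a single core.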

Post #37 by jdiorio, May 17, 2011, 14:30
A (more basic) question:

Does anyone have a recommended method to determine whether your system is performing optimally? I've run a simple case (3D cavity tutorial with 1M cells) repeatedly on various numbers of processors (N=8,16,..,64) and I get very erratic results. The run times (even when using the same number of processors) can vary dramatically (by more than a factor of 4). The jobs are submitted through an LSF scheduler and I've tried a few things to test this issue, like submitting the job sequentially (e.g. 3 times in a row) on the same set of processors, or running three instances of the job simultaneously on three different sets of N processors (if that makes sense). We added some code to Pstream to output the total amount of time each processor spent in MPI, and we find that MPI time can vary substantially across the cores, which we've been interpreting as some of the cores running slower. However, this slowdown isn't repeatable, and running the job on the same nodes doesn't reproduce the issue.

I apologize because my knowledge of MPI is essentially zero. Is this a hardware issue or a problem with how the cluster is constructed? Or is it some setting in MPI? I can provide more information about the details (cluster architecture, etc.) but wanted to describe the problem in general first. I am running OF-1.6 using the supplied openmpi-1.3.3. Any thoughts or avenues of investigation would be great.
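
As one possible way to summarise such per-rank MPI timings, the sketch below computes a max-over-mean imbalance ratio and lists ranks that deviate strongly from the mean. The function names, the 50% tolerance and the timing values are all assumptions for illustration; they do not come from the Pstream instrumentation mentioned above. Note that a rank with unusually high MPI time is often waiting on a slower peer, so an outlier shows where to look rather than which core is slow.

```python
from statistics import mean


def imbalance(mpi_seconds: list[float]) -> float:
    """Max-over-mean ratio of per-rank MPI time; 1.0 means perfectly even."""
    return max(mpi_seconds) / mean(mpi_seconds)


def outlier_ranks(mpi_seconds: list[float], tol: float = 0.5) -> list[int]:
    """Ranks whose MPI time deviates from the mean by more than tol * mean."""
    avg = mean(mpi_seconds)
    return [rank for rank, t in enumerate(mpi_seconds)
            if abs(t - avg) > tol * avg]


# Hypothetical per-rank MPI times in seconds (illustrative values only)
times = [10.0, 11.0, 9.0, 10.0, 30.0, 10.0, 10.0, 10.0]
print("imbalance ratio:", imbalance(times))
print("outlier ranks:", outlier_ranks(times))
```

Comparing these two numbers across repeated runs of the same job would show whether the outliers stay on the same ranks or move around from run to run.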

Post #38 by lakeat (Daniel WEI (老魏)), May 17, 2011, 14:37
Just to double-check:
1. What do you mean by "I get very erratic results"? Are you saying that the results are completely different each time?
2. Is anyone else using your compute nodes or your memory? If so, that would cause differences.

Post #39 by jdiorio, May 17, 2011, 15:35
Thanks for the reply, Daniel.

1. The results from the simulation (i.e. the flow field) are the same every time. The run time (the time needed to do the same number of iterations) varies greatly.

2. I am aware of this, and agree that I can't really control for it. However, that's why I ran 3 jobs at the same time (i.e. the exact same simulation, just on different nodes): I figured these would be subject to the same network traffic at that time. Even these cases can be very different (by a factor of ~2). Furthermore, I've gotten into the habit of checking the cluster load when I submit. I've submitted jobs when there was little to no load, and they have taken much longer than jobs submitted when network traffic was medium/high.

I'll add that I've also looked at how the job is distributed (i.e. 8 cores on 1 node, 1 core on each of 8 different nodes, etc. Note: each node has 4 dual-core processors), and I don't see a definite pattern. That is, running a job with N=64 as 8 x 8 is not clearly better than some other distribution.

Post #40 by flavio_galeazzo (Flavio Galeazzo), May 18, 2011, 13:27
Hi jdiorio,

If you cannot guarantee that the nodes work only for you, there is no surprise that your computation time varies greatly. One time the machine is working for your job only, and another time it is splitting the resources between X jobs.

If you are using the LSF scheduler, it is possible to reserve the nodes for your job only. Then your results will be consistent.
