
Large case parallel efficiency

May 18, 2011, 13:38   #41
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
I came across this website; is anyone interested?
WSMP: Watson Sparse Matrix Package (Version 11.1.19)

And I'm curious: what is the largest case ever simulated with OpenFOAM, and how many CPUs were used?

May 18, 2011, 14:50   #42
jdiorio, New Member
Quote:
Originally Posted by flavio_galeazzo View Post
Hi jdiorio,

If you cannot guarantee that the nodes work only for you, there is no surprise that your computation time varies greatly. One time the machine is working for your job only, and another time it is splitting the resources between X jobs.

If you are using the LSF scheduler, it is possible to reserve the nodes for your job only. Then your results will be consistent.
Thanks for the response, Flavio.

But isn't that the purpose of the LSF scheduler (i.e., it should only assign processors that are available and not running other jobs)? As I mentioned, the problem does not occur only during heavy-load periods, and I have monitored the CPU usage during the runs once the machines have been assigned. But I'm certainly willing to give it a try. Do you happen to know the bsub options that reserve the nodes solely for one job? I think the bsub -R option could do it by requesting only machines that are lightly loaded. I could also try submitting to the priority queue on the cluster to see if that makes a difference.

August 22, 2011, 09:13   #43
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
FYI, I just saw the presentation (PPT) by Dr. Jasak.

So is AMG currently not good for scalability? Then which solver are you all using?

August 22, 2011, 10:05   #44
alberto (Alberto Passalacqua), Senior Member, Ames, Iowa, United States
His statement is that AMG scales worse as a strategy than Krylov solvers (gradient methods, to use a term probably more familiar to students), but that AMG solvers require only about one third of the iterations.

You see this very clearly on the pressure equation, for example, where GAMG significantly beats the other methods.

I generally tend to use GAMG for the pressure and conjugate gradients for the other variables.
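
For anyone who wants to try the same split, a minimal fvSolution sketch along these lines is given below. It is only an illustration of the idea, not Alberto's actual settings; tolerances, the smoother and the agglomeration parameters need to be tuned per case and per OpenFOAM version.

Code:
solvers
{
    p
    {
        solver                 GAMG;          // multigrid on the (symmetric) pressure equation
        smoother               GaussSeidel;
        cacheAgglomeration     true;
        nCellsInCoarsestLevel  100;           // stop agglomerating around this many cells
        agglomerator           faceAreaPair;
        mergeLevels            1;
        tolerance              1e-07;
        relTol                 0.01;
    }

    U
    {
        solver                 PBiCG;         // bi-conjugate gradient for the asymmetric momentum matrix
        preconditioner         DILU;
        tolerance              1e-07;
        relTol                 0;
    }
}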

August 22, 2011, 10:10   #45
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
Quote:
AMG solvers require only about one third of the iterations.
Thank you. Is this rough estimate based on a single processor or on multiple processors? I am not talking about using a Krylov method; I am just wondering whether there is any better option than the AMG solver, one that works best so far. Thanks.

August 22, 2011, 10:19   #46
alberto (Alberto Passalacqua), Senior Member, Ames, Iowa, United States
Hi,

it is quite general. In my experience, using GAMG on the pressure equation for large cases leads to very nice improvements in performance (a much lower number of iterations; 1/3 is a bit on the pessimistic side, since in many cases the improvement is larger).

Best,

August 22, 2011, 20:58   #47
arjun (Arjun), Senior Member, Nurenberg, Germany
Quote:
Originally Posted by lakeat View Post
FYI, I just saw the presentation (PPT) by Dr. Jasak.

So is AMG currently not good for scalability? Then which solver are you all using?
I read that presentation a long time ago. My opinion is that it would be better to say that *his* implementation of AMG does not scale well.


Here is a small note about parallel AMG (again, a very personal opinion).

There are three major types of AMG in common use (there are many more, but these are the main ones based on current usage):

1. Additive corrective multigrid, which OpenFOAM, Fluent, STAR-CCM+, etc. use.
2. Classical AMG (Ruge-Stueben).
3. Smoothed aggregation AMG.

The major sources of problems in parallelization are:
1. The performance of the smoother degrades. For example, Gauss-Seidel behaves more like a Jacobi smoother in parallel.
Here is one proposed remedy: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf
I will switch to this approach in the coming months as I find time to implement it.

2. The setup of the coarse levels. Some multigrids are easy to set up and some are difficult.
Of the three types above, #1 (additive corrective) is the easiest to set up and #3 (smoothed aggregation) is the most difficult.

3. Performance degradation due to communication, and very small coarse-level systems, typically with fewer equations than the number of processors used.


A few side notes in addition to the comments above.

Multigrid #1 (OpenFOAM's) generates a large number of coarse levels (typically 10 to 20, because each level has roughly Nfine/2 or Nfine/3 equations). This means that all the problems I mentioned apply at every level.
Multigrids #2 and #3 are much more difficult to implement, but the drop in the number of equations per level is much larger (a factor of 8 to 10). Here is a real example from one of my setups:

Finest level cells = 130208083
Coarse level cells = 14420920
Coarse level cells = 267305
Single-processor matrix size = 267305 at level 2

(Note that in this case I used 267305 equations at the coarsest level.) This means that for 130 million cells I only need 3 to 4 levels.


Jasak is saying that because of reason #3, communication issues and the small coarse-level systems, AMG performance will degrade.

My opinion is that the communication issue depends mainly on the communication algorithm. In MY case I have no complaints. If he can improve this, he could improve the efficiency.

The coarse-level equations SEEM to be a problem, but in practical use I do not think they matter much. Here is why (again, my opinion).

Imagine you are running a calculation on a large number of processors, say 2000. Then in a perfectly balanced setup your coarsest-level matrix would have on the order of 2000 equations.
This can be solved quickly by each processor and does not add much cost.
If the partitioning is not good, you would be looking at roughly 2000 x 10 equations, which can also be solved quickly on a single processor (I use a single-processor BiCG + AMG solver to do so).


Furthermore, one should use a #2 or #3 AMG instead of a #1 AMG for these reasons:

1. Fewer coarse levels are generated.
2. Much better convergence compared to #1.
3. It takes much less memory.
4. It can be used for difficult problems such as FEM; that is, one AMG fits all.

This is why I no longer take an interest in additive corrective AMG. It is easy, but not so rewarding.

August 22, 2011, 21:15   #48
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
Wow, what a long post! Good, and thank you. I would like to buy you a drink the next time we meet.

So far, ICCG works better than AMG for me.

August 22, 2011, 21:52   #49
arjun (Arjun), Senior Member, Nurenberg, Germany
Quote:
Originally Posted by lakeat View Post
Wow, what a long post! Good, and thank you. I would like to buy you a drink the next time we meet.

So far, ICCG works better than AMG for me.
Thank you, but I do not drink. If you ever visit Japan, let me know; my wife is a very good cook ;-)

For me, these are the best-performing solvers I have written so far:

1. A Cartesian-grid-based DNS solver (immersed boundary).
It converges in 1-3 iterations for very large cases, like 3-4 billion cells, and uses SSOR plus a direct solver (FFT + block cyclic reduction). This is the fastest thing I have seen so far. I am happy that I designed it.

2. A full AMG for locally refined Cartesian meshes, a solver for immersed boundaries.
Again, it can converge in 2-3 iterations of the stabilized BiCG algorithm for 250-500 million cells.
Internally it uses bi-conjugate gradient, Gauss-Seidel, FFT and block cyclic reduction
(that is, it uses all types of solvers in ONE).

3. A smoothed aggregation preconditioned bi-conjugate gradient method.
Tested so far on 100-200 million points. Very fast, but slower than #1 and #2.

August 29, 2011, 15:56   #50
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
I am still testing a case with a mesh of about 100 million cells, an unsteady problem on 1000 processors with InfiniBand support.

I am not sure which cycle is better: V, W or F? The last time I tried the F-cycle, I got an obvious floating point exception.
I am also still not clear on how to set minCoarseEqns and nMaxLevels. Could someone in the know shed some light on this?
Thanks a lot.
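
(Not an answer to the tuning question, but for readers wondering where these keywords live: minCoarseEqns, nMaxLevels and the cycle type are entries of the AMG solver dictionary in fvSolution. A rough sketch of such a block follows, assuming the foam-extend style amgSolver dictionary that uses these keyword names; the exact keywords, available cycles and sensible values differ between versions, so treat it only as a pointer.)

Code:
p
{
    solver          amgSolver;   // AMG solver, foam-extend style (assumed)
    cycle           V-cycle;     // V-cycle, W-cycle or F-cycle
    policy          AAMG;        // agglomeration policy
    nPreSweeps      0;
    nPostSweeps     2;
    groupSize       4;           // equations merged per coarse equation
    minCoarseEqns   4;           // stop coarsening below this many equations
    nMaxLevels      100;         // upper bound on the number of coarse levels
    scale           on;
    smoother        ILU;
    tolerance       1e-07;
    relTol          0.01;
}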

August 29, 2011, 22:20   #51
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
From my tests so far on a smaller case (4M cells), the V-cycle is better, and I found that minCoarseEqns < 5 works best.

But the best performance still comes from using ICCG directly.

My goal is to reach a one-step-one-second rule with the multigrid technique. So is there any FORMULA for setting minCoarseEqns, etc.?
Thanks

August 30, 2011, 20:46   #52
arjun (Arjun), Senior Member, Nurenberg, Germany
Quote:
Originally Posted by lakeat View Post

My goal is to reach a one-step-one-second rule with the multigrid technique. So is there any FORMULA for setting minCoarseEqns, etc.?
Thanks

If you are running 100 million cells on 1000 processors, then I do not think you will reach one time step per second.

I am running a small test problem (not in OpenFOAM) with 88 million cells (SIMPLE algorithm); my timing on 32 processors is 70 seconds per time step with 3 sub-iterations, i.e. about 70/3 seconds per iteration.

With 3 sub-iterations, the estimated time for 100 million cells on 1000 processors with my solver, assuming everything scales linearly, is 70 s x (100/88) x (32/1000), i.e. about 2.5 seconds per time step.

Since my solver is not perfect, in theory you could achieve 1 second per time step, but in practice I doubt you will come even close to one time step in 5-10 seconds.

August 31, 2011, 11:17   #53
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
Quote:
Originally Posted by arjun View Post
If you are running 100 million cells on 1000 processors, then I do not think you will reach one time step per second.

It would be desirable if, some day, a One-Second-One-Second rule could be realized, that is, one second of simulated time per second of wall-clock time: real-time simulation.

I am just surprised that in some situations using ICCG directly is better than using the multigrid technique.

September 1, 2011, 01:13   #54
arjun (Arjun), Senior Member, Nurenberg, Germany
Quote:
Originally Posted by lakeat View Post
It would be desirable if, some day, a One-Second-One-Second rule could be realized, that is, one second of simulated time per second of wall-clock time: real-time simulation.

I am just surprised that in some situations using ICCG directly is better than using the multigrid technique.

Yes, in small cases ICCG is better than multigrid.

Which preconditioner are you using with ICCG? And is there any way for you to export that matrix in text form? If so, I could try one of my multigrid routines to see whether multigrid can do it faster than what you are getting.

September 9, 2011, 04:13   #55
flavio_galeazzo (Flavio Galeazzo), Member, Karlsruhe, Germany
Quote:
Originally Posted by lakeat View Post
It would be desirable if, some day, a One-Second-One-Second rule could be realized, that is, one second of simulated time per second of wall-clock time: real-time simulation.

I am just surprised that in some situations using ICCG directly is better than using the multigrid technique.
I would like to share my latest experience with the scalability of the linear solvers in OpenFOAM. I used to run simulations with compressible solvers, using the PCG linear solver for pressure and PBiCG for the other variables. The scalability was very good up to 256 processors, and I could get one second of computational time per time step for a 14 million node grid.
Recently I moved to an incompressible solver, and in that case the GAMG linear solver was far superior to PCG for the pressure, while the other variables stayed with PBiCG. However, to my surprise, the scalability was very poor this time. I got good results up to 32 processors, with about 10 seconds of computational time per time step, but increasing the number of processors did not improve the computational time.
Reading the presentation by Dr. Jasak from lakeat's post, it became clear that the problem is actually the GAMG linear solver. This is very unfortunate, since the GAMG linear solver is indeed very helpful for the incompressible solver.

September 9, 2011, 08:53   #56
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
Quote:
Originally Posted by arjun View Post
Yes, in small cases ICCG is better than multigrid.

Which preconditioner are you using with ICCG? And is there any way for you to export that matrix in text form? If so, I could try one of my multigrid routines to see whether multigrid can do it faster than what you are getting.
Sorry, I was busy with some other problems and proposals. If you send me an email, I can send you the case.

September 9, 2011, 09:22   #57
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
@Flavio,

Thanks for sharing; I have some similar reports here.

-1- Good scalability is critical for us, since in our case we do hope for real-time analysis some day (with complex geometries, of course), but for now we are just trying to realize a One-Step-One-Second (OSOS) rule.
-2- I have a case with several million hex cells, an incompressible solver and a hybrid turbulence model. It showed very good scalability with PBiCG for the pressure, better than multigrid, tested on up to about 100 CPUs. But as I increased the mesh size beyond 10 million cells, the situation was no longer pleasant; items -3- and -4- below describe what I found and did.
-3- The solving process at the beginning, while the initial transient is being removed, seems to differ from the "normal" solving process (a big difference in the matrix?) and deserves attention. I need advice here. (My question: will potentialFoam really produce a better start?) I am not clear about what is happening with the matrix, but for sure you do NOT want to use PBiCG at the very beginning; it is disastrous and totally unacceptable (even giving floating point exceptions, FPE), no matter how you change relTol. A well-tuned multigrid method with a V-cycle worked better for me as a start (see the sketch after this list). (Sorry for my poor English; I hope you can understand what I am saying.)
-4- After the solution had run for a while (this is the tricky part), I tried to switch back to PBiCG, based on the experience from item -2- above, but this time to no avail: it would still run for some time and then crash again somewhere. I have no clue about this so far.
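
(To make items -3- and -4- concrete, here is a hedged sketch of the kind of two-stage pressure-solver setup being described: a multigrid solver while the initial transient is washed out, then a switch to a preconditioned conjugate-gradient solver, roughly the ICCG mentioned earlier in the thread. The values are illustrative only, not the actual settings used here, and whether running potentialFoam first gives a smoother initial field is exactly the open question in item -3-.)

Code:
// system/fvSolution, pressure entry (illustrative values only)
p
{
    // Stage 1: start-up, while the initial transient is removed.
    // A V-cycle multigrid was the more robust choice at this stage.
    solver                 GAMG;
    smoother               GaussSeidel;
    cacheAgglomeration     true;
    nCellsInCoarsestLevel  100;
    agglomerator           faceAreaPair;
    mergeLevels            1;
    tolerance              1e-06;
    relTol                 0.01;

    // Stage 2, after the transient has passed: switch to a CG-type solver
    // (e.g. PCG with a DIC preconditioner) and restart from the latest time.
    // solver           PCG;
    // preconditioner   DIC;
    // tolerance        1e-06;
    // relTol           0.01;
}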

I do not have time to dive into the mechanism behind this, so it would be great if someone could shed some light on the following:
1. WHY does PBiCG (for pressure) win over multigrid for some large cases as far as scalability is concerned?
2. What is the next move? What is the current state of the art for achieving better parallel scalability in the CFD world, apart from OpenFOAM (for solvers like those in OpenFOAM that work well for incompressible flows in complex geometries, not special-purpose solvers such as the ones arjun has written)?
3. Is anyone who has achieved better parallel efficiency willing to contribute more experience?

Thanks

September 9, 2011, 09:47   #58
flavio_galeazzo (Flavio Galeazzo), Member, Karlsruhe, Germany
Hi lakeat,

Your experience seems to be similar to mine. Unfortunately, I have no clue how to improve the scalability of my incompressible solver. One workaround is to use a compressible solver for the incompressible case; that worked well for me.

September 9, 2011, 11:43   #59
lakeat (Daniel WEI, 老魏), Senior Member, Beijing, China
Quote:
Originally Posted by flavio_galeazzo View Post
Hi lakeat,

Your experience seems to be similar to mine. Unfortunately, I have no clue how to improve the scalability of my incompressible solver. One workaround is to use a compressible solver for the incompressible case; that worked well for me.
What! Why, why did it work better?

September 12, 2011, 20:04   #60
arjun (Arjun), Senior Member, Nurenberg, Germany
Quote:
Originally Posted by lakeat View Post
Sorry, I was busy with some other problems and proposals. If you send me an email, I can send you the case.

I was also taking a break from CFD. Here is my email: arjun.yadav@yahoo.com
