basic rules for partition domain for parallel run
Hi,
What will be the general rule of thumb for partitioning the computational domain for a parallel run, specific to flow with free surfaces (VOF method)? Thanks! phsieh2005 
Re: basic rules for partition domain for parallel
Well... I would say that the "general rule of thumb" for unstructured grids is by using a graph partitioning software like Metis ( http://glaros.dtc.umn.edu/gkhome/views/metis ) or others like: Jostle, Scotch, Chaco, Party, etc... but, in my opinion, Metis is the easiest to learn and use.
For Cartesian grids you don't need a graph partitioning software since you know exactly how the grid is formed, thus, you can define points and cells ranges according to the number of processors desired. Regards Renato. 
Re: basic rules for partition domain for parallel
Just a word of warning here, some VOF methods won't like it if a partition boundary is too close to the free surface. This happens more easily than you might think, because one would typically cluster the mesh quite close to the free surface. So, if gravity is in the Zdirection, partitioning by principal axis is OK, provided you suppress any split perpendicular to the zaxis. The problem with Metis is that you may be unlucky, and a split may coincide with the free surface. So I just tend to split perpendicular to the xaxis.

Re: basic rules for partition domain for parallel
adding to previous comments:
parallelizing free surface code depend on your modeling type: 1. in ther first kind, twophase approach that NS. Eq. are solved on entire domain with variable density method parallelizing is simple same as other methods, and needs static load balancing. 2. In the second approach, one phas NS. Eq. only solved in liquid phase and gas is assumed as ideal gas or in not considered. In this manner spatial domain is change continuously and you need dynamic load balancing, it is more difficult and tricky for acheiving good performance. 3. Also if you use adaptive refinement in any approach dynamic load balancing is essential. for any method a good load balancing tool is METIS library, it is very public and easy to use. 
Re: basic rules for partition domain for parallel
Hi,
Thank all of you for the feedback. I will study METIS next. I am using OpenFOAM, so, it is unstructured mesh and uses VOF to track the free surface (solves both liquid and air in one NS eq). No elment adaptivity. One thing I found is that, several times, solution diverged during the parallel run. But, a serial run was able to continue from where parallel run diverged. Also, the number of iterations is higher at each time step in parallel run than in serial. I guess that this is reasonable. Most of the cases I have, the free surfaces move everywhere inside the domain (like the dam break case or tank sloshing case). phsieh2005 
Re: basic rules for partition domain for parallel
Hi phsieh2005,
It seems there's something weird in your parallel algorithm. In fact, a parallel program must produce the same results when compared to its serial version. It's true that little discrepancies on the number of iterations are admissible but if you're facing convergence problems only when running in parallel, your algorithm has, probably, something wrong. In fact, we could consider a serial program as a parallel program running in only one processor ;o) Cheers Renato. 
Re: basic rules for partition domain for parallel
as you deal with free surface flow with twophase approach, your main convergence problem is related to solution of varable coefficient Poisson solver, especially when density ratio is large, as you said problem is related to parallel run it may has several reasons:
1. error in algorithm 2. problem in communication 3. your mistake to implementation note that difference in convergence rate is related to type of linear solver: if domain decomposition are based on shure or shuartze method undoubtly we have difference but when we have direct parallelizing CG or AMG solver only convergence rate difference must be raleted to truncation error that it must be small (you can test this with switching between single and double precision arithmatic) I propuse that run parallel code on single processor, asign more than one processor on single cpu, if problem is remained, it is not related to case 2. also change type of solver: CG, Bicgstab and AMG and study problem. do this for various grid resolution. the best way is sending your problem in OpenFOAM form and ask guide from its users. 
Re: basic rules for partition domain for parallel
>In fact, we could consider a serial program as a parallel program running in only one processor ;o)
That's true only in special cases, but certainly not in general (not even with single phase flow). It should be obvious that the computation of flow over a single continuous domain is not equivalent to the computation on separated subdomains. In the former case, you can use a fully implicit approach that strongly connects all flow regions (at least with a structured grid), while the latter method introduces artificial numerical boundaries. Both methods will usually not yield the same convergence rate, because blocktoblock information is usually not implemented on the lowest iteration loop (because the amount of communication would kill any benefit of parallelization!). This is less important on pointimplicit methods on unstructured grids, but can still have a noticable effect. Furthermore, on any type of grid, multigrid is often applied only within each subdomain, and therefore becomes less effective than over the whole domain. For these and multiple other reasons it's perfectly possible that you see a decline in convergence (and possibly even stability) as you move from single domain to domain decomposition. That doesn't prove that you code is not working. 
Re: basic rules for partition domain for parallel
so... I should consider myself a luck guy since I don't have big differences (4 iterations at most) between the parallel and serial version of my program (which is exactly the same code). In my case, I solve incompressible fluid flow and transport equations in an edgebased Finite Element Method. As I have the flow and the transport solvers I already ran free surface problems with an interface capturing approach (VOF or LS, it'll depend on the marking function chosen).
> It should be obvious that the computation of flow over a single continuous domain is not equivalent to the computation on separated subdomains < Why not?! Partitioned or not the global domain will be the same. The programmer only need to know where, when and what synchronize to get equivalent results. I'm saying that because I implemented a solver from scratch testing and comparing each operation in each algorithm of my program. It's true that some algorithms and methods are more well suited to message passing parallelism than others (some are practically infeasible) but, in general, the programmer should look for equivalent results in the serial and parallel version of the program. > because blocktoblock information is usually not implemented < I don't know about what kind of method you based this statement but I would say that it's the more usual way to do distributed parallelism  at least for those that are learning how a MPI program works. Of course that the amount of messages will be too large, but it's the easiest way to implement (doing communication in the inner loop) and you still get speedups for a modest number of processors. In a more formal approach you should replicate elements lying on the communication interface (ghost elements) in order to avoid the communication in hot loops  in iterative solvers, probably, the matrixvector product loop will be the more computationally expensive part of the program. These two approaches should give you equivalent results in your linear solver operations  matrixvector, vector norms, etc...  at least in my case, it always happen (thanks God!) This subject doesn't have a closed answer since we don't have a general way to implement distributed parallelism. It'll depend on several aspects like: method, type of mesh, algorithms, solvers, etc... There will be always a different algorithm doing different things to reach the same result. Regards Renato. 
Re: basic rules for partition domain for parallel
let's think of a 1D example for simplicity but without loss of generality. Let's say you have 10 grid nodes and you solve your differential equations with some finite difference method, using an implicit approach (e.g. thomas algorithm for 1D, central scheme). even with the direct method you'll need to perform iterations if your equations are nonlinear. you could either do all 10 poits in one shot per iteration, or you could do node 15 and nodes 610 in parallel. do you really think you'll get the same convergence? for the parallel computation you'll have to implement an extra step to connect the blocks. how do you do it? if you implement the communication within the thomas algorithm, you may get a very good convergence, but your parallel speedup is ridiculously poor! instead, you would run the thomas algorithm separately on both partitions, and then communicate across the interface.
that's what I mean by "communication is not in the smallest loop". does your code communicate across blocks after every change in every cell in every equation? I don't think so. maybe thinking of the 1D example, you will also understand how multigrid can be implemented far more efficiently on the whole domain. low frequency errors are more efficiently reduced if you take all 10 points, instead of using multigrid on the two separate domains. however, implementing multigrid across the interface again would kill parallel performance. of course, if I do not make use of any such acceleration techniques (or if they work poorly to begin with) domain decomposition will not negatively affect performance. if I were to use point Gauss Seidel in my 1D example, the interface would not be a problem, but point Gauss Seidel would not really be the optimal approach to begin with. the most efficient methods in CFD take advantage of the (partially) elliptic nature of the equations, for example with implicit solvers that strongly couple the whole flow field, multigrid across the whole domain, implicit residual smoothing, and so on... unless these methods are implemented poorly or not used at all, domain decomposition will most definitely have a noticeable effect. 
Re: basic rules for partition domain for parallel
Sorry, I don't know the method you cited but I could comment something about your example:
> do you really think you'll get the same convergence? < Not exactly the same convergence because in floating point arithmetic even the order of calculations has some influence in the final result, but the convergence would be almost the same. Of course, assuming that your computation is treating the interface data accordingly, as should be. > for the parallel computation you'll have to implement an extra step to connect the blocks. how do you do it? < Right! In this case it's accomplished by a MPI_ALLREDUCE for the interface node(s). In 1D with 2 partitions, it would be 1 node (one float or double per message per iteration). > but your parallel speedup is ridiculously poor! < What is ridiculously poor for you?! I have 2 methods implemented. The naivest method do a communication at each matrixvector product within GMRES> Which would kill the parallel performance as you said. The speedup you can check yourself http://www.nacad.ufrj.br/~rnelias/papers/CIL0423.pdf. Of course that this method is not good for massively parallel computations. For such cases I have the ghost elements implemented in order to avoid the interface communication, in matrixvector products, and the algorithm runs quite well (keeping the convergence, results and stability). > the most efficient methods in CFD take advantage of the (partially) elliptic nature of the equations, for example with implicit solvers that strongly couple the whole flow field ... < You destroyed the research plans of several guys out there ;o) If it were so simple, we should stop our studies, go home and read a newspaper looking for a job. I wouldn't like to particularize such discussion to one or another method. What I'm trying to say is that discrete methods will generally give us a linear system, this linear system can be solved in message passing parallelism by some kind of partitioning strategy. In **infinite arythmetic precision**, partitioned or not the result MUST be exactly the same (again... assuming that the computations are being done accordingly). Unfortunately, our computers only have a finite precision, which SHOULD lead us only to little differences in our results. Note: I'm not talking about performance, method or algorithm. Regards Renato. 
Re: basic rules for partition domain for parallel
to: phsieh2005
as an alternative way i propose to run another problem with parallel code, e.g. simple poisson eq. and check previous problem. I am interesting to knowe reason of this problem, please send final result in this form (is problem related to OpenFOAM kernel!!!) thanks. 
Re: basic rules for partition domain for parallel
never mind, you just don't get the point... as this wasn't the original question, just let it be... just don't make assertive diagnoses for others based on just your own code... the fact that phsieh2005 sees different convergence behavior for single domain versus domain decomposition does not prove or even indicate that there is something wrong with his method.

Re: basic rules for partition domain for parallel
Ok Mani... but, one "basic rule" in message passing parallelism (as the phsieh2005's subject suggests), for me, is that, at least, the parallel solver must be so stable as the serial version is, giving equivalent results, number of iterations and convergence. Otherwise my parallel solver would be good for nothing... (even showing wonderful timing performance).
ps: At least the software manual (OpenFoam in this case) should emphasize something about this weird convergence behavior in its parallel solver. 
Re: basic rules for partition domain for parallel
it wouldn't be so "weird" if you had understood what I tried to explain. maybe you'll actually find something about it in the manual. I haven't checked.
just for curiosity: what method are you using? unstructured? point gauss seidel? are you applying multigrid? if so, on the whole domain or just on the subdomains? if not, why not? 
Re: basic rules for partition domain for parallel
Renato and mani: i follow some threads, i think it is useless disscution,
you are partially right, because you talk about separate parallel algorithms, Renato mean: direct parallelizm of iterative solver (or explicit method) e.g. CG or ... this is equivalatn with parallel implementation of sparce matrix operations that if we neglect effect of floating point errors, result of parallel and serial are the same (such as number of iterations). Mani mean parallel implementation based on mathematical domain decomposition, e.g. schwartz method solution of problem on each domain (with overlap or witouth overlap) based on pde and cupling between subdomains with Numann or/and dirichlet bc that leads to iterative method until bc converge. note that if we use one subdomain iteration is not needed. In this manner convergence needs satisfaction of some criteria. As an exellent Ref. read Y. Saad book, in this book these methods described in details and any doubt is removed. Interesting: althouth we expect that parallel schwartz type methods leads to more iteration and performance leak but sometimes it is in reverse and they converge in lower number of iterations and leasd to this conclusion that we use tham in sequential calculation, as a good Ref. see presentation of D. B. Kothe, manager of Telluride project in Los Alamos: "Computational Manufacturing: Toward Simulating Metal Casting and Welding Processes", pages 7684. (if you don't find it give your contact i will send it to you). 
Re: basic rules for partition domain for parallel
You're right rt. Our arguments have been based on things completely different and we'll never come to any conclusion. Anyway, there are methods and methods and different ways to do the same thing.
Maybe, a short answer to the original question, made by phsieh2005, would be: > What will be the general rule of thumb for partitioning the computational domain < for unstructured grids any graph partitioning software will solve the problem! Metis, e.g. For the convergence problems: go through the OpenFoam user guide (or theory guide) and try to understand what is implemented there (and, please, share your conclusions here) []'s Renato. 
Re: basic rules for partition domain for parallel
eureka! that's exactly my point: what I was saying all along is that you cannot present limited experience on just one method as "general" truth. I was trying to give counter examples, somewhat unsuccessfully, but I think we finally agreed on the above notice.
I have to disagree that it is useless to discuss, though. It's a digression from the original question of this thread but still an interesting topic. I did take a look at your paper now, Renato, and it made some things a little clearer. The speed up results look good, but you didn't go beyond a dozen CPUs on a relatively large problem. Even with the maximum number of CPUs you tried, each subdomain has still a pretty large number of elements. In other words, the amount of time spent for serial processing of each subdomain is still far larger than the communication penalty. This is especially true for a computing intensive algorithm like FEM using GMRES. Speed up is primarily determined by the ratio of communication time versus solution time. Speed up can be excellent, if the communication is very fast, but ironically you can also get excellent speed up simply by spending a lot of time on the solution in each subdomain. More important is the question how far you can go with your domain decomposition before the speed up drops. Can the algorithm handle subdomains of a few cells, a few dozen cells, or does it break down already on a few hundred cells per subdomain? Every algorithm has its limit, and as you take yours to the limit you may realize that you need to rethink your strategy. Then convergence may become a more important issue. All of what I said in previous posts gains much more significance as the subdomains become smaller. Maybe you just haven't gone there, yet. 
Re: basic rules for partition domain for parallel
hi mani, note that i don't mean that this disscusion is not valuable but i think that this don't leads to conclusion.
in previous thread you describe somethink related to scalability (when increasing number of processors, or equivalently when number of nodes/element/working unit that are asigned for each processor are decreased algorithm don't miss its performance), this is very general and is function of algorithm, network type, behavior of solution and ... generally performance is decreased with this but usually we have anihilating factor that is reducing cache miss that is decreased with increasing data locality and decreasing size of data (some literature report more than 5 times speedup due to reducing cache miss), it is also hardware dependent and cache aware algorithm is also general issue. do you think that disscussion on such general categoriws leads to conclusion? also i am interested to knowe that what is details of your experience in this category. do you think that disscusion about this leads to conclusion 
Re: basic rules for partition domain for parallel
yes, that's right, you'll get the cache benefit as your subdomains are reduced just far enough to fit into cache. no question about that. that will lead to a local spike in your speedup vs. number of CPUs chart. however, depending on your cache size, you will still have some room for further reduction of block size, and that's when speedup goes down. the complete speedup vs. # CPUs chart can actually look quite bumpy, with several ups and downs, depending on your algorithm and architecture.
no, I didn't expect any definite conclusion. roads are for journeys, not for destinations. 
All times are GMT 4. The time now is 14:20. 