
[Sponsors] 
March 6, 2012, 07:52 
Optimizing the solution of tridiagonal linear systems

#1 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
Greetings,
I work on the analysis and development of parallel algorithms for banded linear systems and matrix equations. After browsing some CFD textbooks and papers I am stuck with the impression that CFD relies heavily on solvers of linear systems which are both tridiagonal and diagonally dominant by rows. However, with very little practical experience in this field I could easily be wrong. My questions: 1) Within your own code roughly who much time is spent solving tridiagonal linear systems? 2) The ADI method for solving say a parabolic PDE requires the solution of many independent tridiagonal systems which are typically solved sequentially on different cores. Do you ever split the solution of a tridiagonal linear system over multiple cores? I imagine that this would be relevant on domains which are particularly long such as pipes, canals and blood vessels. 3) Which tridiagonal solver do you use? Background: Simple, I need to justify my continued existence to a funding agency, so I need to present some good applications of my work. It is possible to improve on LAPACK's sequential solvers. Time can be saved by eliminating the logical check for division by zero and reorganising the memory access pattern. I can remove about 25% of the run time for LAPACK's routine DPTSV which applies to system which are tridiagonal and symmetric positive definite. Removal of the logical check can be justified mathematically. The same principle can be successfully applied to the Thomas algorithm for systems which are strictly diagonally dominant by rows, but not necessarily symmetric positive definite. Tridiagonal solves can be done in parallel using, say, Sameh's Spike algorithms, ScaLAPACK's implementations of block cyclic reduction, or Sun's algorithms (PDD and others). However, in each case the parallel efficiency is below 50%. This is not bad, but recently, Hokpunna's interface splitting algorithm has achieved efficiencies above 90% for systems generated by parabolic PDE and certain spline interpolation problems. I believe that something similar can be achieved in a more general setting. Thank you for your attention! Carl Christian  Carl Christian Kjelgaard Mikkelsen, Ph.D. Umeå University Department of Computing Science SE90187 Umeå Sweden 

March 6, 2012, 12:31 

#2 
Senior Member
Martin Hegedus
Join Date: Feb 2011
Posts: 479
Rep Power: 12 
I hope someone with broader experience than I answers your question.
There are different types of CFD solvers. Some of the parameters which define a CFD code are explicit and implicit, and structured and unstructured. An implicit structured code is most likely to use a tridiagonal solver. Other important parameters are how the code is differentiated, i.e. central differencing, flux splitting, and flux difference splitting. Then there is the order of the spatial differencing, second order, fourth order, etc. The type and order of the differencing will determine the tridiagonal, pentadiagonal, etc. will need to be used. And ADI methods will need a block solver while diagonalized methods will need a scaler solver. So there is a lot which goes into a solver. Also, in general, the memory bus is the bottle neck, and not the CPU. So, to optimize a CFD code, one needs to look at the whole. There are several types of parallelization, machine, CPU, and/or GPU. In general, the block/scaler diagonal solver does not need to be parallelized for machines and CPUs. However, for massively parallel hardware, such as GPUs, the solver does need to be parallelized. And, from what I understand, that is the tricky part. Answers to your questions (my solver is a implicit structured central difference solver): 1) Not really sure. 10%? A good chunk of the overall CPU time is actually idle while memory is being brought in. For my solver, (which goes in the i, j, k direction where the most paging is done in k direction), for a square block, the k direction takes 4x longer than the i direction. 2) If my solver is running on one machine the domain does not need to be split. Each of the various threads (one for each CPU core) is dedicated to one diagonal solver. So, if I have 8 threads I've got 8 diagonal solvers running. My code is designed for CPUs and not GPUs. 3) I've written my own scaler and block (2x2, 3x3, and 5x5) tri and penta diagonal solvers. For each solver there is a periodic and nonperiodic version. The reason I wrote my own is that I would like my code to be easily portable and I've tried to limit the number of 3rd party packages which go into it. The code is more of a commercial code than a research one. 

March 7, 2012, 06:51 

#3 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
@Martin: Thank you very much for your detailed and informative reply!
Given that a banded solver is idle most of the time, is it standard practice to merge the computation of the nonzero coefficients and the components of the right hand side with the actual solution of the system? As opposed to first assembling the entire system and only then calling a solver. /Carl Christian. 

March 7, 2012, 06:51 

#4 
Member
Join Date: Mar 2011
Posts: 50
Rep Power: 7 
% Written by Amin ShariatKhah 2012  Shahrood University Of technology
%Feel Free to use it %This code solve the AX=b (When A is Tridiagonal ) function X=TDMAsolver(A,b) m=length(b); % m is the number of rows X=zeros(m,1); A(1,2)= A(1,2) ./ A(1,1); % Division by zero risk. b(1)= b(1) ./ A(1,1); % Division by zero would imply a singular matrix for i=2:m1 temp= A(i,i)  A(i,i1) .* A(i1,i); A(i,i+1)= A(i,i+1) ./ temp; b(i)= ( b(i)  A(i,i1) .* b(i1) ) ./ temp; end i=m; X(m)=(b(i)  A(i,i1) .* b(i1)) ./ (A(i,i)  A(i,i1) .* A(i1,i)); for i=m1:1:1 X(i)= A(i,i+1) .* X(i+1) + b(i); end end 

March 7, 2012, 07:07 

#5 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
@cfd_user: I can appreciate your enthusiasm and willingness to help, but I am not looking for a MATLAB implementation of the Thomas algorithm. I am interested in the significance of serial and parallel banded solvers in CFD.


March 7, 2012, 11:12 

#6  
Senior Member
Martin Hegedus
Join Date: Feb 2011
Posts: 479
Rep Power: 12 
Quote:
loop (k) loop (j) loop (i) assemble i direction components of rhs loop (k) loop (i) loop (j) assemble j direction components of rhs loop (j) loop (i) loop (k) assemble k direction components of rhs loop (k) loop (j) { loop (i) assemble blocks/coefficients of lhs in i direction call scaler/block diagonal solver to solve for this (j,k) line } loop (k) loop (i) { loop (j) assemble blocks/coefficients of lhs in j direction call scaler/block diagonal solver to solve for this (i,k) line } loop (j) loop (i) { loop (k) assemble blocks/coefficients of lhs in k direction call scaler/block diagonal solver to solve for this (i,j) line } 

March 7, 2012, 11:21 

#7 
Senior Member
Martin Hegedus
Join Date: Feb 2011
Posts: 479
Rep Power: 12 
Ooops, my spaces were thrown away
This loop (k) loop (j) { loop (i) assemble blocks/coefficients of lhs in i direction call scaler/block diagonal solver to solve for this (j,k) line } should read this loop (k) { loop (j) { loop (i) { assemble blocks/coefficients of lhs in i direction } call scaler/block diagonal solver to solve for this (j,k) line } } Hopefully the rest is easy to figure out 

March 7, 2012, 12:11 

#8 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
@Martin: Thank you very much. You were quite clear even without the spaces. Weaving is a good word to describe the ADI method.


March 7, 2012, 14:25 

#9 
Senior Member
Martin Hegedus
Join Date: Feb 2011
Posts: 479
Rep Power: 12 
BTW, I'm going to throw this out here in case it is in your area of interest.
As you know, two categories of approximate structured implicit solvers for compressible time marching schemes are ADI (DDADI) and DADI (D3ADI). ADI is the more general and DADI requires that one be able to pull out the eigenvalues. The ADI method solves for the flow variables and DADI solves for the eigenvalues. However, to get to DADI one needs to make approximations which degrade as the grid curves and viscosity enters the picture. The DADI is best for Euler problems on a cartesian mesh. What would be cool is a diagonal solver which fusses scaler and block solving methodologies along the line. For example, assume the solver is on the k line going away from the body. A penta diagonal solver is used on the entire k line from surface to outer boundary. Close to the body I would like to use a block diagonal solver but as I move away from the surface I might want to transition to a scaler diagonal solver. Of course there is a lot which goes into this since the overall time for a CFD code is related to time*iterations. Time savings on the diagonal solver is just a part of it. One also wants to maximize on the time step or CFL number to reduce the number of iterations. I personally haven't seen anything that does this. So either it is a dumb idea or a bit of research will be required. Maybe this idea also applies to other fields. Here is a link to give a some idea of the difference between DADI and ADI. http://www.hegedusaero.com/examples/...ifiedDADI.html Good Luck! 

March 7, 2012, 17:40 

#10 
Senior Member
Arjun
Join Date: Mar 2009
Location: Nurenberg, Germany
Posts: 605
Rep Power: 15 
I think it is faster to redistribute the tridiagonals among the processors and solve them by single processor version and then distribute the solution. At least this is how i do with block tridiagonal solvers. Off course this depend on all to all communication step and efficiency solely depend on it. In my case I devised the algorithm (sorry can not share the algo because office restriction etc) such that this is not a big issue. So this could be done.
Further rather than searching for an algorithm that has best scaling over 1000s of processors , one should look for the algorithm that is best for the number of processors he is going to use. Example of this is suppose that I have only 50 processors to work with , I know that BiCGStab has very good scaling compared to AMG when 1000s of processors involved. But since I am going to use 50 only processors I would go with AMG and not with BiCGStab algorithm. BiCGStab's very good scaling is no use to me in this situation. Scaling is not everything. 

March 8, 2012, 05:44 

#11 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
@Martin: Thanks again. Unfortunately, I can not afford to make the necessary shift in my focus, at least not at this time. What I really need at this time is a list of application areas which are easily recognized as relying heavily on the use of banded solvers.
@Arjun: Thank you for your input. Redistributing the tridiagonal systems means moving data proportional to the volume of the domain, whereas solving the systems in parallel means moving data proportional to the area of the interface between the subdomains assigned to the individual CPU! I agree that a physicist must concentrate on the specific architecture available as it is the output which is essential to her. As a mathematician working in the field of computing my focus is somewhat different and I must consider the computers of today, both large and small. 

March 13, 2012, 10:21 

#12 
New Member
Arpiruk Hokpunna
Join Date: Mar 2011
Posts: 7
Rep Power: 8 
Dear Martin Hegedus,
I'm just curious. I usually use SIP (Stone's strongly implicit procedure). I have been told that ADI hasproblems with highly nonuniform grid eg. dx:dy:dz=30:15:1 or strong stretching factor. Is it true ? Do you see convergence degradation in those cases ? Arpiruk Last edited by arphy; March 13, 2012 at 10:51. 

March 13, 2012, 10:49 

#13 
New Member
Arpiruk Hokpunna
Join Date: Mar 2011
Posts: 7
Rep Power: 8 
Hi CCKM,
In incompressible flows, the time for enforcing massconservation (projection or other classes) costs somewhere between 40%  70% of the runtime. This depends on how efficient the solver is and how small the conservation errors are needed. I only use SIP and never use ADI. In DNS of channel flow, massconservation error of oneper mille is is usually achieved by 7  25 iterations for Re(Tau) 180  950. It is increasing with Reynolds numbers. At these numbers, it is as fast as Direct FFT+Gaussian elimination (FFT in 2 directions and then GE in another direction). I think there is a real challenge here. On Cartesian grids, the Poisson equation is 7banded in 3D under secondorder discretization. When you are upgraded to Fourthorder, the number of bands become 19!!. This is an insanely high number. Consider that 7bands already cost you 50% of the computational time. Further, imagine that you use this discrete Poisson equation with implicit time Integration. I honestly dont know how to do it. You can look at Kampanis's work for finitedifference at Journal of Computational Physics 215 (2006) 589–613. Hokpunna's also got a similar equation for finitevolume. Journal of Computational Physics Volume 229 Issue 20, October, 2010. Arpiruk, 

March 13, 2012, 12:01 

#14  
Senior Member
Martin Hegedus
Join Date: Feb 2011
Posts: 479
Rep Power: 12 
Quote:
In the literature I usually see CFLs between 10 and 20 for codes similar to mine but then they lag all the boundary conditions and use tridiagonal block solvers by approximating the 4th order artificial dissipation in the LHS. When I try it this way my convergence is poor, CFLs between 5 and 15. I can get to 20 on nice grids. So, I'm not sure how much of a degradation I see. Also, haven't really compared to a direct solver method. I figure at some point physics catches up with one and one can only push the solution so hard. Granted, the method should be stable and therefore the CFL can go to infinity, but I guess the solution trajectory would not be optimum. On a different note, if I squash my cells in the field I do see a hit due to the local time stepping since the time step through the squashed cells is smaller relative to bigger cells, thus the solution has resistance moving through those areas. But, that is matrix solver independent. I just want to mention it since it could be clouding the discussion from others. It sounds like your problem is time accurate, so I'm not sure how much my case applies to you, except to say that if you use subiterations I guess it might. Hope the answer was clear enough. 

March 14, 2012, 08:40 

#15 
New Member
Carl Christian Kjelgaard Mikkelsen
Join Date: Mar 2012
Posts: 7
Rep Power: 7 
@Arpiruk: Thanks for the figures, thoughts and the references. In particular, solving Poisson's equation in a beam gives me a good example were a parallel solver is appropriate.
I do not suppose that it is this simple, but there are two problems with ADI for elliptic problems that I am aware of: a) You have to find good shift parameters in order to accelerate convergence and this is not trivial. b) The condition number of the linear systems is proportional to the CFL number, with a ratio of 30:15:1 for the spatial stepsize the condition numbers will have a ratio of about 900:225:1 and the rounding errors will spill over from the "bad" systems and pollute the rest of the calculation. But I suppose that the truncation error from discretizing the differential equations is typically much, much bigger than rounding errors? 

Tags 
diagonally, dominant, parallel, sequential, tridiagonal 
Thread Tools  
Display Modes  


Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
grid dependancy  gueynard a.  Main CFD Forum  19  June 27, 2014 21:22 
please help me which solver is better for my application  Ger_US  OpenFOAM  8  January 8, 2013 13:29 
Doubt on Implicit Methods  analyse In India  Main CFD Forum  10  March 9, 2007 04:01 
CFL Condition  Matt Umbel  Main CFD Forum  14  January 12, 2001 15:34 
Wall functions  Abhijit Tilak  Main CFD Forum  6  February 5, 1999 02:16 