Optimizing the solution of tridiagonal linear systems
Greetings,
I work on the analysis and development of parallel algorithms for banded linear systems and matrix equations. After browsing some CFD textbooks and papers I am stuck with the impression that CFD relies heavily on solvers of linear systems which are both tridiagonal and diagonally dominant by rows. However, with very little practical experience in this field I could easily be wrong. My questions: 1) Within your own code roughly who much time is spent solving tridiagonal linear systems? 2) The ADI method for solving say a parabolic PDE requires the solution of many independent tridiagonal systems which are typically solved sequentially on different cores. Do you ever split the solution of a tridiagonal linear system over multiple cores? I imagine that this would be relevant on domains which are particularly long such as pipes, canals and blood vessels. 3) Which tridiagonal solver do you use? Background: Simple, I need to justify my continued existence to a funding agency, so I need to present some good applications of my work. It is possible to improve on LAPACK's sequential solvers. Time can be saved by eliminating the logical check for division by zero and reorganising the memory access pattern. I can remove about 25% of the run time for LAPACK's routine DPTSV which applies to system which are tridiagonal and symmetric positive definite. Removal of the logical check can be justified mathematically. The same principle can be successfully applied to the Thomas algorithm for systems which are strictly diagonally dominant by rows, but not necessarily symmetric positive definite. Tridiagonal solves can be done in parallel using, say, Sameh's Spike algorithms, ScaLAPACK's implementations of block cyclic reduction, or Sun's algorithms (PDD and others). However, in each case the parallel efficiency is below 50%. This is not bad, but recently, Hokpunna's interface splitting algorithm has achieved efficiencies above 90% for systems generated by parabolic PDE and certain spline interpolation problems. I believe that something similar can be achieved in a more general setting. Thank you for your attention! Carl Christian  Carl Christian Kjelgaard Mikkelsen, Ph.D. Umeå University Department of Computing Science SE90187 Umeå Sweden 
I hope someone with broader experience than I answers your question.
There are different types of CFD solvers. Some of the parameters which define a CFD code are explicit and implicit, and structured and unstructured. An implicit structured code is most likely to use a tridiagonal solver. Other important parameters are how the code is differentiated, i.e. central differencing, flux splitting, and flux difference splitting. Then there is the order of the spatial differencing, second order, fourth order, etc. The type and order of the differencing will determine the tridiagonal, pentadiagonal, etc. will need to be used. And ADI methods will need a block solver while diagonalized methods will need a scaler solver. So there is a lot which goes into a solver. Also, in general, the memory bus is the bottle neck, and not the CPU. So, to optimize a CFD code, one needs to look at the whole. There are several types of parallelization, machine, CPU, and/or GPU. In general, the block/scaler diagonal solver does not need to be parallelized for machines and CPUs. However, for massively parallel hardware, such as GPUs, the solver does need to be parallelized. And, from what I understand, that is the tricky part. Answers to your questions (my solver is a implicit structured central difference solver): 1) Not really sure. 10%? A good chunk of the overall CPU time is actually idle while memory is being brought in. For my solver, (which goes in the i, j, k direction where the most paging is done in k direction), for a square block, the k direction takes 4x longer than the i direction. 2) If my solver is running on one machine the domain does not need to be split. Each of the various threads (one for each CPU core) is dedicated to one diagonal solver. So, if I have 8 threads I've got 8 diagonal solvers running. My code is designed for CPUs and not GPUs. 3) I've written my own scaler and block (2x2, 3x3, and 5x5) tri and penta diagonal solvers. For each solver there is a periodic and nonperiodic version. The reason I wrote my own is that I would like my code to be easily portable and I've tried to limit the number of 3rd party packages which go into it. The code is more of a commercial code than a research one. 
@Martin: Thank you very much for your detailed and informative reply!
Given that a banded solver is idle most of the time, is it standard practice to merge the computation of the nonzero coefficients and the components of the right hand side with the actual solution of the system? As opposed to first assembling the entire system and only then calling a solver. /Carl Christian. 
% Written by Amin ShariatKhah 2012  Shahrood University Of technology
%Feel Free to use it %This code solve the AX=b (When A is Tridiagonal ) function X=TDMAsolver(A,b) m=length(b); % m is the number of rows X=zeros(m,1); A(1,2)= A(1,2) ./ A(1,1); % Division by zero risk. b(1)= b(1) ./ A(1,1); % Division by zero would imply a singular matrix for i=2:m1 temp= A(i,i)  A(i,i1) .* A(i1,i); A(i,i+1)= A(i,i+1) ./ temp; b(i)= ( b(i)  A(i,i1) .* b(i1) ) ./ temp; end i=m; X(m)=(b(i)  A(i,i1) .* b(i1)) ./ (A(i,i)  A(i,i1) .* A(i1,i)); for i=m1:1:1 X(i)= A(i,i+1) .* X(i+1) + b(i); end end 
@cfd_user: I can appreciate your enthusiasm and willingness to help, but I am not looking for a MATLAB implementation of the Thomas algorithm. I am interested in the significance of serial and parallel banded solvers in CFD.

Quote:
loop (k) loop (j) loop (i) assemble i direction components of rhs loop (k) loop (i) loop (j) assemble j direction components of rhs loop (j) loop (i) loop (k) assemble k direction components of rhs loop (k) loop (j) { loop (i) assemble blocks/coefficients of lhs in i direction call scaler/block diagonal solver to solve for this (j,k) line } loop (k) loop (i) { loop (j) assemble blocks/coefficients of lhs in j direction call scaler/block diagonal solver to solve for this (i,k) line } loop (j) loop (i) { loop (k) assemble blocks/coefficients of lhs in k direction call scaler/block diagonal solver to solve for this (i,j) line } 
Ooops, my spaces were thrown away
This loop (k) loop (j) { loop (i) assemble blocks/coefficients of lhs in i direction call scaler/block diagonal solver to solve for this (j,k) line } should read this loop (k) { loop (j) { loop (i) { assemble blocks/coefficients of lhs in i direction } call scaler/block diagonal solver to solve for this (j,k) line } } Hopefully the rest is easy to figure out 
@Martin: Thank you very much. You were quite clear even without the spaces. Weaving is a good word to describe the ADI method.

BTW, I'm going to throw this out here in case it is in your area of interest.
As you know, two categories of approximate structured implicit solvers for compressible time marching schemes are ADI (DDADI) and DADI (D3ADI). ADI is the more general and DADI requires that one be able to pull out the eigenvalues. The ADI method solves for the flow variables and DADI solves for the eigenvalues. However, to get to DADI one needs to make approximations which degrade as the grid curves and viscosity enters the picture. The DADI is best for Euler problems on a cartesian mesh. What would be cool is a diagonal solver which fusses scaler and block solving methodologies along the line. For example, assume the solver is on the k line going away from the body. A penta diagonal solver is used on the entire k line from surface to outer boundary. Close to the body I would like to use a block diagonal solver but as I move away from the surface I might want to transition to a scaler diagonal solver. Of course there is a lot which goes into this since the overall time for a CFD code is related to time*iterations. Time savings on the diagonal solver is just a part of it. One also wants to maximize on the time step or CFL number to reduce the number of iterations. I personally haven't seen anything that does this. So either it is a dumb idea or a bit of research will be required. Maybe this idea also applies to other fields. Here is a link to give a some idea of the difference between DADI and ADI. http://www.hegedusaero.com/examples/...ifiedDADI.html Good Luck! 
I think it is faster to redistribute the tridiagonals among the processors and solve them by single processor version and then distribute the solution. At least this is how i do with block tridiagonal solvers. Off course this depend on all to all communication step and efficiency solely depend on it. In my case I devised the algorithm (sorry can not share the algo because office restriction etc) such that this is not a big issue. So this could be done.
Further rather than searching for an algorithm that has best scaling over 1000s of processors , one should look for the algorithm that is best for the number of processors he is going to use. Example of this is suppose that I have only 50 processors to work with , I know that BiCGStab has very good scaling compared to AMG when 1000s of processors involved. But since I am going to use 50 only processors I would go with AMG and not with BiCGStab algorithm. BiCGStab's very good scaling is no use to me in this situation. Scaling is not everything. 
@Martin: Thanks again. Unfortunately, I can not afford to make the necessary shift in my focus, at least not at this time. What I really need at this time is a list of application areas which are easily recognized as relying heavily on the use of banded solvers.
@Arjun: Thank you for your input. Redistributing the tridiagonal systems means moving data proportional to the volume of the domain, whereas solving the systems in parallel means moving data proportional to the area of the interface between the subdomains assigned to the individual CPU! I agree that a physicist must concentrate on the specific architecture available as it is the output which is essential to her. As a mathematician working in the field of computing my focus is somewhat different and I must consider the computers of today, both large and small. 
Dear Martin Hegedus,
I'm just curious. I usually use SIP (Stone's strongly implicit procedure). I have been told that ADI hasproblems with highly nonuniform grid eg. dx:dy:dz=30:15:1 or strong stretching factor. Is it true ? Do you see convergence degradation in those cases ? Arpiruk 
Hi CCKM,
In incompressible flows, the time for enforcing massconservation (projection or other classes) costs somewhere between 40%  70% of the runtime. This depends on how efficient the solver is and how small the conservation errors are needed. I only use SIP and never use ADI. In DNS of channel flow, massconservation error of oneper mille is is usually achieved by 7  25 iterations for Re(Tau) 180  950. It is increasing with Reynolds numbers. At these numbers, it is as fast as Direct FFT+Gaussian elimination (FFT in 2 directions and then GE in another direction). I think there is a real challenge here. On Cartesian grids, the Poisson equation is 7banded in 3D under secondorder discretization. When you are upgraded to Fourthorder, the number of bands become 19!!. This is an insanely high number. Consider that 7bands already cost you 50% of the computational time. Further, imagine that you use this discrete Poisson equation with implicit time Integration. I honestly dont know how to do it. You can look at Kampanis's work for finitedifference at Journal of Computational Physics 215 (2006) 589–613. Hokpunna's also got a similar equation for finitevolume. Journal of Computational Physics Volume 229 Issue 20, October, 2010. Arpiruk, 
Quote:
In the literature I usually see CFLs between 10 and 20 for codes similar to mine but then they lag all the boundary conditions and use tridiagonal block solvers by approximating the 4th order artificial dissipation in the LHS. When I try it this way my convergence is poor, CFLs between 5 and 15. I can get to 20 on nice grids. So, I'm not sure how much of a degradation I see. Also, haven't really compared to a direct solver method. I figure at some point physics catches up with one and one can only push the solution so hard. Granted, the method should be stable and therefore the CFL can go to infinity, but I guess the solution trajectory would not be optimum. On a different note, if I squash my cells in the field I do see a hit due to the local time stepping since the time step through the squashed cells is smaller relative to bigger cells, thus the solution has resistance moving through those areas. But, that is matrix solver independent. I just want to mention it since it could be clouding the discussion from others. It sounds like your problem is time accurate, so I'm not sure how much my case applies to you, except to say that if you use subiterations I guess it might. Hope the answer was clear enough. 
@Arpiruk: Thanks for the figures, thoughts and the references. In particular, solving Poisson's equation in a beam gives me a good example were a parallel solver is appropriate.
I do not suppose that it is this simple, but there are two problems with ADI for elliptic problems that I am aware of: a) You have to find good shift parameters in order to accelerate convergence and this is not trivial. b) The condition number of the linear systems is proportional to the CFL number, with a ratio of 30:15:1 for the spatial stepsize the condition numbers will have a ratio of about 900:225:1 and the rounding errors will spill over from the "bad" systems and pollute the rest of the calculation. But I suppose that the truncation error from discretizing the differential equations is typically much, much bigger than rounding errors? 
All times are GMT 4. The time now is 14:28. 