CFD Online Discussion Forums - Optimizing the solution of tridiagonal linear systems

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)

- Main CFD Forum (https://www.cfd-online.com/Forums/main/)

- - Optimizing the solution of tridiagonal linear systems (https://www.cfd-online.com/Forums/main/98222-optimizing-solution-tridiagonal-linear-systems.html)

CCKM	March 6, 2012 06:52

Optimizing the solution of tridiagonal linear systems

Greetings,

I work on the analysis and development of parallel algorithms for banded linear systems and matrix equations. After browsing some CFD textbooks and papers I am stuck with the impression that CFD relies heavily on solvers of linear systems which are both tridiagonal and diagonally dominant by rows. However, with very little practical experience in this field I could easily be wrong.

My questions:

1) Within your own code roughly who much time is spent solving tridiagonal linear systems?
2) The ADI method for solving say a parabolic PDE requires the solution of many independent tridiagonal systems which are typically solved sequentially on different cores. Do you ever split the solution of a tridiagonal linear system over multiple cores? I imagine that this would be relevant on domains which are particularly long such as pipes, canals and blood vessels.
3) Which tridiagonal solver do you use?

Background:

Simple, I need to justify my continued existence to a funding agency, so I need to present some good applications of my work.

It is possible to improve on LAPACK's sequential solvers. Time can be saved by eliminating the logical check for division by zero and reorganising the memory access pattern. I can remove about 25% of the run time for LAPACK's routine DPTSV which applies to system which are tridiagonal and symmetric positive definite. Removal of the logical check can be justified mathematically. The same principle can be successfully applied to the Thomas algorithm for systems which are strictly diagonally dominant by rows, but not necessarily symmetric positive definite.

Tridiagonal solves can be done in parallel using, say, Sameh's Spike algorithms, ScaLAPACK's implementations of block cyclic reduction, or Sun's algorithms (PDD and others). However, in each case the parallel efficiency is below 50%. This is not bad, but recently, Hokpunna's interface splitting algorithm has achieved efficiencies above 90% for systems generated by parabolic PDE and certain spline interpolation problems. I believe that something similar can be achieved in a more general setting.

Thank you for your attention!

Carl Christian

-------------------------------------------
Carl Christian Kjelgaard Mikkelsen, Ph.D.

Umeå University
Department of Computing Science
SE-90187 Umeå
Sweden

Martin Hegedus

March 6, 2012 11:31

I hope someone with broader experience than I answers your question.

There are different types of CFD solvers. Some of the parameters which define a CFD code are explicit and implicit, and structured and unstructured. An implicit structured code is most likely to use a tridiagonal solver. Other important parameters are how the code is differentiated, i.e. central differencing, flux splitting, and flux difference splitting. Then there is the order of the spatial differencing, second order, fourth order, etc. The type and order of the differencing will determine the tridiagonal, pentadiagonal, etc. will need to be used. And ADI methods will need a block solver while diagonalized methods will need a scaler solver. So there is a lot which goes into a solver. Also, in general, the memory bus is the bottle neck, and not the CPU. So, to optimize a CFD code, one needs to look at the whole.

There are several types of parallelization, machine, CPU, and/or GPU. In general, the block/scaler diagonal solver does not need to be parallelized for machines and CPUs. However, for massively parallel hardware, such as GPUs, the solver does need to be parallelized. And, from what I understand, that is the tricky part.

Answers to your questions (my solver is a implicit structured central difference solver):
1) Not really sure. 10%? A good chunk of the overall CPU time is actually idle while memory is being brought in. For my solver, (which goes in the i, j, k direction where the most paging is done in k direction), for a square block, the k direction takes 4x longer than the i direction.
2) If my solver is running on one machine the domain does not need to be split. Each of the various threads (one for each CPU core) is dedicated to one diagonal solver. So, if I have 8 threads I've got 8 diagonal solvers running. My code is designed for CPUs and not GPUs.
3) I've written my own scaler and block (2x2, 3x3, and 5x5) tri and penta diagonal solvers. For each solver there is a periodic and nonperiodic version. The reason I wrote my own is that I would like my code to be easily portable and I've tried to limit the number of 3rd party packages which go into it. The code is more of a commercial code than a research one.

CCKM	March 7, 2012 05:51

@Martin: Thank you very much for your detailed and informative reply!

Given that a banded solver is idle most of the time, is it standard
practice to merge the computation of the nonzero coefficients and
the components of the right hand side with the actual solution of
the system? As opposed to first assembling the entire system and
only then calling a solver.

/Carl Christian.

cdf_user

March 7, 2012 05:51

% Written by Amin ShariatKhah 2012 - Shahrood University Of technology
%Feel Free to use it
%This code solve the AX=b (When A is Tridiagonal )

function X=TDMAsolver(A,b)

m=length(b); % m is the number of rows
X=zeros(m,1);
A(1,2)= A(1,2) ./ A(1,1); % Division by zero risk.
b(1)= b(1) ./ A(1,1); % Division by zero would imply a singular matrix

for i=2:m-1
temp= A(i,i) - A(i,i-1) .* A(i-1,i);
A(i,i+1)= A(i,i+1) ./ temp;
b(i)= ( b(i) - A(i,i-1) .* b(i-1) ) ./ temp;
end

i=m;
X(m)=(b(i) - A(i,i-1) .* b(i-1)) ./ (A(i,i) - A(i,i-1) .* A(i-1,i));

for i=m-1:-1:1
X(i)= -A(i,i+1) .* X(i+1) + b(i);
end
end

CCKM	March 7, 2012 06:07

@cfd_user: I can appreciate your enthusiasm and willingness to help, but I am not looking for a MATLAB implementation of the Thomas algorithm. I am interested in the significance of serial and parallel banded solvers in CFD.

Martin Hegedus

March 7, 2012 10:12

Quote:

Originally Posted by CCKM (Post 348109)

The standard procedure for a structured implicit solver would be something like what is shown below (assumes ADI, DADI, DDADI, or D3ADI is being used). This is just an example. People tweak it for performance gains. But, in general, the diagonal solver is a subroutine which is separate from assembling the coefficients, (the matrix coefficients are passed into the diagonal solver subroutine). However, the entire system is not assembled and then the solver is called. A structured implicit solver works along i, j, and k lines. So a line is assembled and then that line's coefficients are passed to the solver. Then the code marches to the next line. I like to call it weaving. I assume structured solvers are a totally different thing, but that is not where my expertise is. Also, I guess, direct structured implicit solvers, such as SSOR, work differently. But that is not my area of expertise either.

loop (k)
loop (j)
loop (i)
assemble i direction components of rhs

loop (k)
loop (i)
loop (j)
assemble j direction components of rhs

loop (j)
loop (i)
loop (k)
assemble k direction components of rhs

loop (k)
loop (j) {
loop (i)
assemble blocks/coefficients of lhs in i direction
call scaler/block diagonal solver to solve for this (j,k) line
}

loop (k)
loop (i) {
loop (j)
assemble blocks/coefficients of lhs in j direction
call scaler/block diagonal solver to solve for this (i,k) line
}

loop (j)
loop (i) {
loop (k)
assemble blocks/coefficients of lhs in k direction
call scaler/block diagonal solver to solve for this (i,j) line
}

Martin Hegedus

March 7, 2012 10:21

Ooops, my spaces were thrown away

This

loop (k)
loop (j) {
loop (i)
assemble blocks/coefficients of lhs in i direction
call scaler/block diagonal solver to solve for this (j,k) line
}

should read this

loop (k) {
loop (j) {
loop (i) {
assemble blocks/coefficients of lhs in i direction }
call scaler/block diagonal solver to solve for this (j,k) line
} }

Hopefully the rest is easy to figure out

CCKM	March 7, 2012 11:11

@Martin: Thank you very much. You were quite clear even without the spaces. Weaving is a good word to describe the ADI method.

Martin Hegedus

March 7, 2012 13:25

BTW, I'm going to throw this out here in case it is in your area of interest.

As you know, two categories of approximate structured implicit solvers for compressible time marching schemes are ADI (DDADI) and DADI (D3ADI). ADI is the more general and DADI requires that one be able to pull out the eigenvalues. The ADI method solves for the flow variables and DADI solves for the eigenvalues. However, to get to DADI one needs to make approximations which degrade as the grid curves and viscosity enters the picture. The DADI is best for Euler problems on a cartesian mesh. What would be cool is a diagonal solver which fusses scaler and block solving methodologies along the line. For example, assume the solver is on the k line going away from the body. A penta diagonal solver is used on the entire k line from surface to outer boundary. Close to the body I would like to use a block diagonal solver but as I move away from the surface I might want to transition to a scaler diagonal solver.

Of course there is a lot which goes into this since the overall time for a CFD code is related to time*iterations. Time savings on the diagonal solver is just a part of it. One also wants to maximize on the time step or CFL number to reduce the number of iterations.

I personally haven't seen anything that does this. So either it is a dumb idea or a bit of research will be required. Maybe this idea also applies to other fields.

Here is a link to give a some idea of the difference between DADI and ADI. http://www.hegedusaero.com/examples/...ifiedDADI.html

Good Luck!

arjun

March 7, 2012 16:40

I think it is faster to redistribute the tridiagonals among the processors and solve them by single processor version and then distribute the solution. At least this is how i do with block tridiagonal solvers. Off course this depend on all to all communication step and efficiency solely depend on it. In my case I devised the algorithm (sorry can not share the algo because office restriction etc) such that this is not a big issue. So this could be done.

Further rather than searching for an algorithm that has best scaling over 1000s of processors , one should look for the algorithm that is best for the number of processors he is going to use.

Example of this is suppose that I have only 50 processors to work with , I know that BiCGStab has very good scaling compared to AMG when 1000s of processors involved. But since I am going to use 50 only processors I would go with AMG and not with BiCGStab algorithm. BiCGStab's very good scaling is no use to me in this situation.

Scaling is not everything.

CCKM	March 8, 2012 04:44

@Martin: Thanks again. Unfortunately, I can not afford to make the necessary shift in my focus, at least not at this time. What I really need at this time is a list of application areas which are easily recognized as relying heavily on the use of banded solvers.

@Arjun: Thank you for your input. Redistributing the tridiagonal systems means moving data proportional to the volume of the domain, whereas solving the systems in parallel means moving data proportional to the area of the interface between the subdomains assigned to the individual CPU! I agree that a physicist must concentrate on the specific architecture available as it is the output which is essential to her. As a mathematician working in the field of computing my focus is somewhat different and I must consider the computers of today, both large and small.

arphy

March 13, 2012 09:21

Dear Martin Hegedus,

I'm just curious. I usually use SIP (Stone's strongly implicit procedure). I have been told that ADI hasproblems with highly non-uniform grid eg. dx:dy:dz=30:15:1 or strong stretching factor. Is it true ?

Do you see convergence degradation in those cases ?

Arpiruk

arphy

March 13, 2012 09:49

Hi CCKM,

In incompressible flows, the time for enforcing mass-conservation (projection or other classes) costs somewhere between 40% - 70% of the runtime. This depends on how efficient the solver is and how small the conservation errors are needed.

I only use SIP and never use ADI. In DNS of channel flow, mass-conservation error of one-per mille is is usually achieved by 7 - 25 iterations for Re(Tau) 180 - 950. It is increasing with Reynolds numbers. At these numbers, it is as fast as Direct FFT+Gaussian elimination (FFT in 2 directions and then GE in another direction).

I think there is a real challenge here.
On Cartesian grids, the Poisson equation is 7-banded in 3D under second-order discretization. When you are upgraded to Fourth-order, the number of bands become 19!!. This is an insanely high number. Consider that 7-bands already cost you 50% of the computational time.

Further, imagine that you use this discrete Poisson equation with implicit time Integration. I honestly dont know how to do it.

You can look at Kampanis's work for finite-difference
at Journal of Computational Physics 215 (2006) 589–613.
Hokpunna's also got a similar equation for finite-volume.
Journal of Computational Physics Volume 229 Issue 20, October, 2010.

Arpiruk,

Martin Hegedus

March 13, 2012 11:01

Quote:

Originally Posted by arphy (Post 349181)

I think a lot depends on what people are doing. I'm solving 2nd order central difference with 2nd/4th order artificial dissipation, so I've got five bands in the i, j, and k directions. The grids are viscous next to the body so I've got squashed cells with somewhat strong stretching, 10 to 15 percent. My code is a time marching, steady state, compressible, RANS solver. To get the CFL cranked up I need to use a penta diagonal block solver and include boundary conditions in the matrix blocks. I can get CFL numbers between 40 and 150. My method is also Chimera where I lag the fringe boundary points. In general, this causes my CFL to max out around 40 since the lagged b.c.s cause unsteady flow. Without overlapping grids (lagged b.c.s), I'm usually at 80. Going above 80 on a general grid is rare, but not too uncommon.

In the literature I usually see CFLs between 10 and 20 for codes similar to mine but then they lag all the boundary conditions and use tridiagonal block solvers by approximating the 4th order artificial dissipation in the LHS. When I try it this way my convergence is poor, CFLs between 5 and 15. I can get to 20 on nice grids.

So, I'm not sure how much of a degradation I see. Also, haven't really compared to a direct solver method. I figure at some point physics catches up with one and one can only push the solution so hard. Granted, the method should be stable and therefore the CFL can go to infinity, but I guess the solution trajectory would not be optimum.

On a different note, if I squash my cells in the field I do see a hit due to the local time stepping since the time step through the squashed cells is smaller relative to bigger cells, thus the solution has resistance moving through those areas. But, that is matrix solver independent. I just want to mention it since it could be clouding the discussion from others.

It sounds like your problem is time accurate, so I'm not sure how much my case applies to you, except to say that if you use sub-iterations I guess it might.

Hope the answer was clear enough.

CCKM	March 14, 2012 07:40

@Arpiruk: Thanks for the figures, thoughts and the references. In particular, solving Poisson's equation in a beam gives me a good example were a parallel solver is appropriate.

I do not suppose that it is this simple, but there are two problems with ADI for elliptic problems that I am aware of:

a) You have to find good shift parameters in order to accelerate convergence and this is not trivial.

b) The condition number of the linear systems is proportional to the CFL number, with a ratio of 30:15:1 for the spatial stepsize the condition numbers will have a ratio of about 900:225:1 and the rounding errors will spill over from the "bad" systems and pollute the rest of the calculation.

But I suppose that the truncation error from discretizing the differential equations is typically much, much bigger than rounding errors?

All times are GMT -4. The time now is 06:03.