Fastest parallel solver for tridiagonal system
Hello everyone,
I have been searching for parallel algorithm for compact scheme, i.e. Lele1993 (J.Comp.Phy) and Kobayashi1999 (J.Comp.Phy). This means I have to solve Ax=Bf where x is the desired value(interpolation/differentiation/secondorder differentiation). The matrix A i have to deal with is a tridiagonal matrix. I this A are always positive definite on ordinary smooth grid.(still don't know how to prove it though) I did some research and believe that finegrained parallelism such as Recursive doubling (Stone1975 TOMS) and Cyclic reduction(Hockney1965 J.ACM ) are not suited for current supercomputers which are now very large grain ( A cluster with multicore node ). Family of pipelining algorithm also requires too much communications. It seems that the reduced parallel diagonal dominant of SUN (http://mack.ittc.ku.edu/sun95application.html) are the most efficient for this problem. The complexity is 5n/p+4J (p=#of proc, J = truncated bandwidth) which almost matches the best serial code ( Thomas algo. with prefactored LU ~ 5n3) . Is this algorithm already the best (at the moment ?) Are there any algorithm more efficient than this one ? Are there any particular problem in this algorithm that I should be aware of ? Any suggestions are appreciated. Regards, Arpiruk 
Re: Fastest parallel solver for tridiagonal system
The best way to solve this is not to make the tridiagonal solver parallel but to perform the tridiagonal solves simultaneously; i.e. if you have a 2d equation in x & y and you want to calculate u_x at each of the y values then split the grid into strips which can be solved at the same time using the usual single processor tridiagonal solver.

Re: Fastest parallel solver for tridiagonal system
I believe one critical step that was left out in the above algorithm is the need to transpose the decomposition. I.e., you can get u_x with parallelization in y if you have all of x for chunks of y on each processor. But then what happens when you need to get u_y? You need to transpose the decomposition so that each processor has all of y for chunks of x. A few algorithms are out there that perform this transpose efficiently in parallel, such as the block transpose or the perfect shuffle.

Re: Fastest parallel solver for tridiagonal system
>The best way to solve this is not to make the tridiagonal
:solver parallel but to perform the tridiagonal solves :simultaneously; i.e. if you have a 2d equation in x & y and :you want to calculate u_x at each of the y values then split >the grid into strips which can be solved at the same time :using the usual single processor tridiagonal solver. That is a good idea. I herd that idea once but I dont know the reference to the literature. I think it's called method of transpose or something like that. However it still require a tremedous amount of data transfer. Let us consider the following model problem. Suppose we are to solve a 2D convection problem u u_x + v u_y = 0 u v_x + v v_y = 0 ( f_i = partial derivative of f along i) with NxNy = 10^4x10^4 using 4 grids on 4 processors. ______________________     1  2  _____________________    y  3  4   _____________________ __ x Each of the processor only know the local data its is sotring. The compact scheme will couple the whole domain in to one equation. For example df/dx. Therefore to compute u_x, we can divide data in y direction into 4 strips. ( i.e. 3 to 31y and 3 2y, 1 to 13y and 14y) Then 41y can be exchanged with 32y. The p3 will have the whole data to compute u_x in the first strip(1y) and p4 will solve for 2y. The latency scales with number of processor(mpi_gather ) This method allow latency hiding but it is still tightly coupled. What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster? 
Re: Fastest parallel solver for tridiagonal system
"What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster?"
Never tried it on a conventional ethernet cluster  it works quite well on a supercomputer though! A similar trick is also used on vector CPU's where the tridiagonal part does not vectorize well  however with a little memory overhead the outer loops (which do vectorize) can be moved inside the tridiagonal solver's loop. It may be cheaper to use a Conjugate gradient method to invert your system  the matrix multiplies only require a moderate amount of data traffic and this may compensate for the need to iterate (a Chebyshev iterative method may be even better since it avoids innerproducts/global sums). 
Re: Fastest parallel solver for tridiagonal system
Thank you very much. I will look for algorithms you mentioned.

Re: Fastest parallel solver for tridiagonal system
To solve a tridiagonal system you need a forward sweep along the line from one boundary to the other followed by a backsubstitution. Periodic boundary conditions need a slight modification but follow the same procedure.
Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary. Also the latency of passing only one or two words of data at the end of each line would really hurt efficiency on most hardware today (but was OK on transputers in the 80s) and so normal implementations would collect a few line ending before passing the data onto the neighbour. The TDMA algorithm is literally about 10 lines long and so can be easily programmed by students to help teach themselves how to program numerical methods in parallel. 
Re: Fastest parallel solver for tridiagonal system
"Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary."
This algoithm is actually very inefficient. It has a parallel speed up that at best goes as 1/N where N is the number of processors  may as well run on a single processor. 
Re: Fastest parallel solver for tridiagonal system
> This algoithm is actually very inefficient. It has a parallel speed up that at
: best goes as 1/N where N is the number of processors  may as well run on : a single processor. Nonsense. My original implementations in the 80s were over 95% efficient and on all my parallel machines since I have had no problems maintaining efficiencies greater than 50% for real CFD simulations (I take little interest in optimising further when I get efficiencies above 50%). 
Re: Fastest parallel solver for tridiagonal system
I thought what you meant was that one processor starts off the forward substitution and when it gets to its last index, it passes the result to the next processor so that it can continue the forward substitution. This will continue until the last processor (ie Imax), which will then start the backward substitution, which proceeds all the way back to the first processor. In this case, only one processor is working at a time and all of the others are idle. The result is a speedup that goes as 1/(number of prcessors). I have no doubt that your algorithm is different but I don't yet see how. Can you explain it?

Re: Fastest parallel solver for tridiagonal system
> Can you explain it?
What you are describing applies to one grid line but there are lots of grid lines. For example. when a processor has passed on data from a forward sweep it starts the forward sweep of the next grid line and goes back later for the results from the back substitution. In practise there are a number of things one can do to tune the performance based on actual grid topology and the ratio of number crunching to communication speed of the parallel computer. It also needs a slight modification to handle the coarsest grids inside a multigrid scheme. Of course one can simply treat the interfaces in an explicit manner which is seems to be fairly common but then it is not the same as the sequential algorithm. This works well in some applications where the solution procedure takes only a few more extra iterations to converge but not in others. 
Re: Fastest parallel solver for tridiagonal system
>What you are describing applies to one grid line but there >are lots of grid lines. For example. when a processor has >passed on data from a forward sweep it starts the forward >sweep of the next grid line and goes back later for the >results from the back substitution.
Ok, but won't all except the first processor experience idle time while waiting for input from the previous processor? Can't the idle time of the last processor be significant? Can you compare this approach to the one in which the decomposition is transposed between tridiag solves so that all of the data for each linear system is always on a single processor? 
Re: Fastest parallel solver for tridiagonal system
> Ok, but won't all except the first processor experience idle time while
: waiting for input from the previous processor? All processors on a particular boundary can start at the same time not just one. > Can't the idle time of the last processor be significant? Not usually. For 64 processors in a 4*4*4 arrangement solving a CFD problem on a 512*512*512 grid there will be a delay of 3 forward sweeps before all processors are busy for passing a message at the end of each line. This is an efficiency loss of roughly 3/(2*512*512) = 0.000005! Of course, on modern machines you would collect a few line endings before passing the data because of the slow communication speed to number crunching speed but you can make this source of inefficiency a few orders of magnitude larger before it becomes significant to the overall problem. > Can you compare this approach to the one in which the decomposition is : transposed between tridiag solves so that all of the data for each linear : system is always on a single processor? If you sweep in only one direction then a sensible domain decomposition would give you this anyway. If you sweep in several directions it would not only mean moving a lot of data around but also moving it beyond immediate neighbours. On modern machines this is likely to take too much communication time given that line sweeps are very fast. My guess is that you would need a more number crunching intensive solver for it to be worthwhile but I have not tried it. My experience has been that it is often faster to duplicate small calculations rather than perform them once and try to pass the results to distant processors. 
Re: Fastest parallel solver for tridiagonal system
It seems there are two approaches being proposed here. 1.Transposed algorithm. (Tom) 2.Pipelining (Andy).
Most time consuming communication of 1 is the data transfer(seldom communication but big portion each time). Let M^3(M*M*M) be total data in one grid and npi be the number of grid in idirection, the communication time will be ti_1 = npi*(M^3/npi)/(Connection speed ) = M^3/(Connection speed) [ Because each processor has to gather the data from all processor ] Most time consuming communication of 2 is the frequent data transfer(small data but frequently) but the last processor has to wait until the previous processor send the data to him. Therefore the communication time, ti_2 = M^2 *Latency +M/(CPU SPEED) [Assuming Latency >> 1/Connection speed]. ti_1/ti_2 ~ ( M/CS )/(Lat+1/(CPUS*M^2) ) The ratio is clearly dependent on M,CS(connection speed) and Lat(latency). If we just plug in current supercomputer spec. i.e. latency ~ 2.5 microsec CS~6GB/s and CPUS ~ 6GB/s. We will see that the transpose method are faster until M ~4,000 [under single precision]. This is of course over exaggerate. We know that we cannot access main memory at that speed. Let assume a more practical rate, CS~ 2GB/s. This will give the break even point at M~1000. In my opinion I think the transpose method is more suited for current hardware. Tom, could you please point me to some literature about this ? 
Re: Fastest parallel solver for tridiagonal system
All I'm aware of is some older work:
Solution of the Euler and Navier Stokes equations on parallel processors using a transposed/Thomas ADI Algorithm CHYCZEWSKI, THOMAS S. (Pennsylvania State Univ., University Park) MARCONI, FRANK (Grumman Corporate Research Center, Bethpage, NY) PELZ, RICHARD B., CURCHITSER, ENRIQUE N. (Rutgers Univ., Piscataway, NJ) AIAA19933310 This was done on hypercube architectures (before linux clusters became popular). You may also want to look at a similar paper in which parallel cyclic reduction is used: Solution of the Euler and NavierStokes equations on MIMD distributed memory multiprocessors using cyclic reduction (605 KB) CURCHITSER, ENRIQUE N.PELZ, RICHARD B. (Rutgers University, Piscataway, NJ) MARCONI, FRANK (Grumman Aerospace Corp., Bethpage, NY) AIAA1992561 Aerospace Sciences Meeting and Exhibit, 30th, Reno, NV, Jan 69, 1992. 15 p. Research supported by Grumman Aerospace Corp. Although not discussed in the paper, the Thomas/transpose algorithm performed better than the cyclic reduction up to 128 processors. At some point, however, there should be a crossover. 
Re: Fastest parallel solver for tridiagonal system
Thank you!

Re: Fastest parallel solver for tridiagonal system
Hi, all!
Does anyone know any existing parallel implementations of ADIlike/Fractional Step methods that use parallel tridiagonal solvers, whether for SMPs, or clusters? Any scientific or industrial codes, that actually use these kinds of methods? I am looking for some hard numbers on parallel efficiency/FLOPs/time on modern architectures, and I would really like to try some codes out, but it is kind of difficult to find anything. Cheers, Marcin 
All times are GMT 4. The time now is 14:40. 