Fastest parallel solver for tridiagonal system

August 17, 2007, 06:57

Hello everyone,

I have been searching for parallel algorithm for compact scheme, i.e. Lele1993 (J.Comp.Phy) and Kobayashi1999 (J.Comp.Phy). This means I have to solve Ax=Bf where x is the desired value(interpolation/differentiation/second-order differentiation).

The matrix A i have to deal with is a tridiagonal matrix. I this A are always positive definite on ordinary smooth grid.(still don't know how to prove it though)

I did some research and believe that fine-grained parallelism such as Recursive doubling (Stone1975 TOMS) and Cyclic reduction(Hockney1965 J.ACM ) are not suited for current supercomputers which are now very large grain ( A cluster with multi-core node ). Family of pipelining algorithm also requires too much communications.

It seems that the reduced parallel diagonal dominant of SUN (http://mack.ittc.ku.edu/sun95application.html) are the most efficient for this problem. The complexity is 5n/p+4J (p=#of proc, J = truncated bandwidth) which almost matches the best serial code ( Thomas algo. with prefactored LU ~ 5n-3) .

Is this algorithm already the best (at the moment ?) Are there any algorithm more efficient than this one ? Are there any particular problem in this algorithm that I should be aware of ?

Any suggestions are appreciated.

Regards, Arpiruk

August 17, 2007, 08:06

The best way to solve this is not to make the tridiagonal solver parallel but to perform the tridiagonal solves simultaneously; i.e. if you have a 2d equation in x & y and you want to calculate u_x at each of the y values then split the grid into strips which can be solved at the same time using the usual single processor tridiagonal solver.

August 17, 2007, 08:44

I believe one critical step that was left out in the above algorithm is the need to transpose the decomposition. I.e., you can get u_x with parallelization in y if you have all of x for chunks of y on each processor. But then what happens when you need to get u_y? You need to transpose the decomposition so that each processor has all of y for chunks of x. A few algorithms are out there that perform this transpose efficiently in parallel, such as the block transpose or the perfect shuffle.

August 17, 2007, 09:31

>The best way to solve this is not to make the tridiagonal
:solver parallel but to perform the tridiagonal solves
:simultaneously; i.e. if you have a 2d equation in x & y and
:you want to calculate u_x at each of the y values then split >the grid into strips which can be solved at the same time
:using the usual single processor tridiagonal solver.

That is a good idea. I herd that idea once but I dont know the reference to the literature. I think it's called method of transpose or something like that.

However it still require a tremedous amount of data transfer. Let us consider the following model problem. Suppose we are to solve a 2D convection problem

u u_x + v u_y = 0 u v_x + v v_y = 0

( f_i = partial derivative of f along i)

with NxNy = 10^4x10^4 using 4 grids on 4 processors. ______________________ | | | | 1 | 2 | |___________|__________| | | | y | 3 | 4 | | |___________|__________| |__ x

Each of the processor only know the local data its is sotring. The compact scheme will couple the whole domain in to one equation. For example df/dx. Therefore to compute u_x, we can divide data in y direction into 4 strips. ( i.e. 3 to 3-1y and 3 -2y, 1 to 1-3y and 1-4y) Then 4-1y can be exchanged with 3-2y. The p3 will have the whole data to compute u_x in the first strip(1y) and p4 will solve for 2y.

The latency scales with number of processor(mpi_gather ) This method allow latency hiding but it is still tightly coupled.

What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster?

August 17, 2007, 11:00

"What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster?"

Never tried it on a conventional ethernet cluster - it works quite well on a supercomputer though! A similar trick is also used on vector CPU's where the tridiagonal part does not vectorize well - however with a little memory overhead the outer loops (which do vectorize) can be moved inside the tridiagonal solver's loop.

It may be cheaper to use a Conjugate gradient method to invert your system - the matrix multiplies only require a moderate amount of data traffic and this may compensate for the need to iterate (a Chebyshev iterative method may be even better since it avoids innerproducts/global sums).

August 17, 2007, 11:06

Thank you very much. I will look for algorithms you mentioned.

August 17, 2007, 12:20

To solve a tridiagonal system you need a forward sweep along the line from one boundary to the other followed by a backsubstitution. Periodic boundary conditions need a slight modification but follow the same procedure.

Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary. Also the latency of passing only one or two words of data at the end of each line would really hurt efficiency on most hardware today (but was OK on transputers in the 80s) and so normal implementations would collect a few line ending before passing the data onto the neighbour.

The TDMA algorithm is literally about 10 lines long and so can be easily programmed by students to help teach themselves how to program numerical methods in parallel.

August 17, 2007, 15:26

"Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary."

This algoithm is actually very inefficient. It has a parallel speed up that at best goes as 1/N where N is the number of processors - may as well run on a single processor.

August 17, 2007, 19:33

> This algoithm is actually very inefficient. It has a parallel speed up that at
: best goes as 1/N where N is the number of processors - may as well run on
: a single processor.

Nonsense. My original implementations in the 80s were over 95% efficient and on all my parallel machines since I have had no problems maintaining efficiencies greater than 50% for real CFD simulations (I take little interest in optimising further when I get efficiencies above 50%).

August 17, 2007, 22:04

I thought what you meant was that one processor starts off the forward substitution and when it gets to its last index, it passes the result to the next processor so that it can continue the forward substitution. This will continue until the last processor (ie Imax), which will then start the backward substitution, which proceeds all the way back to the first processor. In this case, only one processor is working at a time and all of the others are idle. The result is a speedup that goes as 1/(number of prcessors). I have no doubt that your algorithm is different but I don't yet see how. Can you explain it?

August 18, 2007, 03:44

> Can you explain it?

What you are describing applies to one grid line but there are lots of grid lines. For example. when a processor has passed on data from a forward sweep it starts the forward sweep of the next grid line and goes back later for the results from the back substitution.

In practise there are a number of things one can do to tune the performance based on actual grid topology and the ratio of number crunching to communication speed of the parallel computer. It also needs a slight modification to handle the coarsest grids inside a multigrid scheme.

Of course one can simply treat the interfaces in an explicit manner which is seems to be fairly common but then it is not the same as the sequential algorithm. This works well in some applications where the solution procedure takes only a few more extra iterations to converge but not in others.

August 18, 2007, 10:08

>What you are describing applies to one grid line but there >are lots of grid lines. For example. when a processor has >passed on data from a forward sweep it starts the forward >sweep of the next grid line and goes back later for the >results from the back substitution.

Ok, but won't all except the first processor experience idle time while waiting for input from the previous processor? Can't the idle time of the last processor be significant?

Can you compare this approach to the one in which the decomposition is transposed between tridiag solves so that all of the data for each linear system is always on a single processor?

August 18, 2007, 11:38

> Ok, but won't all except the first processor experience idle time while
: waiting for input from the previous processor?

All processors on a particular boundary can start at the same time not just one.

> Can't the idle time of the last processor be significant?

Not usually. For 64 processors in a 4*4*4 arrangement solving a CFD problem on a 512*512*512 grid there will be a delay of 3 forward sweeps before all processors are busy for passing a message at the end of each line. This is an efficiency loss of roughly 3/(2*512*512) = 0.000005! Of course, on modern machines you would collect a few line endings before passing the data because of the slow communication speed to number crunching speed but you can make this source of inefficiency a few orders of magnitude larger before it becomes significant to the overall problem.

> Can you compare this approach to the one in which the decomposition is
: transposed between tridiag solves so that all of the data for each linear
: system is always on a single processor?

If you sweep in only one direction then a sensible domain decomposition would give you this anyway. If you sweep in several directions it would not only mean moving a lot of data around but also moving it beyond immediate neighbours. On modern machines this is likely to take too much communication time given that line sweeps are very fast. My guess is that you would need a more number crunching intensive solver for it to be worthwhile but I have not tried it. My experience has been that it is often faster to duplicate small calculations rather than perform them once and try to pass the results to distant processors.

August 20, 2007, 09:44

It seems there are two approaches being proposed here. 1.Transposed algorithm. (Tom) 2.Pipelining (Andy).

Most time consuming communication of 1 is the data transfer(seldom communication but big portion each time). Let M^3(M*M*M) be total data in one grid and npi be the number of grid in i-direction, the communication time will be ti_1 = npi*(M^3/npi)/(Connection speed ) = M^3/(Connection speed) [ Because each processor has to gather the data from all processor ]

Most time consuming communication of 2 is the frequent data transfer(small data but frequently) but the last processor has to wait until the previous processor send the data to him.

Therefore the communication time, ti_2 = M^2 *Latency +M/(CPU SPEED) [Assuming Latency >> 1/Connection speed].

ti_1/ti_2 ~ ( M/CS )/(Lat+1/(CPUS*M^2) )

The ratio is clearly dependent on M,CS(connection speed) and Lat(latency).

If we just plug in current supercomputer spec. i.e. latency ~ 2.5 microsec CS~6GB/s and CPUS ~ 6GB/s. We will see that the transpose method are faster until M ~4,000 [under single precision]. This is of course over exaggerate. We know that we cannot access main memory at that speed. Let assume a more practical rate, CS~ 2GB/s. This will give the break even point at M~1000.

In my opinion I think the transpose method is more suited for current hardware.

Tom, could you please point me to some literature about this ?

August 20, 2007, 10:37

All I'm aware of is some older work:

Solution of the Euler and Navier Stokes equations on parallel processors using a transposed/Thomas ADI Algorithm

CHYCZEWSKI, THOMAS S. (Pennsylvania State Univ., University Park) MARCONI, FRANK (Grumman Corporate Research Center, Bethpage, NY) PELZ, RICHARD B., CURCHITSER, ENRIQUE N. (Rutgers Univ., Piscataway, NJ) AIAA-1993-3310

This was done on hypercube architectures (before linux clusters became popular). You may also want to look at a similar paper in which parallel cyclic reduction is used:

Solution of the Euler and Navier-Stokes equations on MIMD distributed memory multiprocessors using cyclic reduction (605 KB)

CURCHITSER, ENRIQUE N.PELZ, RICHARD B. (Rutgers University, Piscataway, NJ) MARCONI, FRANK (Grumman Aerospace Corp., Bethpage, NY) AIAA-1992-561 Aerospace Sciences Meeting and Exhibit, 30th, Reno, NV, Jan 6-9, 1992. 15 p. Research supported by Grumman Aerospace Corp.

Although not discussed in the paper, the Thomas/transpose algorithm performed better than the cyclic reduction up to 128 processors. At some point, however, there should be a crossover.

August 20, 2007, 10:53

Thank you!

August 28, 2007, 05:14

Hi, all!

Does anyone know any existing parallel implementations of ADI-like/Fractional Step methods that use parallel tridiagonal solvers, whether for SMPs, or clusters? Any scientific or industrial codes, that actually use these kinds of methods?

I am looking for some hard numbers on parallel efficiency/FLOPs/time on modern architectures, and I would really like to try some codes out, but it is kind of difficult to find anything.

Cheers,

Marcin

August 17, 2007, 06:57	Fastest parallel solver for tridiagonal system	#1
Arpiruk Guest Posts: n/a	Hello everyone, I have been searching for parallel algorithm for compact scheme, i.e. Lele1993 (J.Comp.Phy) and Kobayashi1999 (J.Comp.Phy). This means I have to solve Ax=Bf where x is the desired value(interpolation/differentiation/second-order differentiation). The matrix A i have to deal with is a tridiagonal matrix. I this A are always positive definite on ordinary smooth grid.(still don't know how to prove it though) I did some research and believe that fine-grained parallelism such as Recursive doubling (Stone1975 TOMS) and Cyclic reduction(Hockney1965 J.ACM ) are not suited for current supercomputers which are now very large grain ( A cluster with multi-core node ). Family of pipelining algorithm also requires too much communications. It seems that the reduced parallel diagonal dominant of SUN (http://mack.ittc.ku.edu/sun95application.html) are the most efficient for this problem. The complexity is 5n/p+4J (p=#of proc, J = truncated bandwidth) which almost matches the best serial code ( Thomas algo. with prefactored LU ~ 5n-3) . Is this algorithm already the best (at the moment ?) Are there any algorithm more efficient than this one ? Are there any particular problem in this algorithm that I should be aware of ? Any suggestions are appreciated. Regards, Arpiruk

August 17, 2007, 08:06	Re: Fastest parallel solver for tridiagonal system	#2
Tom Guest Posts: n/a	The best way to solve this is not to make the tridiagonal solver parallel but to perform the tridiagonal solves simultaneously; i.e. if you have a 2d equation in x & y and you want to calculate u_x at each of the y values then split the grid into strips which can be solved at the same time using the usual single processor tridiagonal solver.

August 17, 2007, 08:44	Re: Fastest parallel solver for tridiagonal system	#3
Tom2 Guest Posts: n/a	I believe one critical step that was left out in the above algorithm is the need to transpose the decomposition. I.e., you can get u_x with parallelization in y if you have all of x for chunks of y on each processor. But then what happens when you need to get u_y? You need to transpose the decomposition so that each processor has all of y for chunks of x. A few algorithms are out there that perform this transpose efficiently in parallel, such as the block transpose or the perfect shuffle.

August 17, 2007, 09:31	Re: Fastest parallel solver for tridiagonal system	#4
Arpiruk Guest Posts: n/a	>The best way to solve this is not to make the tridiagonal :solver parallel but to perform the tridiagonal solves :simultaneously; i.e. if you have a 2d equation in x & y and :you want to calculate u_x at each of the y values then split >the grid into strips which can be solved at the same time :using the usual single processor tridiagonal solver. That is a good idea. I herd that idea once but I dont know the reference to the literature. I think it's called method of transpose or something like that. However it still require a tremedous amount of data transfer. Let us consider the following model problem. Suppose we are to solve a 2D convection problem u u_x + v u_y = 0 u v_x + v v_y = 0 ( f_i = partial derivative of f along i) with NxNy = 10^4x10^4 using 4 grids on 4 processors. ______________________ \| \| \| \| 1 \| 2 \| \|___________\|__________\| \| \| \| y \| 3 \| 4 \| \| \|___________\|__________\| \|__ x Each of the processor only know the local data its is sotring. The compact scheme will couple the whole domain in to one equation. For example df/dx. Therefore to compute u_x, we can divide data in y direction into 4 strips. ( i.e. 3 to 3-1y and 3 -2y, 1 to 1-3y and 1-4y) Then 4-1y can be exchanged with 3-2y. The p3 will have the whole data to compute u_x in the first strip(1y) and p4 will solve for 2y. The latency scales with number of processor(mpi_gather ) This method allow latency hiding but it is still tightly coupled. What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster?

August 17, 2007, 11:00	Re: Fastest parallel solver for tridiagonal system	#5
Tom Guest Posts: n/a	"What disturb me most is the huge data transfer. Can this really be used on ordinary ethernet cluster?" Never tried it on a conventional ethernet cluster - it works quite well on a supercomputer though! A similar trick is also used on vector CPU's where the tridiagonal part does not vectorize well - however with a little memory overhead the outer loops (which do vectorize) can be moved inside the tridiagonal solver's loop. It may be cheaper to use a Conjugate gradient method to invert your system - the matrix multiplies only require a moderate amount of data traffic and this may compensate for the need to iterate (a Chebyshev iterative method may be even better since it avoids innerproducts/global sums).

August 17, 2007, 11:06	Re: Fastest parallel solver for tridiagonal system	#6
Arpiruk Guest Posts: n/a	Thank you very much. I will look for algorithms you mentioned.

August 17, 2007, 12:20	Re: Fastest parallel solver for tridiagonal system	#7
andy Guest Posts: n/a	To solve a tridiagonal system you need a forward sweep along the line from one boundary to the other followed by a backsubstitution. Periodic boundary conditions need a slight modification but follow the same procedure. Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary. Also the latency of passing only one or two words of data at the end of each line would really hurt efficiency on most hardware today (but was OK on transputers in the 80s) and so normal implementations would collect a few line ending before passing the data onto the neighbour. The TDMA algorithm is literally about 10 lines long and so can be easily programmed by students to help teach themselves how to program numerical methods in parallel.

August 17, 2007, 15:26	Re: Fastest parallel solver for tridiagonal system	#8
Tom2 Guest Posts: n/a	"Parallelisation consists of passing the result of the forward sweep to a neighbouring block and then waiting for the backsubstitution to be returned by that neighbouring block. This introduces some inefficiency as internal blocks have to wait to get data from blocks on the boundary." This algoithm is actually very inefficient. It has a parallel speed up that at best goes as 1/N where N is the number of processors - may as well run on a single processor.

August 17, 2007, 19:33	Re: Fastest parallel solver for tridiagonal system	#9
andy Guest Posts: n/a	> This algoithm is actually very inefficient. It has a parallel speed up that at : best goes as 1/N where N is the number of processors - may as well run on : a single processor. Nonsense. My original implementations in the 80s were over 95% efficient and on all my parallel machines since I have had no problems maintaining efficiencies greater than 50% for real CFD simulations (I take little interest in optimising further when I get efficiencies above 50%).

August 17, 2007, 22:04	Re: Fastest parallel solver for tridiagonal system	#10
Tom2 Guest Posts: n/a	I thought what you meant was that one processor starts off the forward substitution and when it gets to its last index, it passes the result to the next processor so that it can continue the forward substitution. This will continue until the last processor (ie Imax), which will then start the backward substitution, which proceeds all the way back to the first processor. In this case, only one processor is working at a time and all of the others are idle. The result is a speedup that goes as 1/(number of prcessors). I have no doubt that your algorithm is different but I don't yet see how. Can you explain it?

August 18, 2007, 03:44	Re: Fastest parallel solver for tridiagonal system	#11
andy Guest Posts: n/a	> Can you explain it? What you are describing applies to one grid line but there are lots of grid lines. For example. when a processor has passed on data from a forward sweep it starts the forward sweep of the next grid line and goes back later for the results from the back substitution. In practise there are a number of things one can do to tune the performance based on actual grid topology and the ratio of number crunching to communication speed of the parallel computer. It also needs a slight modification to handle the coarsest grids inside a multigrid scheme. Of course one can simply treat the interfaces in an explicit manner which is seems to be fairly common but then it is not the same as the sequential algorithm. This works well in some applications where the solution procedure takes only a few more extra iterations to converge but not in others.

August 18, 2007, 10:08	Re: Fastest parallel solver for tridiagonal system	#12
Tom2 Guest Posts: n/a	>What you are describing applies to one grid line but there >are lots of grid lines. For example. when a processor has >passed on data from a forward sweep it starts the forward >sweep of the next grid line and goes back later for the >results from the back substitution. Ok, but won't all except the first processor experience idle time while waiting for input from the previous processor? Can't the idle time of the last processor be significant? Can you compare this approach to the one in which the decomposition is transposed between tridiag solves so that all of the data for each linear system is always on a single processor?

August 18, 2007, 11:38	Re: Fastest parallel solver for tridiagonal system	#13
andy Guest Posts: n/a	> Ok, but won't all except the first processor experience idle time while : waiting for input from the previous processor? All processors on a particular boundary can start at the same time not just one. > Can't the idle time of the last processor be significant? Not usually. For 64 processors in a 444 arrangement solving a CFD problem on a 512512512 grid there will be a delay of 3 forward sweeps before all processors are busy for passing a message at the end of each line. This is an efficiency loss of roughly 3/(2512512) = 0.000005! Of course, on modern machines you would collect a few line endings before passing the data because of the slow communication speed to number crunching speed but you can make this source of inefficiency a few orders of magnitude larger before it becomes significant to the overall problem. > Can you compare this approach to the one in which the decomposition is : transposed between tridiag solves so that all of the data for each linear : system is always on a single processor? If you sweep in only one direction then a sensible domain decomposition would give you this anyway. If you sweep in several directions it would not only mean moving a lot of data around but also moving it beyond immediate neighbours. On modern machines this is likely to take too much communication time given that line sweeps are very fast. My guess is that you would need a more number crunching intensive solver for it to be worthwhile but I have not tried it. My experience has been that it is often faster to duplicate small calculations rather than perform them once and try to pass the results to distant processors.

August 20, 2007, 09:44	Re: Fastest parallel solver for tridiagonal system	#14
Arpiruk Guest Posts: n/a	It seems there are two approaches being proposed here. 1.Transposed algorithm. (Tom) 2.Pipelining (Andy). Most time consuming communication of 1 is the data transfer(seldom communication but big portion each time). Let M^3(MMM) be total data in one grid and npi be the number of grid in i-direction, the communication time will be ti_1 = npi(M^3/npi)/(Connection speed ) = M^3/(Connection speed) [ Because each processor has to gather the data from all processor ] Most time consuming communication of 2 is the frequent data transfer(small data but frequently) but the last processor has to wait until the previous processor send the data to him. Therefore the communication time, ti_2 = M^2 Latency +M/(CPU SPEED) [Assuming Latency >> 1/Connection speed]. ti_1/ti_2 ~ ( M/CS )/(Lat+1/(CPUS*M^2) ) The ratio is clearly dependent on M,CS(connection speed) and Lat(latency). If we just plug in current supercomputer spec. i.e. latency ~ 2.5 microsec CS~6GB/s and CPUS ~ 6GB/s. We will see that the transpose method are faster until M ~4,000 [under single precision]. This is of course over exaggerate. We know that we cannot access main memory at that speed. Let assume a more practical rate, CS~ 2GB/s. This will give the break even point at M~1000. In my opinion I think the transpose method is more suited for current hardware. Tom, could you please point me to some literature about this ?

August 20, 2007, 10:37	Re: Fastest parallel solver for tridiagonal system	#15
Tom2 Guest Posts: n/a	All I'm aware of is some older work: Solution of the Euler and Navier Stokes equations on parallel processors using a transposed/Thomas ADI Algorithm CHYCZEWSKI, THOMAS S. (Pennsylvania State Univ., University Park) MARCONI, FRANK (Grumman Corporate Research Center, Bethpage, NY) PELZ, RICHARD B., CURCHITSER, ENRIQUE N. (Rutgers Univ., Piscataway, NJ) AIAA-1993-3310 This was done on hypercube architectures (before linux clusters became popular). You may also want to look at a similar paper in which parallel cyclic reduction is used: Solution of the Euler and Navier-Stokes equations on MIMD distributed memory multiprocessors using cyclic reduction (605 KB) CURCHITSER, ENRIQUE N.PELZ, RICHARD B. (Rutgers University, Piscataway, NJ) MARCONI, FRANK (Grumman Aerospace Corp., Bethpage, NY) AIAA-1992-561 Aerospace Sciences Meeting and Exhibit, 30th, Reno, NV, Jan 6-9, 1992. 15 p. Research supported by Grumman Aerospace Corp. Although not discussed in the paper, the Thomas/transpose algorithm performed better than the cyclic reduction up to 128 processors. At some point, however, there should be a crossover.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
massive parallel run messed up lustre file system	matthias	OpenFOAM	1	May 28, 2011 16:18
Which parallel solver is most suitable for my app?	zonexo	Main CFD Forum	1	November 17, 2006 09:46
Parallel solver with 3.2	robbie	Siemens	1	August 10, 2004 07:29
direct parallel solver	sylvain	Main CFD Forum	2	June 25, 2001 07:56
solver for linear system with large sparse matrix	Yangang Bao	Main CFD Forum	1	October 25, 1999 04:22

August 20, 2007, 10:53	Re: Fastest parallel solver for tridiagonal system	#16
Arpiruk Guest Posts: n/a	Thank you!

August 28, 2007, 05:14	Re: Fastest parallel solver for tridiagonal system	#17
Marcin Guest Posts: n/a	Hi, all! Does anyone know any existing parallel implementations of ADI-like/Fractional Step methods that use parallel tridiagonal solvers, whether for SMPs, or clusters? Any scientific or industrial codes, that actually use these kinds of methods? I am looking for some hard numbers on parallel efficiency/FLOPs/time on modern architectures, and I would really like to try some codes out, but it is kind of difficult to find anything. Cheers, Marcin