Is there a way to order nodes in a mesh to efficiently compute sparse matrices?

HectorRedal · December 6, 2017, 10:16

Hi,

I don't know if I am going to explain my problem clear enough or my question makes no sense.

Suppose there is a mesh created or generated by hand or by any software.
The identifers provided to the nodes of the this mesh have been assigned randomly / arbitrarily.

The CFD software takes this mesh as input and creates some sparse matrices.
Obviously, the order assigned to the nodes affects the performance cost of the calculation / solution of the problem.
If I manage to constructed a narrow banded (almost diagonal) matrix, the performance cost when calculating the inverse of the matrix (for example) will definitelly be smaller than any other matrix (constructed for the same mesh).

Are there any algorithms I can check / read to implement some kind of preprocessing step so as to order the nodes of the mesh when building the sparse matrix of the CFD software?
Are there any paper or reading I can take a look at?

Thanks in advance for your kind support.

Best regards,
Hector.

FMDenaro · December 6, 2017, 10:41

https://it.mathworks.com/help/matlab...-matrices.html

sbaffini · December 6, 2017, 10:42

There is really a lot of material on this. Basically, any book on solving linear equation systems has some section on the proper ordering of the variables.

The reverse Cuthill–McKee algorithm (RCM) is one such orderings based only on the graph of the matrix (i.e., the connectivity of the mesh, not its coordinates).

Space Filling Curves are another mean to order "stuff" in space and only require local coordinates of the objects to sort. Thus they are suitable for matrices coming from PDEs discretized in space.

mprinkey · December 6, 2017, 10:58

Quote:

Originally Posted by sbaffini

There is really a lot of material on this. Basically, any book on solving linear equation systems has some section on the proper ordering of the variables.

The reverse Cuthill–McKee algorithm (RCM) is one such orderings based only on the graph of the matrix (i.e., the connectivity of the mesh, not its coordinates).

Space Filling Curves are another mean to order "stuff" in space and only require local coordinates of the objects to sort. Thus they are suitable for matrices coming from PDEs discretized in space.

I've found space filling curves to be every bit as good, if not better than RCM. RCM gives the lowest bandwidth result which is exactly what you want for avoiding fill-in for a banded algorithm. But SFC seem to do a better job of reducing "average" bandwidth allowing that cache misses/parallel communication must occur, it seems to reduce the average span of the matrix at the expense of more severe outliers. But if an outlier is not in cache, it doesn't matter much where it is. Of course, this is just my experience. I don't have any solid theoretical/experimental results to point to.

I have looked a using SFC to create blocks of a characteristic (say L2) cache size and then maybe RCM on each of those blocks.

sbaffini · December 6, 2017, 12:09

If I remember correctly, RCM might have some advantage for LU factorizations or the sort of. However, in addition to what Michael has stated, SFC orderings have also the advantage of producing an ordering also suitable for straightforward parallel partitioning and more (there is a paper by aftosmis on this).

HectorRedal · December 6, 2017, 17:35

Hi,

Thank you all for your help and support!

I have been digging out a bit in the internet using the information you have provided to me (Cuthill–McKee algorithm and Space Filling Curves), and I have found that all these methods are suitable for SparseLU solvers.
I have just one doubt, can these methods be used with BiCGSTAB solvers?
Maybe my question makes no sense, so I beg your pardon if this is the case.

Best regards.

mprinkey · December 6, 2017, 21:14

Quote:

Originally Posted by HectorRedal

Hi,

Thank you all for your help and support!

I have been digging out a bit in the internet using the information you have provided to me (Cuthill–McKee algorithm and Space Filling Curves), and I have found that all these methods are suitable for SparseLU solvers.
I have just one doubt, can these methods be used with BiCGSTAB solvers?
Maybe my question makes no sense, so I beg your pardon if this is the case.

Best regards.

Matrix ordering WILL NOT change the efficacy of BiCGStab or any other Krylov method in terms of increasing or decreasing the number of iterations required. The same Krylov spaces will be generated (subject to the same reordering of the solution and subspace vectors). The only potential changes *could* be due to round-off errors from the change in order of operations--but neither RCM or SFC address this issue in any fashion.

For (Super)-LU schemes or a wide-band Thomas algorithms that generate direct (aka exact) rather than iterative solutions to the linear system, there is a big benefit for RCM and the resulting reduced matrix bandwidth. The scaling of the solution of, say, a Thomas algorithm is O(N*B^2) where N is the number of unknowns and B is the bandwidth of the matrix. If you can reorder and reduce B by some significant factor, you reduce the runtime of the exact Thomas solution of the reordered matrix by that factor squared. Fill-in storage is significantly reduced as well.

For iterative schemes, the value of matrix re-ordering is ALL cache related. Matrices with smaller bandwidths tend to have more cache reuse and fewer pipeline stalls due to cache misses. Better cache usages tends to make for faster runtimes and more efficient solution schemes. But these efficiency improvements are NOT algorithmic...they are artifacts of the memory hierarchy in modern CPU and memory subsystems. And some CPU, cache, and memory configurations may see more benefit from re-ordering than others. This is deep in the weeds of high-performance coding, but these are where the distinctions lie.

HectorRedal · December 7, 2017, 02:28

Quote:

Originally Posted by mprinkey

Matrix ordering WILL NOT change the efficacy of BiCGStab or any other Krylov method in terms of increasing or decreasing the number of iterations required. The same Krylov spaces will be generated (subject to the same reordering of the solution and subspace vectors). The only potential changes *could* be due to round-off errors from the change in order of operations--but neither RCM or SFC address this issue in any fashion.

For (Super)-LU schemes or a wide-band Thomas algorithms that generate direct (aka exact) rather than iterative solutions to the linear system, there is a big benefit for RCM and the resulting reduced matrix bandwidth. The scaling of the solution of, say, a Thomas algorithm is O(N*B^2) where N is the number of unknowns and B is the bandwidth of the matrix. If you can reorder and reduce B by some significant factor, you reduce the runtime of the exact Thomas solution of the reordered matrix by that factor squared. Fill-in storage is significantly reduced as well.

For iterative schemes, the value of matrix re-ordering is ALL cache related. Matrices with smaller bandwidths tend to have more cache reuse and fewer pipeline stalls due to cache misses. Better cache usages tends to make for faster runtimes and more efficient solution schemes. But these efficiency improvements are NOT algorithmic...they are artifacts of the memory hierarchy in modern CPU and memory subsystems. And some CPU, cache, and memory configurations may see more benefit from re-ordering than others. This is deep in the weeds of high-performance coding, but these are where the distinctions lie.

Thanks a lot for your clear and cristal explanation.
I have perfectly grasped the main point of the subject under discussion.

Just more question: Which algorithms is quicker / faster?
Using a wide-band Thomas with RCM algorithm or a Krylov method? (If they are comparable)

Can they be compared? Or are we talking about things that are not comparable?

FMDenaro · December 7, 2017, 03:49

Quote:

Originally Posted by HectorRedal

Thanks a lot for your clear and cristal explanation.
I have perfectly grasped the main point of the subject under discussion.

Just more question: Which algorithms is quicker / faster?
Using a wide-band Thomas with RCM algorithm or a Krylov method? (If they are comparable)

Can they be compared? Or are we talking about things that are not comparable?

The issue is about what do you mean for quicker/faster. Depending on the method, CPU time and number of iterations can produce strange and conflicting results

mprinkey · December 7, 2017, 06:57

Quote:

Originally Posted by FMDenaro

The issue is about what do you mean for quicker/faster. Depending on the method, CPU time and number of iterations can produce strange and conflicting results

Indeed. 8)

In terms of practical use in CFD applications, Krylov solvers (CG for symmetric positive definite matrices, BiCGStab and GMRES for non-symmetric matrices) are state of the art. Other alternatives are pure multigrid iterative schemes. The best solvers for large systems use multigrid iteration as a precondition for the Krylov space solver and reap the synergistic benefit of both approaches.

Direct/Exact solvers like LU/Thomas algorithms are not regularly used in CFD for the solution of 2D/3D problems. They are only used in reduced-dimension approximations like Alternate Direction Implicit methods or for line relaxation in smoother/preconditioner schemes...but those applications usually have a bandwidth of 3 for 2nd-order schemes or proportionally larger for higher order schemes. Again, these are solving only 1D sub-sets of the larger 2D/3D problem and are used in an iterative fashion or used in conjunction with a dimensional reduction scheme, say by using FFT techniques in the other spatial dimensions. The only other CFD application I can think of for banded direct solvers is in the evaluation of the compact finite differences.

Outside of CFD, RCM with direct solver methods may be used to do more than solve the linear system. For example, a reduced bandwidth matrix will be easier to extract eigenvalues/vectors which is useful, say, for identifying vibrational frequency and mode shapes in elastic structures.

FMDenaro · December 7, 2017, 07:07

My experience is somehow depressing me...For standard 3D second order elliptic operators any try to get the best CPU time performance out of my optimized SOR (parallelized with black&white assumption) failed. That happened at least for number of unknowns <O(10^7).
Yes, I get improvements in terms of numbers of iterations but as my final goal was to have the fastest solver, I always used the SOR method...

mprinkey · December 7, 2017, 08:05

Quote:

Originally Posted by FMDenaro

My experience is somehow depressing me...For standard 3D second order elliptic operators any try to get the best CPU time performance out of my optimized SOR (parallelized with black&white assumption) failed. That happened at least for number of unknowns <O(10^7).
Yes, I get improvements in terms of numbers of iterations but as my final goal was to have the fastest solver, I always used the SOR method...

If you have hard-coded the matrix coefficients (-1,-1,-1,6,-1,-1,-1) and you use red/black coloring, that gives loops that are very easy for a compiler to optimize using unrolling, loop blocking, and many other tricks. Also, you only have two vectors being accessed...the RHS and the solution, so it is highly likely that your problem fits mostly in cache. I ran into a similar problem with a code I worked on many years ago. The code was ancient and not well designed. It used SOR, so I tried to replace it with a canned ICCG solver and it ran (slightly) slower, just because their problem sizes were so small, SOR tucked into L2 cache and ICCG didn't. These are important issues, IMO, especially because of the push to larger core counts and GPU architectures. What algorithm *does* get the fastest time to solution as we end up with way more cores available than we can possibly use efficiently with older algorithms.

Having said that, compilers should be able to optimize Geometric Multigrid loops in much the same way as your Red-Black SOR loops. I'd be surprised if a GMG solver using RB GS couldn't best your SOR solver even on small cases, just because the convergence rate is higher, the storage is only log(N) larger, and loops exhibit the same structure. Of course, I am often wrong about such things.

HectorRedal · December 8, 2017, 04:17

Quote:

Originally Posted by mprinkey

Indeed. 8)

In terms of practical use in CFD applications, Krylov solvers (CG for symmetric positive definite matrices, BiCGStab and GMRES for non-symmetric matrices) are state of the art. Other alternatives are pure multigrid iterative schemes. The best solvers for large systems use multigrid iteration as a precondition for the Krylov space solver and reap the synergistic benefit of both approaches.

Direct/Exact solvers like LU/Thomas algorithms are not regularly used in CFD for the solution of 2D/3D problems. They are only used in reduced-dimension approximations like Alternate Direction Implicit methods or for line relaxation in smoother/preconditioner schemes...but those applications usually have a bandwidth of 3 for 2nd-order schemes or proportionally larger for higher order schemes. Again, these are solving only 1D sub-sets of the larger 2D/3D problem and are used in an iterative fashion or used in conjunction with a dimensional reduction scheme, say by using FFT techniques in the other spatial dimensions. The only other CFD application I can think of for banded direct solvers is in the evaluation of the compact finite differences.

Outside of CFD, RCM with direct solver methods may be used to do more than solve the linear system. For example, a reduced bandwidth matrix will be easier to extract eigenvalues/vectors which is useful, say, for identifying vibrational frequency and mode shapes in elastic structures.

Thanks for the information.
Right now the algorithm I have implemented is using a BiCGStab for solving the linear systems used by my application. As pre-conditioner I am using IncompleteLUT.
I am not using any kind of multigrid in my application, so I understand my approach is fair enough.
I would like to comment also that I am using non-uniform meshes, based on TRIA elements that discretize the domain / problem.

December 6, 2017, 10:16	Is there a way to order nodes in a mesh to efficiently compute sparse matrices?	#1
HectorRedal Senior Member Hector Redal Join Date: Aug 2010 Location: Madrid, Spain Posts: 243 Rep Power: 16	Hi, I don't know if I am going to explain my problem clear enough or my question makes no sense. Suppose there is a mesh created or generated by hand or by any software. The identifers provided to the nodes of the this mesh have been assigned randomly / arbitrarily. The CFD software takes this mesh as input and creates some sparse matrices. Obviously, the order assigned to the nodes affects the performance cost of the calculation / solution of the problem. If I manage to constructed a narrow banded (almost diagonal) matrix, the performance cost when calculating the inverse of the matrix (for example) will definitelly be smaller than any other matrix (constructed for the same mesh). Are there any algorithms I can check / read to implement some kind of preprocessing step so as to order the nodes of the mesh when building the sparse matrix of the CFD software? Are there any paper or reading I can take a look at? Thanks in advance for your kind support. Best regards, Hector. Quasar_89 likes this.

December 6, 2017, 10:41		#2
FMDenaro Senior Member Filippo Maria Denaro Join Date: Jul 2010 Posts: 6,768 Rep Power: 71	https://it.mathworks.com/help/matlab...-matrices.html HectorRedal likes this.

December 6, 2017, 10:42		#3
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,151 Blog Entries: 29 Rep Power: 39	There is really a lot of material on this. Basically, any book on solving linear equation systems has some section on the proper ordering of the variables. The reverse Cuthill–McKee algorithm (RCM) is one such orderings based only on the graph of the matrix (i.e., the connectivity of the mesh, not its coordinates). Space Filling Curves are another mean to order "stuff" in space and only require local coordinates of the objects to sort. Thus they are suitable for matrices coming from PDEs discretized in space. HectorRedal likes this.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Star CCM Overset Mesh Error (Rotating Turbine)	thezack	Siemens	7	October 12, 2016 11:14
2nd order boundary conditions for 2nd order discretization?	quarkz	Main CFD Forum	30	December 26, 2011 07:12
[snappyHexMesh] snappyHexMesh won't work - zeros everywhere!	sc298	OpenFOAM Meshing & Mesh Conversion	2	March 27, 2011 21:11
2d irregular grid	Remy	Main CFD Forum	1	December 22, 2008 04:49
Icemcfd 11: Loss of mesh from surface mesh option?	Joe	CFX	2	March 26, 2007 18:10

December 6, 2017, 12:09		#5
sbaffini Senior Member Paolo Lampitella Join Date: Mar 2009 Location: Italy Posts: 2,151 Blog Entries: 29 Rep Power: 39	If I remember correctly, RCM might have some advantage for LU factorizations or the sort of. However, in addition to what Michael has stated, SFC orderings have also the advantage of producing an ordering also suitable for straightforward parallel partitioning and more (there is a paper by aftosmis on this).

December 6, 2017, 17:35		#6
HectorRedal Senior Member Hector Redal Join Date: Aug 2010 Location: Madrid, Spain Posts: 243 Rep Power: 16	Hi, Thank you all for your help and support! I have been digging out a bit in the internet using the information you have provided to me (Cuthill–McKee algorithm and Space Filling Curves), and I have found that all these methods are suitable for SparseLU solvers. I have just one doubt, can these methods be used with BiCGSTAB solvers? Maybe my question makes no sense, so I beg your pardon if this is the case. Best regards.

December 7, 2017, 07:07		#11
FMDenaro Senior Member Filippo Maria Denaro Join Date: Jul 2010 Posts: 6,768 Rep Power: 71	My experience is somehow depressing me...For standard 3D second order elliptic operators any try to get the best CPU time performance out of my optimized SOR (parallelized with black&white assumption) failed. That happened at least for number of unknowns <O(10^7). Yes, I get improvements in terms of numbers of iterations but as my final goal was to have the fastest solver, I always used the SOR method...