
Performance/Usability problem in my C++ matrix library



January 11, 2021, 04:12   #1
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Summary:

I had a performance problem in my C++ matrix library. I found a basic solution, but I need to make it easier to use.


Body:

I'm using a Fortran-C++ hybrid programming model for my solver.
Fortran does *only* the core calculations in the solver.
C++ handles everything else: file I/O, memory management, error handling.


For each 2D matrix, I allocate a linear, contiguous block of NROWS*NCOLS elements using malloc in C++.


I can then pass the pointer to the first element of the allocated memory block to Fortran and use it directly as an NROWS*NCOLS matrix.
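
For context, the hand-off itself is just the pointer plus the dimensions. A rough sketch of the C++ side (the routine name solve_core and its argument list are made up here for illustration; the matching Fortran interface is shown only as a comment):
Code:
#include <cstddef>
#include <cstdlib>

// Hypothetical Fortran routine; on the Fortran side it would look roughly like:
//   subroutine solve_core(a, nrows, ncols) bind(C, name="solve_core")
//     use iso_c_binding
//     integer(c_int), value :: nrows, ncols
//     real(c_double)        :: a(nrows, ncols)
extern "C" void solve_core(double* a, int nrows, int ncols);

void run_solver(int nrows, int ncols)
{
    // One contiguous block; Fortran views it as an (nrows, ncols)
    // column-major matrix without any copy.
    double* head = static_cast<double*>(
        std::malloc(sizeof(double) * static_cast<std::size_t>(nrows) * ncols));

    solve_core(head, nrows, ncols);

    std::free(head);
}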



However, in C++ I have to write my own methods that compute the linear array index of an element (i,j) in order to access the data stored at (i,j). For higher dimensions, I have functions that compute the array index from (i,j,k) in 3D, (i,j,k,q) in 4D, and so on.


Unfortunately, the address/index calculation becomes notoriously complex and slow in higher dimensions. That's a big problem.


For 1D:
Code:
    /// Linear accessor
    /// NOTE : used when the operator() becomes slow for higher dimensions
    ///
    inline REAL& operator [] (int i)
    {
        return _head[i];
    }
For 2D:

Code:
    /// Access 2D matrices (column-major / Fortran layout:
    /// element (irow, icol) is stored at nrows*icol + irow)
    ///
    inline REAL& operator () (int irow, int icol)
    {
        return _head[nrows*icol + irow];
    }
For 4D:
Code:
    /// Access 4D matrices
    ///
    inline REAL& operator () (int i, int j, int k, int q)
    {
        return _head[ndim1*ndim2*ndim3*q + ndim1*ndim2*k + ndim1*j + i];
    }
The multiplication operations nrows*icol, ndim1*ndim2, etc. have an overhead, and I don't want that. I could pre-calculate the products ndim1*ndim2, ndim1*ndim2*ndim3, etc., but we still can't get rid of the addition operations.
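
For reference, this is roughly what that pre-calculation would look like (only a sketch; the member names are illustrative and REAL is assumed to be double): the stride products are computed once at construction, so operator() is left with only multiply-adds against stored constants.
Code:
using REAL = double;   // assumption: REAL is double precision

// Sketch: 4D accessor with stride products precomputed at construction.
// stride2 = ndim1*ndim2 and stride3 = ndim1*ndim2*ndim3 are stored once,
// so operator() never recomputes the products.
struct Matrix4DView
{
    REAL* _head;
    int   ndim1, stride2, stride3;

    Matrix4DView(REAL* head, int d1, int d2, int d3)
        : _head(head), ndim1(d1), stride2(d1 * d2), stride3(d1 * d2 * d3) {}

    inline REAL& operator()(int i, int j, int k, int q)
    {
        return _head[stride3*q + stride2*k + ndim1*j + i];
    }
};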


This is slow. This is horrendously slow. A linear memory access shouldn't be this complex. I hate it.



I found a basic solution that works for now.


Instead of accessing the elements through the operator() overloads, I directly calculate the index of the first element of the column I want (Fortran-style column-major alignment is used), and then use the 1D accessor (the operator[] overload shown above) to access the elements.


This is fast, but I want to know if there is any other way to do it.



Now, the final code looks like:
Code:
/// Fast 2D matrix access *pseudocode for now*
/// I haven't tested it, so the logic could have bugs,
/// but kindly assume the idea is correct.
///
int i, j;

for (j = 0; j < NCOLS; ++j)
{
    // Index of the first element of column j
    // (column-major / Fortran-style layout)
    int col_start_index = matrix_object.get_element_index(0, j);

    for (i = 0; i < NROWS; ++i)
    {
        // Walk the column linearly through the 1D accessor
        cout << matrix_object[col_start_index + i] << endl;
    }
}
The code should work. But there are still some big issues.


- Notice how i and j need to be declared outside of the loops (as in Fortran). I don't like it.

I'm working around it with: for(int i=0, j=0; j<NCOLS; ++j) { for(i=0; i<NROWS; ++i) { CODE(); } }

EDIT: Thinking about it, the two indices could even be collapsed into a single running index, e.g. for(int j=0; j<NCOLS; ++j) { int i = matrix_object.get_element_index(0,j); int end = i + NROWS; for( ; i<end; ++i) { cout << matrix_object[i] << endl; } }

- Notice how writing stencils will quickly become more complex.

This only gets worse in higher dimensions. The only solution I can think of is to use C++ iterators, or something like the column-pointer sketch below. I'm looking for other, better options.
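
For illustration, one possible middle ground (just a sketch against the accessors above, assuming the class keeps the same _head and nrows members) would be to expose a raw pointer to the start of a column, so stencil code indexes plain arrays:
Code:
/// Sketch: column view for the column-major (Fortran-layout) 2D matrix.
/// Assumes the same _head and nrows members used by operator() above.
inline REAL* col_ptr(int icol)
{
    return _head + nrows*icol;
}

/// Usage sketch for a 5-point stencil sweep (illustrative names only):
///   REAL* uc = U.col_ptr(j);      // centre column
///   REAL* uw = U.col_ptr(j - 1);  // west column
///   REAL* ue = U.col_ptr(j + 1);  // east column
///   for (int i = 1; i < NROWS - 1; ++i)
///       res[i] = ue[i] + uw[i] + uc[i+1] + uc[i-1] - 4.0*uc[i];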



I want to improve this. What should I do?

Last edited by aerosayan; January 11, 2021 at 05:01. Reason: formatting is really hard

January 11, 2021, 05:07   #2
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151, Blog Entries: 29
I'll skip over a lot of philosophical issues and just ask: if you want to use C++ for this (while Fortran would probably have been more suited), why don't you just use a library (not written by you)?

BTW, at some level, those multiplications in sub2ind (the MATLAB name of the function) are necessarily going to happen, even in Fortran. Are you sure this is as painfully slow as you think?

Consider this: in my code I have matrices in CSR format, which is just a linear array like yours, except that you don't know how many columns, nor which ones, are present in each row. So you have a lot of indexing into this array and, when you finally get to your row, you still have to search for the column you want. A basic implementation would use a brute-force search, while a more sophisticated one would use a binary search. I wanted to try both (even though my profiler didn't flag it as an issue) and, guess what: for the number of non-zero columns I typically have, it didn't change the performance at all.
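
In code, the two searches look roughly like this (a sketch with illustrative array names, not the actual solver code):
Code:
#include <algorithm>

// CSR storage (illustrative names):
//   row_ptr[i] .. row_ptr[i+1]-1 : positions of row i's stored entries
//   col_ind[p]                   : column index of the p-th stored entry
//   val[p]                       : its value

// Brute-force search for column j in row i; returns -1 if (i,j) is not stored.
int find_linear(const int* row_ptr, const int* col_ind, int i, int j)
{
    for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
        if (col_ind[p] == j) return p;
    return -1;
}

// Binary search; valid when each row's column indices are stored sorted.
int find_binary(const int* row_ptr, const int* col_ind, int i, int j)
{
    const int* first = col_ind + row_ptr[i];
    const int* last  = col_ind + row_ptr[i + 1];
    const int* it    = std::lower_bound(first, last, j);
    return (it != last && *it == j) ? static_cast<int>(it - col_ind) : -1;
}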

If I had to bet on something: if there is an issue, it is that you manipulate such a fundamental thing as a 2D matrix through OOP and the like. Matrices should be your first-class citizens natively (not in the OOP meaning of the term).

January 11, 2021, 05:41   #3
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Quote:
Originally Posted by sbaffini View Post
[...]

Hey sbaffini, thanks for the help.


> "why don't you just use a library"


The class is basically used to store data (U, V, residuals for a structured grid, as shown in the attached image). Currently I don't need to do any complex matrix operations like inversion, cross multiplication, etc.



In the previous version I used Eigen3. I liked it. But I don't want to have an external code dependency, and I actually want to learn by writing and optimizing my own code.


> "while Fortran would probably have been more suited"
It is. That's why the mathematical operations are in Fortran. This class is only used to access the matrix through the double* pointer _head for operations that need to be done in C++.


> "BTW, at some level, those multiplications in sub2ind (the MATLAB name of the function) are necessarily going to happen, even in Fortran. Are you sure this is as painfully slow as you think?"
I'll have to respectfully disagree with this. The sub2ind method is basically the same as my matrix_object.get_element_index(i,j) method. The only difference is that I'm optimizing here by not calling get_element_index multiple times in each iteration (consider something like U(i,j,k) = U(i,j,k) + U(i-1,j,k-1) + ...), and instead moving through the matrix linearly using matrix_object[i].
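
To make that concrete, the pattern looks roughly like this for a 3D stencil (a sketch only, untested, and it assumes a 3D overload of get_element_index that returns the linear offset of its subscripts):
Code:
// Sketch: hoist the index calculation out of the innermost loop.
// For fixed (j, k), U(i,j,k) and U(i-1,j,k-1) sit at constant offsets
// from the start of their columns, so the inner loop is a plain
// linear walk with one increment per element.
for (int k = 1; k < NDIM3; ++k)
{
    for (int j = 0; j < NDIM2; ++j)
    {
        int c0 = U.get_element_index(0, j, k);      // base of column (j, k)
        int c1 = U.get_element_index(0, j, k - 1);  // base of column (j, k-1)

        for (int i = 1; i < NDIM1; ++i)
        {
            // U(i,j,k) = U(i,j,k) + U(i-1,j,k-1)
            U[c0 + i] = U[c0 + i] + U[c1 + i - 1];
        }
    }
}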


The multiplication operations will definitely make it slower, and I don't want this very basic access to be slow. The reason being, I need to read/write data into the matrix constantly as we do block-structured AMR refinement.



Intel since Haswell has
  • add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
  • imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

https://stackoverflow.com/questions/...on-on-a-modern


3 cycles per multiplication, times the number of multiplications (also consider the temporary values in ndim1*ndim2*ndim3*q + ndim1*ndim2*k + ndim1*j + i), gives the total cycles wasted simply accessing one element. The addition operations will also stall, since they have to wait for the multiplication operations to finish first.


This will get slow.


matrix_object[i] is significantly faster since we're only adding 1 to the value of i every iteration. It will take only 1 cycle for the addition. Possibly 1 cycle for accessing the element.


> "If I had to bet on something: if there is an issue, it is that you manipulate such a fundamental thing as a 2D matrix through OOP and the like. Matrices should be your first-class citizens natively (not in the OOP meaning of the term)."
I agree. But the 1D accessor using the operator[] overload will absolutely be inlined.

My major aim is to see if I can make it easy to use, and to learn how others might solve this issue. Having more insight from experts will help.
Attached Images
File Type: gif cfl3d-grid.gif (18.2 KB, 11 views)

January 11, 2021, 05:51   #4
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151, Blog Entries: 29
You know, I'm just playing the devil's advocate here. You seem quite into that; you will eventually find out without our help.

However, I'm still of the opinion that you should first try to have a fully working code as soon as you can, and only later invest time where the profiler tells you to. As I wrote in a previous post, your code is going to change so many times that finding the perfect solution at any given time is typically a waste of time.

Quote:
Originally Posted by aerosayan View Post
In the previous version I used Eigen3. I liked it.
Couldn't you just look at how they do it? Was performance still not good enough for you?

January 11, 2021, 06:05   #5
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Quote:
Originally Posted by sbaffini View Post
[...]

As always, thanks for everyone's help. None of my friends or colleagues want to dive deep into solver development, so I'm really glad a forum like this exists.


> "Couldn't you just look at how they do it? Was performance still not good enough for you?"


Honestly, I'm not skilled enough (right now) to understand Eigen's code. They use an expression template engine they developed, and understanding that code gets very difficult. OpenFOAM also uses an expression template engine; this lets OpenFOAM define the PDEs as something like ddt::something = ddx::something_else + grad::something_else. But the underlying code is difficult to understand.
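
From what I can tell, though, the core idea is smaller than Eigen's codebase suggests: operator+ only builds a lightweight node that remembers its operands, and nothing is evaluated until the assignment loops over the indices once. A toy sketch (nothing like Eigen's or OpenFOAM's actual code, just the basic idea):
Code:
#include <cstddef>
#include <vector>

// Toy expression-template sketch: x = a + b + c is evaluated in a
// single loop, with no temporary vectors for the intermediate sums.

template <class E>               // CRTP base: marks "things usable in
struct Expr                      // an expression"
{
    const E& self() const { return static_cast<const E&>(*this); }
};

template <class L, class R>
struct Sum : Expr<Sum<L, R>>     // a + b is just a node holding references
{
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}

    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec : Expr<Vec>
{
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}

    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <class E>           // the ONLY loop: runs when we assign
    Vec& operator=(const Expr<E>& e)
    {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};

template <class L, class R>      // operator+ builds a node, computes nothing
Sum<L, R> operator+(const Expr<L>& a, const Expr<R>& b)
{
    return Sum<L, R>(a.self(), b.self());
}

// Usage sketch:  Vec x(n), a(n), b(n), c(n);  /* fill a,b,c */  x = a + b + c;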


> "However, I'm still of the opinion that you should first try to have a fully working code as soon as you can, and only later invest time where the profiler tells you to. As I wrote in a previous post, your code is going to change so many times that finding the perfect solution at any given time is typically a waste of time."


I agree. This is why I was actually using the sub2ind method till now. Only when I observed a drastic reduction in performance and bloated assembly code did I decide to make it faster.


Also, understanding the CPU architecture and the generated assembly helps a lot in identifying performance optimizations, as shown in the cycle calculation in my previous post.


I came here because I was doing it the wrong way till now (for years).

January 11, 2021, 06:56   #6
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Quote:
Originally Posted by sbaffini View Post
BTW, at some level, those multiplications in sub2ind (the MATLAB name of the function) are necessarily going to happen, even in Fortran. Are you sure this is as painfully slow as you think?
I checked the generated assembly of a formula with 2D matrices. Fortran finds the index using a method similar to sub2ind. This is a little troubling, to be honest.


I don't want to say that Fortran is a little slow... not yet, anyway. I have tests to run, mainly for matrices of larger dimensions.

January 11, 2021, 07:04   #7
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151, Blog Entries: 29
Honestly, I don't know how you could otherwise randomly access any array if not through sub2ind. I mean, that's the required math... you can do it in special registers or memory locations, but it necessarily has to happen at some point.

Sometimes you can get away with sequential access, but that's not always possible.

January 11, 2021, 07:13   #8
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Quote:
Originally Posted by sbaffini View Post
Honestly, I don't know how you could otherwise randomly access any array if not through sub2ind.

Sometimes you can get away with sequential access, but that's not always possible.

You're absolutely correct, we do need random access. But not always. In these cases, we need linear access through the elements of the matrices: we have the guarantee that we'll iterate from element 1 to N in order.


The index calculation is unnecessary there. We can just increment the index and access the data directly. But I don't know whether it would be a major issue or not. Modern CPUs prefetch data to fill the caches, and this should make the code run fast.


But I have my doubts. Tests are required before making any claims.


EDIT:

The Fortran code will most likely be fast and I'm most probably wrong. I didn't see the Fortran index calculation expand into ADD, IMUL and other operations, and I trust the experts who wrote the compiler.

In C++, I don't: there we can definitely see the index calculation happening through ADD, IMUL and other operations. C++ is definitely the slower of the two.

January 11, 2021, 07:41   #9
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151, Blog Entries: 29
OK, maybe I misunderstood what you were after. I thought you were trying to find a way to randomly access an n-dimensional array.

January 11, 2021, 14:46   #10
Sayan Bhattacharjee (aerosayan), Senior Member, Join Date: Mar 2020, Posts: 495
Quote:
Originally Posted by sbaffini View Post
OK, maybe I misunderstood what you were after. I thought you were trying to find a way to randomly access an n-dimensional array.

We're simply iterating through the matrix as we do in Fortran. The data will also be used in Fortran, so the data storage is the same as for Fortran matrices. Thus, we know that we will be accessing the memory in a linear fashion in our iterations.



I'm currently moderately happy with the solution I have, since the previous solution was worse and fully dependent on slow OOP code.


How slow was the OOP code?
The "slow" functions shown above *were* an optimization initially.


The data was originally stored inside nested objects like dynamic_matrix<static_array<static_array<double, 2>, 4>> XYZ(NROWS, NCOLS);


This was good for memory management and alignment purposes, but the performance was garbage due to the nested calls to the operator() or operator[] overloads of each of the nested classes. It was bad.


So, to improve performance, I just took the double* pointer to the allocated memory block and put it inside a lightweight matrix class that allows me to access the whole memory block with a single call, instead of multiple nested calls.


The functions from this lightweight class are the ones shown above in the original post.
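
For completeness, the shape of that lightweight wrapper is roughly this (a sketch with illustrative names, assuming REAL is double; the real class just holds the accessors quoted in the original post):
Code:
using REAL = double;   // assumption: REAL is double precision

// Sketch: non-owning 2D view over a contiguous, column-major block.
// The block itself is malloc'd/freed elsewhere (e.g. by the memory
// manager that also hands the pointer to Fortran).
class MatrixView2D
{
public:
    MatrixView2D(REAL* head, int nrows, int ncols)
        : _head(head), _nrows(nrows), _ncols(ncols) {}

    // Single flat access: one multiply-add, easily inlined.
    inline REAL& operator()(int irow, int icol)
    {
        return _head[_nrows*icol + irow];
    }

    // Linear access for sequential sweeps.
    inline REAL& operator[](int i) { return _head[i]; }

    int nrows() const { return _nrows; }
    int ncols() const { return _ncols; }

private:
    REAL* _head;
    int   _nrows, _ncols;
};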


Tags
c++, matrix operations, performance





