CFD Online Discussion Forums

CFD Online Discussion Forums (
-   OpenFOAM (
-   -   Converting OpenFOAM to VHDL Performance Testing (

bclodfe2uiucedu March 9, 2008 16:57

Greetings! I'm an electrica

I'm an electrical engineering student (senior year, looking to get into graduate school) at the University of Illinois in Urbana-Champaign.

Currently, I'm working on a research project that requires me to isolate one or two of the most computationally-demanding libraries of a CFD solver (like OpenFOAM), convert it to VHDL, and benchmark the converted code's performance on a Xilinx FPGA against code running on relatively recent Intel CPU (Core 2 Duo, for instance).

Henry Weller helped me out quite a bit--he indicated that the GAMG solver (more specifically, the Gauss-Seidel smoother and the matrix multiplier "IduMatrixATmul.C") requires a fairly large chunk of the CPU time on any given simulation.

Therein lies the problem: I downloaded and decompressed the source code, and after examining the matrix multiplier code (file path given at the bottom of this post), I'm having an enormous amount of difficulty understanding exactly what's going on.

Part of the problem is that I've only had a couple of semesters of C/C++ (also, the art of helpful commenting seems to have been lost on the coder), and this particular file references so many other functions that I realize that I need to request help from other, more experienced programmers. You don't have to know VHDL--that's my job--so long as you can help shed light on what this part of OpenFOAM is doing so that I can write an encapsulated version of it in VHDL and C/C++ for performance comparison's sake (code in a bottle, so to speak), I would greatly appreciate it.

Before I assume that someone's out there by launching into the plethora of questions that I've been compiling over the last couple of days, I'll just post this and see if anyone responds.

Thanks for your time!


Bryan Clodfelter
University of Illinois
Email: (automatically redirects to my personal email inbox)
Phone: (224) 636-4000
AIM: EliteMacFreak

Note: the file that I'm currently analyzing is:


mattijs March 10, 2008 15:42

In that specific file remove a
In that specific file remove all the #ifdef ICC_IA64_PREFETCH. What remains is the XXXXupdateMatrixInterfaces calls (non-constant boundaries) and a sparse vector matrix structure.

There is a psi (= vector) and a matrix (A) which is split into a diagonal, upper and lower part.

Sparse: since the matrix results from discretising between cells and immediate neighbours (i.e. face neighbours) the coefficients are stored in order of the face. From a face index one gets the neighbouring cell index through the lduAddr.upperAddr() and lduAddr().lowerAddr() addressing maps.

connclark March 11, 2008 11:50

I have some advice for you on
I have some advice for you on how to figure out the code for converting it to VHDL. Use the gcc switch '-save-temps' to compile it. Gcc will then keep the generated assembly code and you can implement the opcodes in hardware from there.

Actually your project is very interesting to me because I have an idea about building supercomputer nodes with a basic cheap processor and a FPGA to do the heavy work.

bclodfe2uiucedu March 16, 2008 19:49

Thanks for your help. Both re
Thanks for your help. Both replies were very informative!

Conn, you actually mentioned something related to my this project--if I can get this "proof of concept" working, then we may gain the funding necessary to begin work on something similar to Berkley's BEE2 project, using a large array of Xilinx Virtex5 FPGAs. The end result will probably wind up working like a physics accelerator, except that the "accelerator" will be the size of a server rack--not a little PCI-Express card. Assuming that we can find enough memory-sucking operations and multiplications to parallelize, we should be able to knock entire days off of a 3-5 day simulation.

For instance, 4 Virtex5 FPGAs in a 1U blade x 52U rack x ~200 multiplications/clock x 500 MHz =
~20.8 quadrillion multiplications/sec. There are actually some Virtex5 units that can do 512 multiplications/clock, but they might not have the I/O necessary for the job.


Anyway, at the moment I'm still very stuck. To recap, here are my objectives:

1. Create a self-sufficient version of IduMatrixATmul.C in C and VHDL, and then compare the performance of these two versions running on widely-available hardware.

2. Figure out what kind of computing resources I'll need to run a small scale (~5000 cell) simulation using my headless ("zombified!") code, and then a large one (~1 million cells) given the fact that I can do 200-500 18-bit multiplication operations simultaneously at 500 MHz in a single FPGA.

On a related note, my Xilinx rep has a few other questions regarding time-multiplexed data rates, sample times, and getting data in and out of the FPGA. In my opinion, the only way that I can answer those questions with any certainty is by writing both versions of this code and actually running them.


Now, if you don't mind, I have a few more questions that deal specifically with IduMatrixATmul.C:

1. What is the dimension and rank of the matrices that we're dealing with? For instance, d = 3 & r = 2 produces a matrix with 9 elements. D = 3 & r = 3 has 27 elements. I visualize the former case as a two-dimensional 3 x 3 matrix (a surface), and the latter case as a three-dimensional 3 x 3 matrix (a cube). Is that correct? How about the psi vectors? What dimension are they?

2. What kind of numbers are we dealing with? Integers? Floating point decimals? Double-precision floating point decimals?

3. How do I quantify the operations being performed here? For instance, are these multiplications (matrix * matrix), or (matrix * vector), or something else? How many are being performed per cell? For instance, is there one operation per face (so, six per cell)?

4. How is this data formatted before it is operated upon?

5. In your opinion, if I were to run operations on the same cell data over and over again--a situation that might occur if I create a standalone version of IduMatrixATmul, hard-code a single vector and matrix into it, and encapsulate the whole mess in a FOR loop--do you think that that would run any faster than "real" data on a typical x86 system due to the effect of the system's cache?

Sorry for the deluge,

Bryan Clodfelter
University of Illinois at Urbana-Champaign
(224) 636-4000

andre April 15, 2008 03:30

Hi guys, I am doing this thes
Hi guys,
I am doing this thesis already and have a lot of contact with H. Jasak. Since I am already further advanced and understand almost everything of how the matrices are build I could tell you that going into hardware is not so straightforward as it seems.

Let me give you an example:
Suppose we have matrix with 1,000,000 diagonal cells and 6,000,000 off-diagonal elements (a case build from hexahedrons and ignored boundaries). Suppose the matrix is fully executed in hardware. Lets assume zero hardware time execution.
The total communication from PC to FPGA would be 1,000,000 + 6,000,000 + 1,000,000 + 1,000,000 (Diagonal, Upper/Lower , x, y) = 9,000,000 entries to calculate (L+D+U)*x = y. The total multiply-(add) instructions equals the non-zero elements of the matrix which is 7,000,000. This means that the communication cost is higher then the computational cost. Therefore it would be useless to go to hardware if access time to the FPGA is slow. The computations and communication will be used to estimate the performances.

For example: suppose a PCI-X communication line is used between the CPU and the FPGA.

PCI-X: 133 MHz, 64 bits wide => 1064 MB/s maximal throughput
The total round trip time cost to send 9,000,000 64 bits (double precious floating points), which equals 72 MB, equals 72/1064 = 67.7 ms
Ok lets ignore the computational cost on the FPGA and say that the total time is 67.7 ms.

Now suppose we have an opteron machine with 6.4 GB/s bandwith from memory to CPU and clock frequency 2500 MHz.
Total communication: 72/6400 = 11.9 ms
Total computation: 1 cycle for a mult-add instructions (pipelined and throughput will be 1)
7,000,000 operations/2500 MHz = 2.8 ms. (almost also negligible compared to communication)
The total time to execute on the software is 14.7 ms (I ignored the index calculations).

If we compare hardware vs software, 67.7ms with 14.7ms we see that a lot of time is spend in communication.

Anyway I have a SGI machine with high data throughput and my first target is to implement the matrix vector product, after that the Gauss-seidel/Gamg solvers if there is some time left.


andre April 15, 2008 08:33

BTW you must not forget that y
BTW you must not forget that you operate on huge list: the upper, lower and diagonal, in- and output lists of the arrays can easily add up to 72 MB (for 9,000,000 DP elements) .
A virtex 4 has around 1 MB on chip memory and about 50 MB on-board memory (depends). So you have to come up with a good strategy to hide the communication latencies. If you are using the full 1 MB of local memory you can only have 168 (if I am correct for the Virtex 4 LX200) 64-bit memory accesses, this means that, assuming 3 input multiply add functions, 56 can be placed in parallel.

The way the matrices are stored now, are totally not suited to be implemented in hardware. You probably already know this. In order to reduce the communication the matrix must be stored in a different way (e.g Compressed Row Storage). In this way you can operate on whole rows and about 50% of memory accesses can be avoided for this matrix, since you operate on temporary results and dont have to load and write each time the values from the resulting array.

At the end we can compare our designs

Good luck.

bclodfe2uiucedu April 16, 2008 14:21

Hey Andre, thanks for the resp
Hey Andre, thanks for the response. You're clearly ahead of me in terms of conceptual knowledge of CFD analysis (I'm still an undergrad). Is there a way that I can email you for faster communication?

With that said, at the moment, I'm simply looking to write a little C executable (running on a 3.0 GHz Pentium 4) and VHDL program (running on a Virtex II Pro) to demonstrate the superior capability of an FPGA to do these multiply-add calculations. I have three weeks between now and then to finish this assignment, but I'll be continuing research on this project (at a much-faster speed) in the summer and fall semesters. If this works out, we've been promised a couple of Virtex 5 boards to continue research.

I've read a lot about CFD, and I'm unclear on a few points that hopefully you can help me with:

1. First, are the values used normally integers, floating point, or double-precision floating point? Doing floating point math on a FPGA is something I've never done before--are there any libraries that I can use to get the job done?

2. I understand the (L + D + U)*x = y process. However, I can't find an example of how this works (from a programmer's standpoint) on a very, very small scale. If you had 1 diagonal cell and 6 off-diagonal elements, how exactly do you do the math?

3. From a bandwidth standpoint, why would you ever want to use PCI-X? PCI-Express 2.0 has a 16 GB/sec bandwidth, and the upcoming PCI-Express 3.0 has 32 GB/sec bandwidth. I believe that Berkley's BEE3 project uses PCI-E as well.

bclodfe2uiucedu April 16, 2008 14:22

Oh, and what about the possibi
Oh, and what about the possibility of onboard DDR SDRAM? I don't think that you accounted for that in your original post.

andre April 16, 2008 16:57

1. High likeley double precisi
1. High likeley double precisious floating point, or single precision floating point. Forget about integers, I dont think they are supported. I will implement the DP since this is the default option.
I never build any of them, but I am 100% sure that someone on your university already build them, just ask your advisor.

2. It's very simple. Just write the equations out for a smalle case. Lets say 5 diagonal elements (which means a case with 5 cells) and 6 off diagonal elemens (3 upper and 3 lower, which means 3 internal faces). Look on the forum where Mattijs Janssens explained how the addressings work and where the non-zero entries come from (face sharing between cells). If you understand it, its very easy to understand how the matrix vector product works.

Unrolling the equations and executing them in hardware as written now is a possibility but you have to remove read write dependencies from the resulting vector (I do not take this approach), but I remember I wrote an algorithm that reordered the instructions, but it's bad, cause different schemes like CRS need almost 50% less communication access. So you have to rewrite the matrix into a different format (I am working on that now) to reduce memory accesses and operate on rows.

3. Just to point the importance of communication out All I am trying to say is you have to hide communication latency

4. I mentioned somewhere 50 MB on board memory. But it's still not enough as with the case with 1 million cells where at least 72 MB was necessary. You still have to build a multibuffering scheme.

BTW: I dont understand anything of CFD's since you dont need to understand that for implementing the matrix vector product or the gauss-seidel smoothers in hardware.

Good luck (Finishing this in 3 weeks seems impossible )

andre April 17, 2008 16:02

Can anyone answer this questio
Can anyone answer this question plz:

Is it always true that the lowerAddr list, as a consequence of the face ordering, ends up sorted?

Thanks in advance.

andre May 14, 2008 07:16

Hi again, I already have wr
Hi again,

I already have written code for my thesis to transform from the current addressing lists to an extended CRS format in lineair time.

So consider the last question not being send.

andre May 15, 2008 15:30

Hey Openfoamers, Does anyon
Hey Openfoamers,

Does anyone works with cells containing more then 16 faces?
I know that there is practically no limit on the amount of faces for a cell, but If I limit this to 16 the hardware control can be simplified a lot.


mattijs May 15, 2008 17:33

- refine (2x2x2) all neighbour
- refine (2x2x2) all neighbours of a cell but not the cell itself. So instead of 6 neighbours the cell now has 6X4 neighbours. This is a situation which will happen quite a lot in split-hex meshing.

- for polyhedral meshes from e.g. Delaunay+polyDualMesh there is no limit on the number of faces.

andre May 28, 2008 05:48

Mattijs you wrote somewhere ab
Mattijs you wrote somewhere above: "What remains is the XXXXupdateMatrixInterfaces calls (non-constant boundaries) ... "

Is there any relation between XXXupdateMatrixInterfaces and the matrix input or output vector, i.e. does it change any of these values (I don't understand the code )?

I have not examined the code yet for the Gauss-Seidel solver, but I would like to know if the matrix stays fixed during runtime.


mattijs May 29, 2008 04:29

Yes the XXXXupdateMatrixInterf
Yes the XXXXupdateMatrixInterfaces are e.g. processor interfaces. Our matrices are distributed, each processor holds part of the matrix. At every solver iteration an operation might require neighbour cell information. If the neighbour is on another processor how are you going to get this hold of this information?

So in the solver loop we have these 'interfaces' which do all the communication with neighbouring processors (through mpi) in the form of a swap of the neighbouring cell value.

NickG March 23, 2010 08:54

Hi Andre

I know this post is a bit old but could I ask a couple of questions? - namely:

How far did you go in changing the code to suite compressed row format?
- Did you just use the converter or did you rearrange the whole way that the solver uses matrices?

- Is there any chance of seeing your thesis?



All times are GMT -4. The time now is 16:29.