
[Sponsors] 
March 9, 2008, 16:57 
Greetings!
I'm an electrica

#1 
New Member
Bryan Clodfelter
Join Date: Mar 2009
Location: Champaign, IL, USA
Posts: 4
Rep Power: 9 
Greetings!
I'm an electrical engineering student (senior year, looking to get into graduate school) at the University of Illinois in UrbanaChampaign. Currently, I'm working on a research project that requires me to isolate one or two of the most computationallydemanding libraries of a CFD solver (like OpenFOAM), convert it to VHDL, and benchmark the converted code's performance on a Xilinx FPGA against code running on relatively recent Intel CPU (Core 2 Duo, for instance). Henry Weller helped me out quite a bithe indicated that the GAMG solver (more specifically, the GaussSeidel smoother and the matrix multiplier "IduMatrixATmul.C") requires a fairly large chunk of the CPU time on any given simulation. Therein lies the problem: I downloaded and decompressed the source code, and after examining the matrix multiplier code (file path given at the bottom of this post), I'm having an enormous amount of difficulty understanding exactly what's going on. Part of the problem is that I've only had a couple of semesters of C/C++ (also, the art of helpful commenting seems to have been lost on the coder), and this particular file references so many other functions that I realize that I need to request help from other, more experienced programmers. You don't have to know VHDLthat's my jobso long as you can help shed light on what this part of OpenFOAM is doing so that I can write an encapsulated version of it in VHDL and C/C++ for performance comparison's sake (code in a bottle, so to speak), I would greatly appreciate it. Before I assume that someone's out there by launching into the plethora of questions that I've been compiling over the last couple of days, I'll just post this and see if anyone responds. Thanks for your time! Sincerely, Bryan Clodfelter University of Illinois Email: bclodfe2@uiuc.edu (automatically redirects to my personal email inbox) Phone: (224) 6364000 AIM: EliteMacFreak Note: the file that I'm currently analyzing is: OpenFOAM1.4.2/src/OpenFOAM/matrices/lduMatrix/lduMatrix/lduMatrixATmul.C 

March 10, 2008, 15:42 
In that specific file remove a

#2 
Super Moderator
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,416
Rep Power: 17 
In that specific file remove all the #ifdef ICC_IA64_PREFETCH. What remains is the XXXXupdateMatrixInterfaces calls (nonconstant boundaries) and a sparse vector matrix structure.
There is a psi (= vector) and a matrix (A) which is split into a diagonal, upper and lower part. Sparse: since the matrix results from discretising between cells and immediate neighbours (i.e. face neighbours) the coefficients are stored in order of the face. From a face index one gets the neighbouring cell index through the lduAddr.upperAddr() and lduAddr().lowerAddr() addressing maps. 

March 11, 2008, 11:50 
I have some advice for you on

#3 
Guest
Posts: n/a

I have some advice for you on how to figure out the code for converting it to VHDL. Use the gcc switch 'savetemps' to compile it. Gcc will then keep the generated assembly code and you can implement the opcodes in hardware from there.
Actually your project is very interesting to me because I have an idea about building supercomputer nodes with a basic cheap processor and a FPGA to do the heavy work. 

March 16, 2008, 19:49 
Thanks for your help. Both re

#4 
New Member
Bryan Clodfelter
Join Date: Mar 2009
Location: Champaign, IL, USA
Posts: 4
Rep Power: 9 
Thanks for your help. Both replies were very informative!
Conn, you actually mentioned something related to my this projectif I can get this "proof of concept" working, then we may gain the funding necessary to begin work on something similar to Berkley's BEE2 project, using a large array of Xilinx Virtex5 FPGAs. The end result will probably wind up working like a physics accelerator, except that the "accelerator" will be the size of a server racknot a little PCIExpress card. Assuming that we can find enough memorysucking operations and multiplications to parallelize, we should be able to knock entire days off of a 35 day simulation. For instance, 4 Virtex5 FPGAs in a 1U blade x 52U rack x ~200 multiplications/clock x 500 MHz = ~20.8 quadrillion multiplications/sec. There are actually some Virtex5 units that can do 512 multiplications/clock, but they might not have the I/O necessary for the job.  Anyway, at the moment I'm still very stuck. To recap, here are my objectives: 1. Create a selfsufficient version of IduMatrixATmul.C in C and VHDL, and then compare the performance of these two versions running on widelyavailable hardware. 2. Figure out what kind of computing resources I'll need to run a small scale (~5000 cell) simulation using my headless ("zombified!") code, and then a large one (~1 million cells) given the fact that I can do 200500 18bit multiplication operations simultaneously at 500 MHz in a single FPGA. On a related note, my Xilinx rep has a few other questions regarding timemultiplexed data rates, sample times, and getting data in and out of the FPGA. In my opinion, the only way that I can answer those questions with any certainty is by writing both versions of this code and actually running them.  Now, if you don't mind, I have a few more questions that deal specifically with IduMatrixATmul.C: 1. What is the dimension and rank of the matrices that we're dealing with? For instance, d = 3 & r = 2 produces a matrix with 9 elements. D = 3 & r = 3 has 27 elements. I visualize the former case as a twodimensional 3 x 3 matrix (a surface), and the latter case as a threedimensional 3 x 3 matrix (a cube). Is that correct? How about the psi vectors? What dimension are they? 2. What kind of numbers are we dealing with? Integers? Floating point decimals? Doubleprecision floating point decimals? 3. How do I quantify the operations being performed here? For instance, are these multiplications (matrix * matrix), or (matrix * vector), or something else? How many are being performed per cell? For instance, is there one operation per face (so, six per cell)? 4. How is this data formatted before it is operated upon? 5. In your opinion, if I were to run operations on the same cell data over and over againa situation that might occur if I create a standalone version of IduMatrixATmul, hardcode a single vector and matrix into it, and encapsulate the whole mess in a FOR loopdo you think that that would run any faster than "real" data on a typical x86 system due to the effect of the system's cache? Sorry for the deluge, Bryan Clodfelter University of Illinois at UrbanaChampaign (224) 6364000 bclodfe2@uiuc.edu 

April 15, 2008, 03:30 
Hi guys,
I am doing this thes

#5 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
Hi guys,
I am doing this thesis already and have a lot of contact with H. Jasak. Since I am already further advanced and understand almost everything of how the matrices are build I could tell you that going into hardware is not so straightforward as it seems. Let me give you an example: Suppose we have matrix with 1,000,000 diagonal cells and 6,000,000 offdiagonal elements (a case build from hexahedrons and ignored boundaries). Suppose the matrix is fully executed in hardware. Lets assume zero hardware time execution. The total communication from PC to FPGA would be 1,000,000 + 6,000,000 + 1,000,000 + 1,000,000 (Diagonal, Upper/Lower , x, y) = 9,000,000 entries to calculate (L+D+U)*x = y. The total multiply(add) instructions equals the nonzero elements of the matrix which is 7,000,000. This means that the communication cost is higher then the computational cost. Therefore it would be useless to go to hardware if access time to the FPGA is slow. The computations and communication will be used to estimate the performances. For example: suppose a PCIX communication line is used between the CPU and the FPGA. PCIX: 133 MHz, 64 bits wide => 1064 MB/s maximal throughput The total round trip time cost to send 9,000,000 64 bits (double precious floating points), which equals 72 MB, equals 72/1064 = 67.7 ms Ok lets ignore the computational cost on the FPGA and say that the total time is 67.7 ms. Now suppose we have an opteron machine with 6.4 GB/s bandwith from memory to CPU and clock frequency 2500 MHz. Total communication: 72/6400 = 11.9 ms Total computation: 1 cycle for a multadd instructions (pipelined and throughput will be 1) 7,000,000 operations/2500 MHz = 2.8 ms. (almost also negligible compared to communication) The total time to execute on the software is 14.7 ms (I ignored the index calculations). If we compare hardware vs software, 67.7ms with 14.7ms we see that a lot of time is spend in communication. Anyway I have a SGI machine with high data throughput and my first target is to implement the matrix vector product, after that the Gaussseidel/Gamg solvers if there is some time left. Regards 

April 15, 2008, 08:33 
BTW you must not forget that y

#6 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
BTW you must not forget that you operate on huge list: the upper, lower and diagonal, in and output lists of the arrays can easily add up to 72 MB (for 9,000,000 DP elements) .
A virtex 4 has around 1 MB on chip memory and about 50 MB onboard memory (depends). So you have to come up with a good strategy to hide the communication latencies. If you are using the full 1 MB of local memory you can only have 168 (if I am correct for the Virtex 4 LX200) 64bit memory accesses, this means that, assuming 3 input multiply add functions, 56 can be placed in parallel. The way the matrices are stored now, are totally not suited to be implemented in hardware. You probably already know this. In order to reduce the communication the matrix must be stored in a different way (e.g Compressed Row Storage). In this way you can operate on whole rows and about 50% of memory accesses can be avoided for this matrix, since you operate on temporary results and dont have to load and write each time the values from the resulting array. At the end we can compare our designs Good luck. 

April 16, 2008, 14:21 
Hey Andre, thanks for the resp

#7 
New Member
Bryan Clodfelter
Join Date: Mar 2009
Location: Champaign, IL, USA
Posts: 4
Rep Power: 9 
Hey Andre, thanks for the response. You're clearly ahead of me in terms of conceptual knowledge of CFD analysis (I'm still an undergrad). Is there a way that I can email you for faster communication?
With that said, at the moment, I'm simply looking to write a little C executable (running on a 3.0 GHz Pentium 4) and VHDL program (running on a Virtex II Pro) to demonstrate the superior capability of an FPGA to do these multiplyadd calculations. I have three weeks between now and then to finish this assignment, but I'll be continuing research on this project (at a muchfaster speed) in the summer and fall semesters. If this works out, we've been promised a couple of Virtex 5 boards to continue research. I've read a lot about CFD, and I'm unclear on a few points that hopefully you can help me with: 1. First, are the values used normally integers, floating point, or doubleprecision floating point? Doing floating point math on a FPGA is something I've never done beforeare there any libraries that I can use to get the job done? 2. I understand the (L + D + U)*x = y process. However, I can't find an example of how this works (from a programmer's standpoint) on a very, very small scale. If you had 1 diagonal cell and 6 offdiagonal elements, how exactly do you do the math? 3. From a bandwidth standpoint, why would you ever want to use PCIX? PCIExpress 2.0 has a 16 GB/sec bandwidth, and the upcoming PCIExpress 3.0 has 32 GB/sec bandwidth. I believe that Berkley's BEE3 project uses PCIE as well. 

April 16, 2008, 14:22 
Oh, and what about the possibi

#8 
New Member
Bryan Clodfelter
Join Date: Mar 2009
Location: Champaign, IL, USA
Posts: 4
Rep Power: 9 
Oh, and what about the possibility of onboard DDR SDRAM? I don't think that you accounted for that in your original post.


April 16, 2008, 16:57 
1. High likeley double precisi

#9 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
1. High likeley double precisious floating point, or single precision floating point. Forget about integers, I dont think they are supported. I will implement the DP since this is the default option.
I never build any of them, but I am 100% sure that someone on your university already build them, just ask your advisor. 2. It's very simple. Just write the equations out for a smalle case. Lets say 5 diagonal elements (which means a case with 5 cells) and 6 off diagonal elemens (3 upper and 3 lower, which means 3 internal faces). Look on the forum where Mattijs Janssens explained how the addressings work and where the nonzero entries come from (face sharing between cells). If you understand it, its very easy to understand how the matrix vector product works. Unrolling the equations and executing them in hardware as written now is a possibility but you have to remove read write dependencies from the resulting vector (I do not take this approach), but I remember I wrote an algorithm that reordered the instructions, but it's bad, cause different schemes like CRS need almost 50% less communication access. So you have to rewrite the matrix into a different format (I am working on that now) to reduce memory accesses and operate on rows. 3. Just to point the importance of communication out . All I am trying to say is you have to hide communication latency 4. I mentioned somewhere 50 MB on board memory. But it's still not enough as with the case with 1 million cells where at least 72 MB was necessary. You still have to build a multibuffering scheme. BTW: I dont understand anything of CFD's since you dont need to understand that for implementing the matrix vector product or the gaussseidel smoothers in hardware. Good luck (Finishing this in 3 weeks seems impossible ) 

April 17, 2008, 16:02 
Can anyone answer this questio

#10 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
Can anyone answer this question plz:
Is it always true that the lowerAddr list, as a consequence of the face ordering, ends up sorted? Thanks in advance. 

May 14, 2008, 07:16 
Hi again,
I already have wr

#11 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
Hi again,
I already have written code for my thesis to transform from the current addressing lists to an extended CRS format in lineair time. So consider the last question not being send. 

May 15, 2008, 15:30 
Hey Openfoamers,
Does anyon

#12 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
Hey Openfoamers,
Does anyone works with cells containing more then 16 faces? I know that there is practically no limit on the amount of faces for a cell, but If I limit this to 16 the hardware control can be simplified a lot. Regards 

May 15, 2008, 17:33 
 refine (2x2x2) all neighbour

#13 
Super Moderator
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,416
Rep Power: 17 
 refine (2x2x2) all neighbours of a cell but not the cell itself. So instead of 6 neighbours the cell now has 6X4 neighbours. This is a situation which will happen quite a lot in splithex meshing.
 for polyhedral meshes from e.g. Delaunay+polyDualMesh there is no limit on the number of faces. 

May 28, 2008, 05:48 
Mattijs you wrote somewhere ab

#14 
New Member
Andre
Join Date: Mar 2009
Posts: 19
Rep Power: 9 
Mattijs you wrote somewhere above: "What remains is the XXXXupdateMatrixInterfaces calls (nonconstant boundaries) ... "
Is there any relation between XXXupdateMatrixInterfaces and the matrix input or output vector, i.e. does it change any of these values (I don't understand the code )? I have not examined the code yet for the GaussSeidel solver, but I would like to know if the matrix stays fixed during runtime. Thanks 

May 29, 2008, 04:29 
Yes the XXXXupdateMatrixInterf

#15 
Super Moderator
Mattijs Janssens
Join Date: Mar 2009
Posts: 1,416
Rep Power: 17 
Yes the XXXXupdateMatrixInterfaces are e.g. processor interfaces. Our matrices are distributed, each processor holds part of the matrix. At every solver iteration an operation might require neighbour cell information. If the neighbour is on another processor how are you going to get this hold of this information?
So in the solver loop we have these 'interfaces' which do all the communication with neighbouring processors (through mpi) in the form of a swap of the neighbouring cell value. 

March 23, 2010, 08:54 

#16 
Member
Nick Gardiner
Join Date: Apr 2009
Location: Chichester, UK
Posts: 93
Rep Power: 9 
Hi Andre
I know this post is a bit old but could I ask a couple of questions?  namely: How far did you go in changing the code to suite compressed row format?  Did you just use the converter or did you rearrange the whole way that the solver uses matrices?  Is there any chance of seeing your thesis? thanks Nick 

Thread Tools  
Display Modes  


Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Testing of OpenFOAM 1.5beta Commenced  OpenFOAM discussion board administrator  OpenFOAM Announcements from ESIOpenCFD  0  May 19, 2008 05:22 
Testing of OpenFOAM 1.4alpha Commenced  OpenFOAM discussion board administrator  OpenFOAM Announcements from ESIOpenCFD  0  January 16, 2007 10:41 
Testing of OpenFOAM 1.3alpha Commenced  OpenFOAM discussion board administrator  OpenFOAM Announcements from ESIOpenCFD  0  February 7, 2006 08:31 