No one talking about FluidX3D???

November 10, 2022, 17:52   #21
LuckyTran (Lucky, Senior Member)
Quote:
Originally Posted by ptpacheco
Regarding the floating-point precision, he has this to say in the FAQ found in the project GitHub:

It was also validated for a couple of cases with analytic solutions, I believe. At least if you download it and try it out for yourself, you'll see a bunch of setups for Poiseuille flow, Taylor-Green vortices, a cylinder in a duct, etc.
The problem is not whether it is accurate or not (I trust the authors are not stupid and blatantly lying to the community). The point is, there is a very obvious capacity and bandwidth difference between FP64 results and FP16 results. Yes you can make it work in FP16, but that just reemphasizes my statement about engineering a problem to fit the message.

And actually, if you read the referenced works, the authors have to use their own custom FP16 representation, which means that work produced by this project will not conform to IEEE-754, and (if you are a company) you can lose your ISO accreditation if you blindly apply this code to your work. Granted, this project is not open to commercial use anyway, so you wouldn't be able to use the code in the first place. In this case the authors (like the good engineers and scientists that they are) have put in the effort to prove their FP16 approach works for these cases, but this is not something that everybody can do for every case they ever plan to simulate. The point of having industry standards is so people don't have to worry about these things. Imagine telling your boss you're going to risk your company's ISO certification... for fun... And if this FP16 gimmick isn't critical, well, then why did they need to use it?
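
For scale, a back-of-envelope sketch of the storage-side difference alone, assuming a D3Q19 lattice (19 distribution values per cell), a single stored copy of those values, and no auxiliary fields; the numbers are illustrative and not taken from any particular code:

Code:
Q = 19  # density distribution functions (DDFs) per cell for a D3Q19 lattice

for name, bytes_per_value in [("FP64", 8), ("FP32", 4), ("FP16", 2)]:
    bytes_per_cell = Q * bytes_per_value        # one stored copy of the DDFs
    cells_per_gb = (1024**3) // bytes_per_cell  # cells that fit in 1 GB of VRAM
    print(f"{name}: {bytes_per_cell:3d} B/cell, ~{cells_per_gb/1e6:5.1f} million cells per GB")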

What they achieved is quite remarkable and worthy of publication. But once you get past the ooohs and aaahs, this work is not easily replicable for the typical CFD'er. A100s are not your typical GPU, and certainly not eight of them. No one has the time to make custom FP units for every CFD problem. They show what can be done, but it is something that most people should not even attempt.


Just because some people have shown it's possible to set foot on the moon doesn't mean we're all gonna be walking on the moon any time soon.

November 10, 2022, 18:33   #22
flotus1 (Alex, Super Moderator)
Quote:
The point is, there is a very obvious capacity and bandwidth difference between FP64 results and FP16 results. Yes you can make it work in FP16, but that just reemphasizes my statement about engineering a problem to fit the message
I'm not quite sure I catch what you mean.
Using lower precision to get a smaller memory footprint and higher performance is perfectly reasonable IMHO, and not some hack to cheat your way to the top of some arbitrary performance metric you chose to chase. Could you clarify what you think is wrong with it?

Quote:
the authors have to use their own custom FP16 representation which means that work produced by this project will not conform to IEEE-754 and (if you are a company) you can lose your ISO accreditation if you blindly apply this code to your work
High FP precision is not a magic bullet that makes an algorithm produce correct results. Consequently, lower FP precision is not an indicator of worse or incorrect results; bad code without V&V is.
To be honest, this is the first time I have heard the claim that using anything other than IEEE-754 could, in and of itself, lead to an (IMO pointless, but that's a different discussion) ISO certificate being revoked.
Every commercial CFD software I have ever used came with some form of legal disclaimer that there is no guarantee of correctness of the results. That is always the engineer's responsibility.

November 10, 2022, 21:10   #23
LuckyTran (Lucky, Senior Member)
I don't think there's anything wrong with it. It's very reasonable. But again, it should be obvious that lower precision needs fewer bytes per value and lets you pass more values over a given bus. And then it should be obvious why you outperform higher-precision configurations when you straight up have a bytes-per-value advantage.

Of course a software vendor is going to include that disclaimer; they are selling a product, not making a guarantee. Try asking an OEM when the last time was that they reformatted the floating-point representation in their codes and then ran the verification tests to prove that it still works.

I thought I was pointing out very obvious facts about how bits and bytes work; I don't know why there is even any pushback. I never said anyone can't use a lower-precision standard. But try using a higher-precision standard and tell me it still performs the same. And this isn't even my personal criticism: the authors themselves state that LBM and CPU-based FVM are both RAM-bandwidth limited, and that their project (LBM + custom FP16 + more optimizations) is as fast as it is because it has a raw bytes-per-value advantage over existing commercial implementations. By the way, I've shared beers with people from the MH group at KIT; this isn't me dumpstering their work.
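
To put the bandwidth argument in numbers: if every cell update reads and writes each distribution value exactly once, the attainable update rate is simply memory bandwidth divided by bytes moved per update. A back-of-envelope sketch, assuming a D3Q19 lattice and a hypothetical 1 TB/s GPU (illustrative only, not a benchmark of any particular code):

Code:
Q = 19               # distribution values per cell (D3Q19)
BANDWIDTH = 1.0e12   # bytes/s, a hypothetical ~1 TB/s GPU

for name, bytes_per_value in [("FP64 storage", 8), ("FP32 storage", 4), ("FP16 storage", 2)]:
    bytes_per_update = 2 * Q * bytes_per_value  # one read + one write per DDF per update
    glups = BANDWIDTH / bytes_per_update / 1e9  # giga lattice updates per second
    print(f"{name}: {bytes_per_update:3d} B/update -> ~{glups:4.1f} GLUPs")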

November 12, 2022, 00:36   #24
arjun (Arjun, Senior Member)
Quote:
Originally Posted by sbaffini
While I am no expert in LBM, when quickly looking at the MSc thesis underlying the code I think I saw several verifications for channels (though I didn't check them very deeply) and what looked like a strong general dedication to code verification. Again, A LOT of work for an MSc thesis.

Having written and studied LBM codes (and even considered LBM for our own work), I can tell you that it can be very efficient in certain cases. However, there are a few major issues:

1. As the Reynolds number rises, it requires smaller and smaller dt to be stable (a small sketch after this list illustrates why). The majority of efforts in LBM research are around this issue; the ideas range from multi-relaxation schemes to entropic LBM, etc., and they are all trying to improve this part of the problem. PowerFlow does have something very good, but it is kept quite secret and not out in public (for the reason that it gives them an edge over such codes).

2. It is still an explicit code, and it's tough to beat implicit codes in general. For example, if I had to use the code from the OP for my golf ball simulation, it would be at least 2 times slower than what I had (in practice it might be even slower). This kind of answers the original question as to why not everyone is talking about it. (We did not think our code would change the world either.)
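
A minimal sketch of the dt/stability issue from point 1: for the plain BGK collision operator in lattice units (c_s^2 = 1/3), the viscosity is nu = (tau - 1/2)/3, so reaching a higher Reynolds number on a fixed grid at a fixed lattice velocity pushes tau toward its stability limit of 1/2, forcing finer grids (i.e. smaller dt) or more sophisticated collision operators. The resolution N and lattice velocity u below are placeholders:

Code:
N, u = 256, 0.1   # characteristic length in cells and lattice velocity (placeholders)

for Re in (100, 1_000, 10_000, 100_000):
    nu = u * N / Re        # lattice viscosity required to reach this Reynolds number
    tau = 3.0 * nu + 0.5   # BGK relaxation time, from nu = (tau - 0.5)/3
    print(f"Re = {Re:6d}: nu = {nu:.2e}, tau = {tau:.5f}")  # tau -> 0.5 is the stability limit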

November 12, 2022, 07:44   #25
JBeilke (Joern Beilke, Senior Member)
@arjun


Are you sure that PowerFlow is still an LB code? Years ago, terms like wall functions and turbulence models were used in connection with PowerFlow, so we suspected that the code had been transformed from LB into some sort of LES/DES.

November 13, 2022, 02:25   #26
arjun (Arjun, Senior Member)
Quote:
Originally Posted by JBeilke
@arjun

Are you sure that PowerFlow is still an LB code? Years ago, terms like wall functions and turbulence models were used in connection with PowerFlow, so we suspected that the code had been transformed from LB into some sort of LES/DES.
You can have LES etc. in LBM too. I will forward you (on Telegram) one of the PDFs that I used to have. If I remember correctly, they talked about it.

November 13, 2022, 04:14   #27
FMDenaro (Filippo Maria Denaro, Senior Member)
Quote:
Originally Posted by arjun
You can have LES etc. in LBM too. I will forward you (on Telegram) one of the PDFs that I used to have. If I remember correctly, they talked about it.

What they present as LES is something I was never able to really understand. It involves some relaxation time and other parameters that I cannot link directly to the concept of spatial filtering.
But I am not an expert in the LBM field.

November 14, 2022, 00:32   #28
arjun (Arjun, Senior Member)
Quote:
Originally Posted by FMDenaro
What they present as LES is something I was never able to really understand. It involves some relaxation time and other parameters that I cannot link directly to the concept of spatial filtering.
But I am not an expert in the LBM field.

Numerically, for them LES means they have to change the effective viscosity. That means they need to vary the relaxation parameter, which determines the effective viscosity. What is not clear, though, is how they achieve a RANS model. Based on the document, it seems they are solving a transport equation. If that is done in the typical FV or FD manner, then it would kill all the efficiency. So it is a little confusing.
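
For readers less familiar with LBM, a minimal sketch of what "LES by changing the effective viscosity through the relaxation parameter" amounts to, in lattice units with c_s^2 = 1/3; the Smagorinsky constant, molecular viscosity and strain-rate magnitude below are placeholders (in an actual code |S| is computed per cell, typically from the non-equilibrium moments):

Code:
C_S = 0.17            # Smagorinsky constant (typical literature value)
nu_molecular = 1e-4   # molecular viscosity in lattice units (placeholder)
S_mag = 0.02          # local strain-rate magnitude |S| (placeholder; in a real code it is
                      # computed per cell, typically from the non-equilibrium moments)

nu_t    = (C_S * 1.0)**2 * S_mag   # Smagorinsky eddy viscosity, filter width = 1 cell
nu_eff  = nu_molecular + nu_t      # effective viscosity for this cell
tau_eff = 3.0 * nu_eff + 0.5       # per-cell relaxation time, from nu = (tau - 0.5)/3

print(f"nu_t = {nu_t:.2e}, nu_eff = {nu_eff:.2e}, tau_eff = {tau_eff:.5f}")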

November 14, 2022, 10:49   #29
sbaffini (Paolo Lampitella, Senior Member)
Quote:
Originally Posted by FMDenaro
What they present as LES is something I was never able to really understand. It involves some relaxation time and other parameters that I cannot link directly to the concept of spatial filtering.
But I am not an expert in the LBM field.
I never actually investigated it deeply, but I think that Malaspinas and Sagaut have worked specifically on this. That is, they seem to suggest an alternative derivation with respect to the altered relaxation time: where the classical approach is based on simply recovering the effective viscosity, they seem to start from filtered LB equations, whatever sense those might make. Yet, I think they recover the classical approach under certain hypotheses (which I didn't care to understand).

November 20, 2022, 13:02   #30
ProjectPhysX (Moritz Lehmann, New Member)
Hi all, author of the FluidX3D code here
Thank you all for the discussion and feedback. I just want to clarify a few misunderstandings:



@ptpacheco

- largest drawback is the lack of data storage
It can export volumetric data too, but compared to other codes, it doesn't have to, because it can render directly from VRAM. File export is as fast as the hard drive allows, but I'm pushing the resolution so far that storing hundreds of gigabytes for just a single frame quickly becomes unfeasible. Rendering the data directly from VRAM reduces overall compute time by orders of magnitude.

- Besides it is open-source, unlike Powerflow.
No, it's not open-source. It's source-available: free to use and modify for public research, education and personal use. But I don't allow commercial and/or military use, which technically does make it not open-source.

- It seems like a great tool to deliver realistic visualizations in a short period of time.
It can do that, but also much more. The simulations in [1] would take years with any other software.




@arjun

- other solvers do not need same mesh for same accuracy
Exactly, the uniform cubic grid is a main disadvantage of LBM. Dynamic adaptive mesh refinement is a PITA to implement on GPU. There are reasons why PowerFlow runs only on slow CPUs so far.
But with my memory efficient implementation you can make the mesh fine everywhere and it's still much faster than FVM and faster than grid-refinement on CPU.

- BTW has he benchmarked the results against experimental values?
Yes, see [1] and especially its SI document, and [2-4]. I did more than enough validation for my own science use cases. Still, I need to validate Smagorinsky-Lilly and more turbulent test cases; I will have more time for that after I submit my PhD thesis in 2 weeks.




@sbaffini

- the F1 video you see is actually depicting the whole domain... ouch
Yes. Note that this was mainly intended as a quick test of the multi-GPU update I'm working on. A showcase of what the software could do in a few hours on an AMD GPU node.

- single GPU, which means no help from the parallel side
I'm still working on multi-GPU. The prototype works (see the F1 videos), but the code is not yet refactored enough to publish the update, and I still need to add multi-GPU support for some extensions like thermal flows. My multi-GPU approach works on any GPU from any vendor via OpenCL, without needing SLI/Crossfire/NVLink. Even combinations of AMD/Intel GPUs work, and no porting is required, unlike with all the other codes that use CUDA.

- no IO is also a big ouch
It can export volumetric data too, but compared to other codes, it doesn't have to, because it can render directly from VRAM. File export is as fast as the hard drive allows, but I'm pushing the resolution so far that storing hundreds of gigabytes for just a single frame quickly becomes unfeasible. Rendering the data directly from VRAM reduces overall compute time by orders of magnitude.

- it is a master thesis project
No, it's a hobby project. I extended and documented the code in my Master's thesis. I validated, used and further optimized it in my PhD, then entirely re-wrote it for clean code, that is what's in the repository now.

- actual CFD codes must do much more than this
LBM is a tool. It works really well for some use-cases, and if I used FVM for simulating 1600 raindrop impacts in [1] instead, I would still wait for simulation results for the next few years. But LBM of course has its limitations and does not work for everything.

- Of course, even the claim of fastest LBM code seems critical
Look at the roofline model efficiency. It is not physically possible to do better than the 80-90% I get. Coalesced memory access is at the maximum possible with in-place streaming [5].

- would I want to use this code
No idea. I wrote it and used it because I could not afford $120k for a software license, I could not afford $70k for professional GPUs, and I could not afford to wait >3 years for the simulations for my PhD.




@LuckyTran

- FP32 arithmetic and FP16 memory storage... Yes this is an optimization that makes it run faster but are you really going to tout that you have better performance than all other codes when you're not using FP64 like the industry expects you to?
It is exactly as accurate as FP64 in all but corner cases, even with FP16 memory compression; I have validated this in [2]. I only throw away the bits in the exponent that are not used at all, and the bits in the mantissa that are just numerical noise (a small sketch at the end of this block illustrates the storage-vs-arithmetic split).
Even in FP32 mode, it's faster than all other LBM implementations because I use Esoteric-Pull streaming with optimal memory coalescence [5]. Higher performance is physically not possible. Using FP64 makes zero sense because FP32 is indistinguishable from it at half the memory demand. With LBM, the errors do not mainly originate in floating-point anyway.

- Hey I can run faster than commercial codes if you just give me one of the fastest computers in the world!
The thing is, FluidX3D runs on any GPU from the last 10 years or so. Based on grid cell updates per second, it's ~150x faster than FVM on the *same* GPU hardware. I know FVM is more accurate at a lower cell count, so it's not an entirely fair comparison. Still, with 100x more cells, LBM is probably more accurate in many cases and still way faster. I don't need faster hardware to make that claim.

- this project will not conform to IEEE-754
Who says IEEE-754 is good? My FP16C format is more accurate than IEEE FP16 for this application. I use it only for memory compression, so it runs on any hardware. Also, I believe 32-bit Posit is far superior to FP32 and will replace it at some point in the future.

- No one has the time to make custom FP units for every CFD problem.
You don't need to. The number range of the LBM density distribution functions is the same for all LBM simulations, and I tailored the 16-bit formats to make use of this. It works for all cases except simulations with extremely low velocities/forces.
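
As referenced above, here is a minimal sketch of the general "FP32 arithmetic, 16-bit storage" split, using plain IEEE FP16 via NumPy for illustration only; the actual FP16S/FP16C formats are custom variants tuned to the DDF number range (see [2]), which is also why the value range chosen below matters:

Code:
import numpy as np

rng = np.random.default_rng(0)
f32 = rng.uniform(0.5, 2.0, size=1_000_000).astype(np.float32)  # working data, FP32

stored = f32.astype(np.float16)      # "compress" to 16 bits for storage in VRAM
loaded = stored.astype(np.float32)   # "decompress" back to FP32 before arithmetic

rel_err = np.abs(loaded - f32) / np.abs(f32)
print(f"storage: {f32.nbytes/1e6:.0f} MB -> {stored.nbytes/1e6:.0f} MB")
print(f"max relative round-trip error: {rel_err.max():.1e}")  # ~2^-11 ~ 5e-4 for IEEE FP16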




@flotus1
- lower precision FP is something that needs to be validated if you want to use the code for more than impressive animations
It is validated, and it is as accurate as FP64 in all but corner cases; see [2].

- That's probably a fair analogy for this project, and why it claims to be "the fastest".
Even with FP32 memory precision, my implementation is faster than all other LBM GPU implementations, because I use the Esoteric-Pull streaming scheme [5], which has the largest fraction of coalesced memory access, surpassing the previous best Esoteric-Twist scheme. FP16 memory compression just adds another factor of 2.

- done entirely during a masters thesis
I worked on this for the past 4 years, during my master's and my PhD.



@FMDenaro
- apart from some gaming-industry application, where the scientific validation of this project is?
Read [1-4].

- standard Smagorinsky model! Do you know what that means in terms of LBM?
I know there are better models than Smagorinsky nowadays; I used it for its simplicity. I'll have to do more validation on turbulent cases, and dig through the literature to maybe find something better. But the literature on turbulence models is just an error-prone, obfuscated mess.

- Faster and wrong, is this the destiny of CFD?
Hard to believe that CFD can be much faster and still accurate, huh? That is no reason to discredit my work out of disbelief.





[1] https://doi.org/10.1186/s43591-021-00018-8
[2] https://www.researchgate.net/publica...number_formats
[3] https://doi.org/10.15495/EPub_UBT_00005400
[4] https://doi.org/10.1016/j.jcp.2022.111753
[5] https://doi.org/10.3390/computation10060092


November 20, 2022, 22:20   #31
LuckyTran (Lucky, Senior Member)
You're using 4? 8? A100s that cost $10k+ apiece; how is this in the category of "I cannot afford $70k for professional GPUs"?

I know that industry standards are neither the best nor perfect, and in this instance you've done the work to show that it is accurate enough. Still, I would not go around telling people to try their best to violate them. If you believe in the superiority of your custom FP format, I recommend you (or anyone seeking to improve standards) join the next committee meeting and propose revisions to the standard, and/or propose fast methods for demonstrating that custom FP formats are good enough.

I think the work is cool, and I'm sure I'll talk to Mathias about it over a beer the next time I see him. For the 99% of people who might be exposed to this type of work, I would say: seek professional training, take some classes, get a degree before you attempt to replicate this work. At a minimum, actually read all the publications involved, because it isn't a trivial endeavor.

Anyway, it's not licensed for commercial use, so all my real-world concerns are rather moot points. And on that note, what percentage of CFD'ers who need large cell counts in their models are not doing it for work?

November 20, 2022, 22:57   #32
ProjectPhysX (Moritz Lehmann, New Member)
Quote:
Originally Posted by LuckyTran
You're using 4? 8? A100s that cost $10k+ apiece; how is this in the category of "I cannot afford $70k for professional GPUs"?

I know that industry standards are neither the best nor perfect, and in this instance you've done the work to show that it is accurate enough. Still, I would not go around telling people to try their best to violate them. If you believe in the superiority of your custom FP format, I recommend you (or anyone seeking to improve standards) join the next committee meeting and propose revisions to the standard, and/or propose fast methods for demonstrating that custom FP formats are good enough.
I didn't have access to JSC hardware when I did my study on raindrops; I got the access only very recently. A100/MI200 didn't even exist back then.

IEEE floating-point works great for general use. I'm still doing all arithmetic in FP32, but I'm packing the FP32 numbers into 16 bits to cut the memory demand in half, only keeping the bits that actually contain information. 16 bits is not a lot, and the IEEE FP16 format is too general to work well for this. Pre-scaling to lower the number range ("FP16S") is enough to make it work. Yet my custom "FP16C" format, with 1 bit moved from the exponent to the mantissa, is more accurate in some cases, as it halves the truncation error.
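
The bit-budget arithmetic behind that statement, assuming an IEEE-style exponent bias (the exact FP16S/FP16C definitions are in [2]; the two layouts below are just the bit counts described above):

Code:
# 16-bit layouts: 1 sign bit + e exponent bits + m mantissa bits
layouts = {
    "IEEE FP16 (e=5, m=10)": (5, 10),
    "custom    (e=4, m=11)": (4, 11),  # one bit moved from exponent to mantissa
}

for name, (e, m) in layouts.items():
    rel_err = 2.0 ** -(m + 1)   # worst-case relative rounding error
    bias = 2 ** (e - 1) - 1     # IEEE-style exponent bias
    print(f"{name}: rel. error ~{rel_err:.1e}, normal range ~2^{1 - bias} .. ~2^{bias + 1}")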

I don't see why making better use of the bits would be bad. Even hardware companies deviate from IEEE, with the 19-bit "TF32" format and several FP8 variants. If it works better for a particular application, I see no reason to stick to the IEEE standard and its limitations.

November 20, 2022, 23:14   #33
LuckyTran (Lucky, Senior Member)
The snake keeps rearing its head, so I need to make something very clear. There are two very, very different scenarios:

1) Meeting and exceeding ASME/ASTM/ISO/IEC/IEEE/etc. standards is one thing. That is exceptional engineering practice which rarely ever occurs unless the standard itself is directly revised, and when it does happen, the improvements are generally kept at the level of trade secrets. Something as simple as using an accurate database for fluid properties is something I've only ever personally seen happen once in my line of work (and that was when I used REFPROP myself, personally).

2) Blatantly ignoring the standard is wilful ignorance. Most people who blindly look at this work without carrying out all the work necessary to prove they actually exceed the standard are in this latter category. And to be even clearer: I am 99.99% certain you are in category (1) and not this latter category. I am worried about the next person.

Most publications have clear editorial policies, for example: experiments should have detailed uncertainty calculations with full propagation of error, and CFD should have mesh sensitivity studies. In most papers, neither of these is actually done to the standard, because people got tired after doing it only once and never want to do it again. It is hard enough getting people to do the minimum work.

November 21, 2022, 09:16   #34
arjun (Arjun, Senior Member)
Quote:
Originally Posted by ProjectPhysX

@arjun

But with my memory efficient implementation you can make the mesh fine everywhere and it's still much faster than FVM and faster than grid-refinement on CPU.


1. At the last automotive CFD workshop they provided the grids for the car, and it took 4 hours (200 million cells) to get the results out. The results match well with the experimental values, so why would I spend 7 hours with this?

This was with a 2nd-order solver. With a coarser mesh, the results would be there in less than an hour.

The point is that finite volume solvers' users do not need it. (This is the reason why PowerFlow was not able to take market share from other commercial codes.)


2. More efficient codes can be created even with Cartesian meshes (they already exist!).

November 21, 2022, 10:33   #35
flotus1 (Alex, Super Moderator)
Quote:
Originally Posted by ProjectPhysX
@flotus1
- That's probably a fair analogy for this project, and why it claims to be "the fastest".
Even with FP32 memory precision, my implementation is faster than all other LBM GPU implementations, because I use the Esoteric-Pull streaming scheme [5], which has the largest fraction of coalesced memory access, surpassing the previous best Esoteric-Twist scheme. FP16 memory compression just adds another factor of 2.
Hey, thanks for stopping by, despite the rather hostile atmosphere here.
Just so we don't have a misunderstanding: I put "the fastest" in quotation marks not because I doubt the claim; the code is certainly a very fast LB implementation in terms of lattice updates per second (LUPS).
The quotation marks are there to indicate my opinion that the highest LUPS does not necessarily equal the fastest time to solution. A similar LB code that sacrifices some of those LUPS, e.g. for grid interfaces, could still get a simulation done in less time.
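
In rough numbers, the time-to-solution point looks like this; all values below are placeholders, not measurements of any particular code:

Code:
def runtime_hours(cells, steps, lups):
    # time to solution ~ total cell updates / update rate
    return cells * steps / lups / 3600.0

uniform = runtime_hours(cells=8e9, steps=200_000, lups=10e9)  # brute-force uniform grid, max LUPS
refined = runtime_hours(cells=1e9, steps=200_000, lups=3e9)   # 8x fewer cells, slower kernel

print(f"uniform grid: {uniform:5.1f} h")
print(f"refined grid: {refined:5.1f} h  (despite ~3x lower raw LUPS)")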

Did you ever find that the choices you made to achieve maximum LUPS were detrimental to other aspects of the code you would like to extend on?
I would imagine that having a single distribution function, along with the highly optimized streaming step, might cause problems when trying to implement e.g. different formulations for boundary conditions.

November 21, 2022, 11:06   #36
sbaffini (Paolo Lampitella, Senior Member)
Quote:
Originally Posted by ProjectPhysX
@sbaffini

- the F1 video you see is actually depicting the whole domain... ouch
Yes. Note that this was mainly intended as a quick test of the multi-GPU update I'm working on. A showcase of what the software could do in a few hours on an AMD GPU node.

- single GPU, which means no help from the parallel side
I'm still working on multi-GPU. The prototype works (see the F1 videos), but the code is not yet refactored enough to publish the update, and I still need to add multi-GPU support for some extensions like thermal flows. My multi-GPU approach works on any GPU from any vendor via OpenCL, without needing SLI/Crossfire/NVLink. Even combinations of AMD/Intel GPUs work, and no porting is required, unlike with all the other codes that use CUDA.

- no IO is also a big ouch
It can export volumetric data too, but compared to other codes, it doesn't have to, because it can render directly from VRAM. File export is as fast as the hard drive allows, but I'm pushing the resolution so far that storing hundreds of gigabytes for just a single frame quickly becomes unfeasible. Rendering the data directly from VRAM reduces overall compute time by orders of magnitude.

- it is a master thesis project
No, it's a hobby project. I extended and documented the code in my Master's thesis. I validated, used and further optimized it in my PhD, then entirely re-wrote it for clean code, that is what's in the repository now.

- actual CFD codes must do much more than this
LBM is a tool. It works really well for some use-cases, and if I used FVM for simulating 1600 raindrop impacts in [1] instead, I would still wait for simulation results for the next few years. But LBM of course has its limitations and does not work for everything.

- Of course, even the claim of fastest LBM code seems critical
Look at the roofline model efficiency. It is not physically possible to do better than the 80-90% I get. Coalesced memory access is at the maximum possible with in-place streaming [5].

- would I want to use this code
No idea. I wrote it and used it because I could not afford $120k for a software license, I could not afford $70k for professional GPUs, and I could not afford to wait >3 years for the simulations for my PhD.
I will speak for myself, and I say that you don't need to defend your work here. I'm pretty sure you know your code exactly: its use cases, its best scenarios, and why you needed it. Just as you know its limitations. How could I know better what you need and how you need it?

My answers were for the user asking why nobody is talking about your code here, and those were the first that came to my mind, given what I could understand of the code in that time frame. When I say that a CFD code must do much, much more than that, it is not meant negatively toward your code; it is just the objective truth, at least for the sort of jobs I am used to doing. Which may include multiple physical models, or just more mundane stuff that nonetheless helps you take the job home. I mean, even OpenFOAM only remotely resembles an actual CFD code in good shape.

It is simply clear that you have a research code that does perfectly what you meant it to do, which is just a different thing from an actual CFD code to be used in production in the most diverse environments. I probably have doubts about most LBM codes out there in this respect, if that makes anything better. This is what I wanted the OP to get as an answer to his question.

What I may be more critical about is the claim of a best-in-class code. Formally speaking, you just prove that YOUR code has a certain performance on certain hardware. I haven't seen any comparison with any other code out there, LBM or not. So it could easily be true, but you don't show anything to back up that claim. Indeed, you don't even seem to make that claim in your papers, as far as I can tell. So, in fact, it is not even a real concern.

Also because, how can you compare different things doing different stuff? I would honestly only take seriously a comparison with a code of similar capabilities.

November 21, 2022, 13:04   #37
arjun (Arjun, Senior Member)
Quote:
Originally Posted by flotus1
Hey, thanks for stopping by, despite the rather hostile atmosphere here.

Naa not really.

It's just the same discussion that we have had for the last 15 years: LBM is fast and it will take over the world.


This keeps coming up in various forms, and the answer is that it has not taken over the world.

PS: Not saying that the OP claimed such, but with regard to LBM this feeling keeps coming back.

November 22, 2022, 15:54   #38
ProjectPhysX (Moritz Lehmann, New Member)
Quote:
Originally Posted by flotus1
Did you ever find that the choices you made to achieve maximum LUPS were detrimental to other aspects of the code you would like to extend on?
I would imagine that having a single distribution function, along with the highly optimized streaming step, might cause problems when trying to implement e.g. different formulations for boundary conditions.
In-place streaming made the free surface extension a bit more difficult, requiring an additional kernel to avoid a race condition. But it worked out eventually.

Some people want symmetry/free slip boundaries. These are difficult in 2-grid streaming already, and I haven't yet wrapped my head around how to do them in 1-grid schemes. Should still be possible, but probably not easy.

Interpolated bounce-back boundaries also seem more difficult in 1-grid schemes, but they become obsolete nowadays as VRAM allows for large enough grids, and large grids are needed anyway to resolve the turbulence.

Fortunately, the streaming implementation is independent of the collision operator, and LBM is so far into the bandwidth limit that all collision operators, even those with a matrix multiplication, perform the same; performance is set by the memory access pattern during streaming. I'm using BGK by default, as it proved superior to TRT/MRT in most cases. But there is still room to try cumulant and central-moment operators, without any expected loss in performance.
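
For reference, a one-cell sketch of the BGK collision referred to here, with the equilibrium left as a placeholder rather than computed from density and velocity; the point is that any other collision operator would still read and write the same 19 values per cell:

Code:
import numpy as np

tau  = 0.6                                                # relaxation time (sets the viscosity)
f    = np.full(19, 1.0/19) + 1e-3*np.sin(np.arange(19))   # one cell's 19 DDFs (perturbed)
f_eq = np.full(19, 1.0/19)                                # equilibrium placeholder (normally
                                                          # computed from density and velocity)

f_post = f - (f - f_eq) / tau   # BGK: relax every DDF toward its local equilibrium

# Any other collision operator (TRT, MRT, cumulants, ...) replaces this one line,
# but still reads and writes the same 19 values per cell, so memory traffic is unchanged.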

November 24, 2022, 15:55   #39
aerosayan (Sayan Bhattacharjee, Senior Member)
Quote:
Originally Posted by ptpacheco
As far as I can tell the largest drawback is the lack of data storage (to minimize IO, only rendered frames are saved).

I like the project, but I'm not sold on the premise. We need the data; not being able to store it is bad. Rendering frames out might be useful to some, but not for the cases I'm interested in. For me, CFD is not colorful rendered frames, it's the boring old graphs we get of pressure, velocity and temperature distributions.
Attached image: nasa-fun3d.png

November 25, 2022, 02:25   #40
ProjectPhysX (Moritz Lehmann, New Member)
Quote:
Originally Posted by aerosayan
I like the project, but I'm not sold on the premise. We need the data; not being able to store it is bad. Rendering frames out might be useful to some, but not for the cases I'm interested in. For me, CFD is not colorful rendered frames, it's the boring old graphs we get of pressure, velocity and temperature distributions.
I think you misunderstood that. FluidX3D can of course export the data and do IO like any other software.

But unlike other software, it has the option to render directly from VRAM, which allows for very fast, and at lower resolutions even interactive, previews of simulations.

In some of the YouTube videos, the grid resolution is so large (10 billion cells) that a single frame of the velocity field alone is 120 GB, and every ~40 seconds a new frame (180 LBM time steps) is computed and ready for rendering/export. You can imagine how fast that would fill all available hard drives; this is simply not feasible without built-in rendering. Other software could not even handle such cases.
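
A quick sanity check on those numbers, using only the figures quoted above (10 billion cells, 3 velocity components at FP32, a new frame every ~40 seconds):

Code:
cells = 10e9                      # grid cells in the largest runs mentioned above
bytes_per_frame = cells * 3 * 4   # 3 velocity components per cell, FP32 export
frames_per_hour = 3600 / 40       # a new frame roughly every 40 seconds

print(f"{bytes_per_frame/1e9:.0f} GB per velocity frame, "
      f"~{bytes_per_frame*frames_per_hour/1e12:.0f} TB per hour of wall-clock time")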