No one talking about FluidX3D???

November 9, 2022, 10:28   #1
Pedro Pacheco (ptpacheco), New Member, Portugal
Surprised no one is talking about this here in the forums... Super optimized open-source GPU solver FluidX3D apparently blows all other CFD software out of the water in terms of time-to-solution.


https://www.youtube.com/watch?v=q4rNIbqvyQI

https://github.com/ProjectPhysX/FluidX3D



As far as I can tell the largest drawback is the lack of data storage (to minimize IO, only rendered frames are saved). Haven't really played around with it so can't really say that much, but would be interested in hearing the opinions of veteran CFDers regarding this development!
sina_989, flotus1 and lourencosm like this.

November 9, 2022, 11:12   #2
Arjun (arjun), Senior Member, Nuremberg, Germany
Quote:
Originally Posted by ptpacheco
Surprised no one is talking about this here in the forums... Super optimized open-source GPU solver FluidX3D apparently blows all other CFD software out of the water in terms of time-to-solution.

If you assume the same mesh, then maybe, but other solvers do not need the same mesh for the same accuracy. Take higher-order solvers, for example: they can achieve the same accuracy on much coarser meshes.

Also, for the same mesh, it needs to compare itself to Exa's PowerFLOW, which is the same animal in a more efficient and powerful form. :-D
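To make the "much coarser mesh" point concrete, here is a rough scaling sketch (my own idealization, not from this thread: it assumes smooth solutions and clean convergence, so the error behaves like C*h^p with the same constant for both schemes):

Code:
# Rough scaling sketch: cells needed (per unit volume, 3D) to reach a target
# discretization error, assuming error ~ C * h^p with the same constant C
# for both schemes -- an idealization, real convergence constants differ.

def cells_3d(target_error, order, C=1.0):
    h = (target_error / C) ** (1.0 / order)   # required grid spacing
    return (1.0 / h) ** 3                     # cell count per unit volume

for err in (1e-2, 1e-3, 1e-4):
    n2 = cells_3d(err, order=2)   # second-order scheme
    n4 = cells_3d(err, order=4)   # fourth-order scheme
    print(f"error {err:.0e}: 2nd order ~{n2:.1e} cells, "
          f"4th order ~{n4:.1e} cells, ratio ~{n2 / n4:.0f}x")

Real constants, boundary layers and limiters erode this advantage in practice, but it is the shape of the argument for why higher-order methods can live on much coarser meshes.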

November 9, 2022, 11:21   #3
Pedro Pacheco (ptpacheco), New Member, Portugal
Quote:
Originally Posted by arjun
If you assume the same mesh, then maybe, but other solvers do not need the same mesh for the same accuracy. Take higher-order solvers, for example: they can achieve the same accuracy on much coarser meshes.

Also, for the same mesh, it needs to compare itself to Exa's PowerFLOW, which is the same animal in a more efficient and powerful form. :-D

Well, according to the author's claims, he simulated the external aerodynamics of an F1 car on a 2144×4288×1072 grid for 0.5 seconds of physical time in 7 hours (8x AMD Instinct MI200 GPUs)... I don't believe this computation speed is possible with PowerFLOW, though I might be wrong.

Besides, it is open-source, unlike PowerFLOW.
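For a sense of scale, a quick back-of-the-envelope check of that grid (the D3Q19 lattice and the per-cell byte counts below are my assumptions, not numbers from the author):

Code:
# Back-of-the-envelope size of the claimed F1 grid.
# Assumes a D3Q19 lattice with one set of distribution functions per cell
# (in-place streaming); the actual FluidX3D memory layout may differ.

nx, ny, nz = 2144, 4288, 1072
cells = nx * ny * nz
print(f"cells: {cells:.3e}")                 # roughly 9.9 billion cells

bytes_fp32 = 19 * 4                          # 19 populations stored in FP32
bytes_fp16 = 19 * 2                          # 19 populations stored in FP16
gib = 1024**3
print(f"populations only, FP32: {cells * bytes_fp32 / gib:.0f} GiB")
print(f"populations only, FP16: {cells * bytes_fp16 / gib:.0f} GiB")
# Even with FP16 storage this does not fit on a single GPU, which is at
# least consistent with the 8-GPU setup mentioned in the claim.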

November 9, 2022, 11:35   #4
Alex (flotus1), Super Moderator, Germany
PowerFLOW doesn't necessarily need to be the fastest in order to compete with that. It has grid refinement, which, depending on the case, can reduce the computational effort by orders of magnitude.
I could be mistaken, but it sounds like this project is currently limited to a single "cell" size. Cool stuff nonetheless, but I consider this feature mandatory for more mainstream appeal.
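A toy example of what "orders of magnitude" can mean here (the numbers are purely illustrative and not taken from any of the codes discussed): resolve a small box around the body finely and the rest of the domain coarsely, versus using the fine spacing everywhere.

Code:
# Toy comparison: uniform fine grid vs. local refinement.
# All numbers are illustrative only.

domain = (20.0, 10.0, 10.0)       # domain size, in body lengths say
fine_box = (2.0, 1.0, 1.0)        # refined region around the body
h_fine, h_coarse = 0.005, 0.04    # grid spacings, same length units

def cells(box, h):
    return (box[0] / h) * (box[1] / h) * (box[2] / h)

uniform = cells(domain, h_fine)
refined = cells(fine_box, h_fine) + (cells(domain, h_coarse) - cells(fine_box, h_coarse))

print(f"uniform fine grid : {uniform:.2e} cells")
print(f"locally refined   : {refined:.2e} cells")
print(f"ratio             : {uniform / refined:.0f}x")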

November 9, 2022, 11:39   #5
Pedro Pacheco (ptpacheco), New Member, Portugal
Quote:
Originally Posted by flotus1
I could be mistaken, but it sounds like this project is currently limited to a single "cell" size.

How do you mean? It is lattice Boltzmann, but I believe you can set whatever grid density you'd like.

November 9, 2022, 11:47   #6
Alex (flotus1), Super Moderator, Germany
I'm talking about the ability to represent different parts of the computational domain with different cell sizes. I'll keep using the term cell size when referring to the minimum distance between two adjacent collision centers.
When limited to a single cell size, you keep running into the issue that you can't have a fine enough resolution in the interesting parts while wasting a lot of collision centers in parts where they are unnecessary, and/or limiting the size of the computational domain you can work with.

By the way, this isn't me talking trash about this project. I know how challenging multiple resolutions are for LBM. It's probably a good idea to leave that for later in a GPU code.
I just think it's something to be aware of. And it will be one of the reasons for the high performance.
arjun likes this.

November 9, 2022, 12:18   #7
Arjun (arjun), Senior Member, Nuremberg, Germany
Quote:
Originally Posted by ptpacheco
Well, according to the author's claims, he simulated the external aerodynamics of an F1 car on a 2144×4288×1072 grid for 0.5 seconds of physical time in 7 hours (8x AMD Instinct MI200 GPUs)... I don't believe this computation speed is possible with PowerFLOW, though I might be wrong.

Besides, it is open-source, unlike PowerFLOW.

Well, PowerFLOW can be stable at much higher time-step sizes; that's where it shines. If a GPU version of PowerFLOW existed, it would leave FluidX3D in the dust, based on what I have seen so far from them.

In any case, higher-order methods with implicit time stepping could be much more efficient.

By the way, has he benchmarked the results against experimental values? That is where my interest lies. It does make for a good animation, though; that much is true.
lpz456 and allanZHONG like this.

November 9, 2022, 15:10   #8
Paolo Lampitella (sbaffini), Senior Member, Italy
Quote:
Originally Posted by ptpacheco
Surprised no one is talking about this here in the forums... Super optimized open-source GPU solver FluidX3D apparently blows all other CFD software out of the water in terms of time-to-solution.


https://www.youtube.com/watch?v=q4rNIbqvyQI

https://github.com/ProjectPhysX/FluidX3D



As far as I can tell the largest drawback is the lack of data storage (to minimize IO, only rendered frames are saved). Haven't really played around with it so can't really say that much, but would be interested in hearing the opinions of veteran CFDers regarding this development!
The code (apparently) does a couple of things very well, but not many others. The main drawbacks I see are:

- uniform grid density, which also means that the F1 video you see is actually depicting the whole domain... ouch

- single GPU, which means no help from the parallel side in alleviating the problem above

- no IO is also a big ouch

- it is an LBM code, which has its own limitations; it is not the perfect world everybody keeps preaching at every street corner

- it is a master's thesis project, and both the source code and the development repository that contains it, as they stand today, reflect all the limitations you would expect from such a project

It is outstanding work for a master's thesis, but it is clearly focused on speed and, I guess, visualization while staying accurate. Which is fine, but actual CFD codes must do much more than this.

I mean, MUCH MUCH MUCH MORE THAN THIS; if you have had a look at the source code, you have seen that it actually does nothing besides advancing the equations in time.
lpz456 likes this.

November 9, 2022, 17:05   #9
Lucky (LuckyTran), Senior Member, Orlando, FL USA
This project in its entirety is a great example of how you can engineer a problem to fit the message you want to give.

Also, I read some quick notes by the author saying it uses FP32 arithmetic and FP16 memory storage... Yes, this is an optimization that makes it run faster, but are you really going to tout better performance than all other codes when you're not using FP64 like the industry expects you to? And what's stopping someone from running an even faster FP8 simulation...? There have to be some standards...

By the way, it's obvious to anyone who knows anything about GPU computing why it was NOT done in FP64, but let's just sweep that under the rug.

That being said, the output and the YouTube video are pretty nice for getting people interested in the topic, in the same way Bill Nye the Science Guy got a lot of people interested in science.

November 9, 2022, 17:15   #10
Alex (flotus1), Super Moderator, Germany
I wouldn't consider lower precision an inherent flaw of any code.
There are lots of simulation types with real-world relevance that would not benefit from FP64 at all. And you can certainly use some magic to limit the impact of lower precision FP calculations.
And of course, since it's a GPU code, I rather welcome the option to run in FP32. GPUs with noteworthy FP64 performance are few and far between and expensive, and GPU memory is always scarce. Having single precision lowers the bar for entry here.
I haven't looked into what exactly is going on here with half precision though.
But yeah, lower precision FP is something that needs to be validated if you want to use the code for more than impressive animations.
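For readers unfamiliar with the idea, here is a minimal numpy sketch of the generic mixed-precision pattern (FP16 in memory, FP32 for the arithmetic). It is not FluidX3D's actual scheme, which according to the author uses custom 16-bit formats and other error-reducing tricks; it only shows why the arithmetic precision matters more than the storage precision for something like a long accumulation:

Code:
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(100_000)                    # some field values, FP64 reference

stored = f.astype(np.float16)              # keep the field in memory as FP16

# mixed-precision pattern: load FP16, do the arithmetic in FP32
acc_fp32 = stored.astype(np.float32).sum(dtype=np.float32)

# naive alternative: do the arithmetic in FP16 as well
acc_fp16 = stored.sum(dtype=np.float16)

ref = f.sum()                              # FP64 reference result
print(f"FP16 storage + FP32 arithmetic, relative error: {abs(acc_fp32 - ref) / ref:.1e}")
print(f"FP16 storage + FP16 arithmetic, relative error: {abs(acc_fp16 - ref) / ref:.1e}")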
lourencosm and zyzycomcn like this.
flotus1 is offline   Reply With Quote

Old   November 9, 2022, 17:40
Default
  #11
Senior Member
 
Lucky
Join Date: Apr 2011
Location: Orlando, FL USA
Posts: 5,680
Rep Power: 66
LuckyTran has a spectacular aura aboutLuckyTran has a spectacular aura aboutLuckyTran has a spectacular aura about
Of course, precision is not a coding flaw in itself, but it opens the door to exploits that can be (and in this case are) misleading.

For example, you can use an FFT algorithm on datasets that are 2^N in length and equally spaced and claim you "outperform" other codes that brute-force the full non-uniform discrete Fourier transform on arbitrary-length signals with arbitrary spacing. You technically did it faster, but is that fair?

Now, if the entire industry comes together and says we only care about signals that are equally spaced and whose lengths are powers of 2 (because real-world signals always come this way, right!?), then you have a basis for comparison, even if the bar has been lowered. In signal processing that is exactly what has happened. But this lowering of the bar has not yet happened in CFD.
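The analogy is easy to reproduce with a few lines of numpy (timings are illustrative only and will vary by machine):

Code:
import time
import numpy as np

n = 2**12                                    # power-of-two length, uniform sampling
x = np.random.default_rng(1).random(n)

t0 = time.perf_counter()
X_fft = np.fft.fft(x)                        # O(n log n) on the "nice" input
t_fft = time.perf_counter() - t0

# brute-force DFT matrix, the kind of cost a fully general transform is stuck with: O(n^2)
k = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(k, k) / n)
t0 = time.perf_counter()
X_dft = W @ x
t_dft = time.perf_counter() - t0

print(f"max |FFT - brute force|: {np.max(np.abs(X_fft - X_dft)):.1e}")
print(f"FFT:         {t_fft * 1e3:8.3f} ms")
print(f"brute force: {t_dft * 1e3:8.3f} ms   (same answer, restricted input, big speed gap)")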
sbaffini and arjun like this.

November 9, 2022, 18:13   #12
Alex (flotus1), Super Moderator, Germany
Quote:
Originally Posted by LuckyTran
For example, you can use an FFT algorithm on datasets that are 2^N in length and equally spaced and claim you "outperform" other codes that brute-force the full non-uniform discrete Fourier transform on arbitrary-length signals with arbitrary spacing. You technically did it faster, but is that fair?
That's probably a fair analogy for this project, and why it claims to be "the fastest".

But let's not get carried away here. This seems to be a fairly small project. If it was really done entirely during a master's thesis, it is pretty amazing.
I don't think "lowering the bar" is something we need to be afraid of. If you want to get into CFD solvers these days, the only option is finding a niche. You won't catch up to the versatility and robustness of established commercial solvers.
Smaller projects aiming to be better at some specific aspects are a benefit to the CFD community. They won't replace today's multiphysics behemoths, or lower any standards. It is still our job as CFD experts to choose the right tool for the job. And if that job happens to fall into a niche covered by a more specialized code, we have that option.
Note that this isn't me saying I would use this code in its current form to produce CFD results.
lpz456 likes this.

November 9, 2022, 20:58   #13
Lucky (LuckyTran), Senior Member, Orlando, FL USA
Well, to muddy the waters a bit... it turned into a PhD project and then some.

Yes, there are some things that can be done very quickly using LBM and such. In particular, I know Sony is very interested in the raindrop simulation work (something they actively pursued for the development of the PS5), and there are very real business opportunities for this category of work. However, even when looking at this specific application...

It was also run on the Jülich supercomputing cluster. As someone who has personal experience running things on JUROPA (the slowest subcomplex out of JCC), I can confirm it is not your typical workstation or even your typical compute cluster: it has all the infrastructure you would expect from a machine of that class, plus staff who can provide support. Where else can you load 144 TB of raw velocity data into RAM and have it never touch a hard disk (physical or virtual)? What I find most impressive is the fast rendering of 10 billion voxels.

The more you look into it, the more over-the-top you'll find some of the nuances. What I want to say is: kids, don't try this at home! This is a professional stunt performed by trained professionals! It is a super-optimized open-source project (in fine print: optimized to run in a supercomputer environment). Hey, I can run faster than commercial codes too if you just give me one of the fastest computers in the world! Gee, if you had that kind of hardware, I would hope you actually do stomp everyone else!
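Those figures do hang together, for what it's worth; a rough consistency check (the three FP32 velocity components per voxel are my assumption, not a stated number):

Code:
# Rough consistency check of the quoted figures.
voxels = 10e9                      # "10 billion voxels"
bytes_per_voxel = 3 * 4            # assume u, v, w stored in FP32
snapshot = voxels * bytes_per_voxel
total = 144e12                     # "144 TB of raw velocity data"

print(f"one velocity snapshot: {snapshot / 1e9:.0f} GB")
print(f"snapshots fitting in 144 TB: {total / snapshot:.0f}")
# So 144 TB corresponds to on the order of a thousand full-field snapshots
# held in memory, which is only plausible on a machine of that class.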
lpz456 likes this.

November 10, 2022, 02:14   #14
Arjun (arjun), Senior Member, Nuremberg, Germany
Quote:
Originally Posted by LuckyTran
Hey, I can run faster than commercial codes too if you just give me one of the fastest computers in the world! Gee, if you had that kind of hardware, I would hope you actually do stomp everyone else!

All these comparisons assume that someone else has to do the same thing in another environment. But most of the time, they do not have to do the same thing to achieve the same or similar results.

I worked on golf ball research from 2004 to 2012. At that time we initially worked with Arizona State University, and they had a DNS code that ran simulations for us.
It was an explicit code, so the time step was dictated by the physics. The same is the case here with lattice Boltzmann codes.

Now, when I wrote the same thing at our place, I wrote a SIMPLE solver on a Cartesian grid. That meant we could go for much larger time steps. ASU needed dt = 1e-7 or so for our case, and we fixed dt at 1e-5.

We had a specialised pressure solver based on FFT (ASU used an FFT-based solver too). So our cost per time step was around 5 to 10 times that of ASU (but the dt was larger).

We took 5 to 6 days for 0.35 seconds of simulated time using 570 or so processors for a 3-billion-cell case. The maximum velocity near the ball is around 90 m/s, and the dimple depth is 0.15 mm (we had at least 10 cells across this depth).

Now, coming to the car: that also has maximum velocities of around 70 m/s in the domain, but the cell sizes are much larger, so the dt is quite relaxed. In that case, 7 hours of simulation seems quite inefficient to me compared to what we managed in 2010 without GPUs (they had double the mesh size, but still).

So it all depends.


PS: The golf ball was spinning at around 2500 RPM too.
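To spell out the trade-off in numbers, a crude cost model using the figures quoted above (everything else idealized: same grid, perfect scaling, arbitrary cost units):

Code:
# Crude cost model: total cost = (simulated time / dt) * cost per step.
# The dt values and the cost ratio follow the golf-ball example above.

t_sim = 0.35                      # seconds of physical time

dt_explicit = 1e-7                # explicit DNS-style step (stability-limited)
dt_implicit = 1e-5                # implicit SIMPLE-style step
cost_ratio = 10                   # implicit step assumed ~10x more expensive

steps_explicit = t_sim / dt_explicit
steps_implicit = t_sim / dt_implicit

cost_explicit = steps_explicit * 1.0
cost_implicit = steps_implicit * cost_ratio

print(f"explicit: {steps_explicit:.1e} steps, relative cost {cost_explicit:.1e}")
print(f"implicit: {steps_implicit:.1e} steps, relative cost {cost_implicit:.1e}")
print(f"implicit is ~{cost_explicit / cost_implicit:.0f}x cheaper under these assumptions")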

November 10, 2022, 04:51   #15
Filippo Maria Denaro (FMDenaro), Senior Member
I just had a look at your discussion... apart from some gaming-industry applications, where is the scientific validation of this project?

What is the final goal of the project? If you have a look at the YouTube video, it is presented as an LES with the standard Smagorinsky model! Do you know what that means in terms of LBM?


Faster and wrong: is this the destiny of CFD?
lpz456 likes this.

November 10, 2022, 07:20   #16
Arjun (arjun), Senior Member, Nuremberg, Germany
Quote:
Originally Posted by FMDenaro

Faster and wrong: is this the destiny of CFD?

We personally moved to our own code because commercial tools did not cut it. With our own codes, we got good results compared to experiments.


You raise a good question, though. Accuracy is very important; I would say it should be the top priority.
lourencosm likes this.

November 10, 2022, 08:05   #17
Paolo Lampitella (sbaffini), Senior Member, Italy
Quote:
Originally Posted by FMDenaro
I just had a look at your discussion... apart from some gaming-industry applications, where is the scientific validation of this project?

What is the final goal of the project? If you have a look at the YouTube video, it is presented as an LES with the standard Smagorinsky model! Do you know what that means in terms of LBM?


Faster and wrong: is this the destiny of CFD?

While I am no expert in LBM, from a quick look at the MSc thesis underlying the code I think I saw several verifications for channels (though I didn't check them very deeply) and what looked like a generally strong dedication to code verification. Again, A LOT of work for an MSc thesis.

I think the project's goal has to be understood in order to put the code in perspective. Indeed, the actual publications on the code concern the coding part.

Of course, even the claim of being the fastest LBM code seems questionable (compared to which code, and on which hardware, exactly?), but it still seems in the realm of healthy naivety for an otherwise excellent MSc project.

So, if we stay within the frame of the original question, "why is no one talking about this here in the forums", I would summarize it like this: the code is close to useless for me, and 90% of its performance is due to obvious reasons that make it so, but I would not say it is bad work in general.

November 10, 2022, 10:02   #18
Pedro Pacheco (ptpacheco), New Member, Portugal
Quote:
Originally Posted by sbaffini
- uniform grid density, which also means that the F1 video you see is actually depicting the whole domain... ouch

- single GPU, which means no help from the parallel side in alleviating the problem above

- no IO is also a big ouch

- it is an LBM code, which has its own limitations; it is not the perfect world everybody keeps preaching at every street corner

- it is a master's thesis project, and both the source code and the development repository that contains it, as they stand today, reflect all the limitations you would expect from such a project

While I definitely agree with the 3rd and 4th points, I would counter-argue that the "extreme" optimization (if it really is as extreme as the benchmarks make it out to be) kind of removes the need for localized refinement, no? Just use a dense grid everywhere.

And I do believe FluidX3D works across multiple GPUs, according to this comment by the solver's developer on the YouTube video:

Quote:
"Commercial CFD software would take months for a simulation this detailed. I did it in 14 hours on 8 GPUs, including rendering."
Regarding floating-point precision, he has this to say in the FAQ found in the project's GitHub:
Quote:
"FluidX3D only uses FP32 or even FP32/FP16, in contrast to FP64. Are simulation results physically accurate?

Yes, in all but extreme edge cases. The code has been specially optimized to minimize arithmetic round-off errors and make the most out of lower precision. With these optimizations, accuracy in most cases is indistinguishable from FP64 double-precision, even with FP32/FP16 mixed-precision. Details can be found in this paper."
It was also validated against a couple of cases with analytic solutions, I believe. If you download it and try it out for yourself, you'll see a bunch of setups for Poiseuille flow, Taylor-Green vortices, a cylinder in a duct, etc.
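Such analytic cases make for simple checks because the exact field is known in closed form; the 2D Taylor-Green vortex, for instance, decays as exp(-2*nu*k^2*t). A generic sketch of that kind of comparison (with stand-in data, not FluidX3D's actual test code):

Code:
import numpy as np

def taylor_green_2d(x, y, t, nu, U0=1.0, k=1.0):
    """Analytic decaying 2D Taylor-Green vortex (incompressible)."""
    decay = np.exp(-2.0 * nu * k**2 * t)
    u = -U0 * np.cos(k * x) * np.sin(k * y) * decay
    v =  U0 * np.sin(k * x) * np.cos(k * y) * decay
    return u, v

n, nu, t = 128, 0.01, 1.0
xv = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
x, y = np.meshgrid(xv, xv, indexing="ij")
u_ref, v_ref = taylor_green_2d(x, y, t, nu)

# stand-in for a solver's output: the exact field plus a small perturbation
rng = np.random.default_rng(2)
u_num = u_ref + 1e-4 * rng.standard_normal((n, n))
v_num = v_ref + 1e-4 * rng.standard_normal((n, n))

l2_error = np.sqrt(np.mean((u_num - u_ref)**2 + (v_num - v_ref)**2))
print(f"L2 velocity error vs. analytic solution: {l2_error:.2e}")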

Let me say I have no special interest in this code, nor am I defending it; in fact, the lack of IO kind of kills it for my uses. But, like flotus, I think it is a very well-developed piece of software that might be of use in niche applications. It seems like a great tool for delivering realistic visualizations in a short period of time.

Last edited by ptpacheco; November 10, 2022 at 11:25.

November 10, 2022, 10:39   #19
Alex (flotus1), Super Moderator, Germany
Quote:
Originally Posted by ptpacheco
While I definitely agree with the 3rd and 4th points, I would counter-argue that the "extreme" optimization (if it really is as extreme as the benchmarks make it out to be) kind of removes the need for localized refinement, no? Just use a dense grid everywhere.
Absolutely not.
The F1 race car in the YouTube video is more of a best-case scenario, since the computational domain is a rectangular box after all. But even here you see the limitation of this approach: the domain is not nearly large enough to satisfy best-practice rules for external aerodynamics.
Having no grid refinement becomes even more problematic for internal aerodynamics, depending on the geometry. There are certainly cases where this approach works just fine. But there are others where grid refinement is mandatory.
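As one example of such a best-practice rule: blockage (frontal area of the model divided by the cross-sectional area of the domain) is commonly kept to a few percent or less for external aerodynamics. The dimensions below are hypothetical, just to show the check:

Code:
# Hypothetical blockage-ratio check for an external-aero setup.
# All dimensions are made up for illustration; none come from the video.

car_width, car_height = 2.0, 1.0          # m, rough frontal bounding box
domain_width, domain_height = 10.0, 5.0   # m, cross-section of the domain

blockage = (car_width * car_height) / (domain_width * domain_height)
print(f"blockage ratio: {blockage:.1%}")
# Common guidance keeps this to a few percent at most, which pushes the
# domain (and, without refinement, a uniform grid) to very large sizes.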
ptpacheco and lpz456 like this.

November 10, 2022, 12:33   #20
Paolo Lampitella (sbaffini), Senior Member, Italy
Quote:
Originally Posted by ptpacheco
While I definitely agree with the 3rd and 4th points, I would counter-argue that the "extreme" optimization (if it really is as extreme as the benchmarks make it out to be) kind of removes the need for localized refinement, no? Just use a dense grid everywhere.

And I do believe FluidX3D works across multiple GPUs, according to this comment by the solver's developer on the YouTube video:


Regarding floating-point precision, he has this to say in the FAQ found in the project's GitHub:

It was also validated against a couple of cases with analytic solutions, I believe. If you download it and try it out for yourself, you'll see a bunch of setups for Poiseuille flow, Taylor-Green vortices, a cylinder in a duct, etc.

Let me say I have no special interest in this code, nor am I defending it; in fact, the lack of IO kind of kills it for my uses. But, like flotus, I think it is a very well-developed piece of software that might be of use in niche applications. It seems like a great tool for delivering realistic visualizations in a short period of time.

Apparently, the F1 video is indeed based on a WIP multi-GPU version that is not present on GitHub, which again reinforces my sentiment about the development repository as it stands.

Besides this, reading the comments under the video, not only do you get the whole bag of uneducated comments about commercial solvers and how they suck and are slow, from people who don't even understand what they themselves are writing, but you also get this wonderful Q&A:

Q: Can your code handle shock and super sonic flows?
A: Shock waves yes, supersonic flow no

Now, I know nothing about LBM, but I think this is no less than a miracle.

Jokes aside, I also went again through the references, and what I see is that analytical cases are used as if they were validation cases. Now, this is probably an LB (community?) issue, but still a relevant one. So, I would be more cautious about the validation of the code.

In the end, the guy, his code and the papers speak for themselves. If someone doesn't understand why a top commercial unstructured multiphysics FV code, which does literally everything the user wants, doesn't have the same speed as a uniform-grid LB code for GPUs with no input, no output and no user intervention in any form (basically less than a regular CFD course project in terms of flexibility, and it only stores the variables it advances), then who am I to disagree?

This also goes beyond "simple CFD": it requires understanding the CFD business as a whole and the maintenance of the mathematical software at its core. Which, if you haven't done that as a job, you probably don't appreciate.

The real question is: why would I want to use this code? Because of some video on YouTube and its roaring title? Because it's arbitrarily fast? There is probably someone who needs exactly that, but industry needs at large are more complex.
ptpacheco and lpz456 like this.