Cache usage in CFD, and AMD 3D V-cache



Old   November 24, 2021, 15:02
Default Cache usage in CFD, and AMD 3D V-cache
  #1
New Member
 
Harris Snyder
Join Date: Aug 2018
Posts: 24
Rep Power: 7
hsnyder is on a distinguished road
I was under the impression that CFD workloads, at least finite volume methods, didn't benefit too much from cache size, but in AMD's recent Milan-X announcement, they're claiming outrageous performance improvements in OpenFOAM and Fluent from the introduction of 3D V-cache. So either I'm wrong and having lots of cache really matters, or they're using a small enough case that some substantial fraction of the decomposed domain fits into the 700+MB of cache they have. I'm curious to hear from anyone who might have deeper knowledge on this subject.
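For a sense of scale, here is a minimal back-of-the-envelope sketch; the ~1 kB of working set per cell is my own ballpark assumption for a double-precision unstructured finite-volume solver, not a measured figure:

Code:
# Rough check: how many cells' worth of data could sit in Milan-X's L3?
# ~1 kB/cell is an assumed ballpark for a double-precision unstructured
# finite-volume solver; real codes vary quite a bit around that.
BYTES_PER_CELL = 1024              # assumption
L3_PER_SOCKET = 768 * 1024**2      # Milan-X: 768 MB of L3 per socket

cells_in_cache = L3_PER_SOCKET / BYTES_PER_CELL
print(f"~{cells_in_cache / 1e6:.1f} M cells per socket fit in L3")
# -> roughly 0.8 M cells, so a production-sized case clearly does not fit,
#    but the most frequently reused data per core might.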

Old   November 25, 2021, 03:24
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
The CFD benchmarks I have seen so far seem to be honest in that regard: https://www.tomshardware.com/news/mi...n-x-benchmarks
The cases are way too big to fit into the cache entirely, and they also show results for different node counts. You can also see where the advantage of Milan-X starts to break down: the "Combustor_830m" case for Fluent only shows a minor performance increase at first, which then grows at higher node counts.
Cache size has always mattered a bit with unstructured CFD, but so far last-level caches were an order of magnitude smaller. Now we suddenly get cache sizes in the GB range for a dual-socket system, instead of MB. Intel will go a similar route in the future, presumably with slower but even larger HBM inside the CPU package.
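To illustrate why the advantage grows with node count, a hedged sketch; the cell count is taken from the case name above, and the ~1 kB-per-cell working set is only an assumption:

Code:
# Per-node share of the "Combustor_830m" case versus a dual-socket
# Milan-X node's L3. ~1 kB/cell is an assumed ballpark figure.
CELLS = 830e6                       # ~830 M cells
BYTES_PER_CELL = 1024               # assumption
L3_PER_NODE = 2 * 768 * 1024**2     # 2 sockets x 768 MB

for nodes in (2, 4, 8, 16, 32):
    per_node = CELLS * BYTES_PER_CELL / nodes
    print(f"{nodes:3d} nodes: ~{per_node / 1e9:5.0f} GB per node, "
          f"~{per_node / L3_PER_NODE:4.0f}x the node's L3")
# The working set never fits entirely, but the fraction of reused data the
# cache can hold grows as the per-node share shrinks -- hence the growing
# advantage at higher node counts.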

Last edited by flotus1; November 25, 2021 at 07:16.

Old   November 25, 2021, 10:11
Default
  #3
New Member
 
Harris Snyder
Join Date: Aug 2018
Posts: 24
Rep Power: 7
hsnyder is on a distinguished road
Interesting. What property of the combustor case is responsible for this difference? Just the outright size?

Old   November 25, 2021, 10:28
Default
  #4
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
I would think so, yes. Especially since it starts speeding up again at higher node counts.

Old   February 18, 2022, 07:44
Cool
  #5
New Member
 
Andrew
Join Date: Apr 2012
Posts: 15
Rep Power: 14
Malinator is on a distinguished road
Well, talking about caches - take it with a grain of salt, of course, but Intel recently bragged about a serious performance uplift that on-package fast memory (HBM) gives its upcoming server products.

Have a look here: they claim a 2.8x performance increase for Sapphire Rapids with HBM versus (some) current-gen Xeon or EPYC Milan. As usual, it isn't clear exactly which CPU and in what configuration; most likely one that gives the same performance as the Xeon, so probably not top of the line. More interestingly, it is 1.75x faster than Sapphire Rapids without HBM. The 28-million-cell grid in that case is not a huge one, but it is quite relevant for engineering use.
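A quick sanity check on whether a case that size could even live entirely in HBM; the ~1 kB-per-cell figure is an assumption, and the 64 GB of HBM per socket simply corresponds to the 128 GB per dual-socket node mentioned further down in this thread:

Code:
# Does a 28 M cell case fit into Sapphire Rapids' HBM?
# ~1 kB/cell is an assumed ballpark for an unstructured FV solver.
CELLS = 28e6
BYTES_PER_CELL = 1024               # assumption
HBM_PER_SOCKET = 64 * 1024**3       # 64 GB per socket (128 GB per 2P node)

case_size = CELLS * BYTES_PER_CELL
print(f"case ~{case_size / 1024**3:.0f} GiB vs "
      f"{HBM_PER_SOCKET / 1024**3:.0f} GiB of HBM per socket")
# -> ~27 GiB, i.e. the whole model could live in HBM even on one socket,
#    which would explain an uplift well beyond the DDR-only configuration.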

Old   February 19, 2022, 13:06
Default
  #6
New Member
 
Join Date: Aug 2021
Posts: 22
Rep Power: 4
Cinek_Poland is on a distinguished road
What is the effect of the L3 cache size on the performance of non-CFD simulation programs such as metal forming?

Does the simulation time of mechanical problems, such as metal forming (implicit or explicit), decrease with increasing L3 cache capacity?

https://www.youtube.com/watch?v=i0MW66Mly8E

The performance topic starts around the 30-minute mark of the Abaqus 2022 presentation, with a few examples of processor performance.

Old   March 22, 2022, 01:52
Default
  #7
New Member
 
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 9
digitalmg is on a distinguished road
Hi,
Do you think the new 3D V-Cache could affect the maximum practical core count per node?
There was a rule of thumb that (core count) / (number of memory channels) should not exceed 4, e.g. no more than 32 cores per socket for an 8-channel-memory CPU.
Could we now use 64 cores on 8 memory channels while still getting near-linear speedup?
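For reference, here is how that rule of thumb looks in numbers; the per-channel figure is plain DDR4-3200 peak bandwidth, while the idea of what a memory-bound CFD core "wants" is only a rough assumption:

Code:
# The old rule of thumb, (cores) / (memory channels) <= 4, expressed as
# bandwidth per core. DDR4-3200 peaks at 25.6 GB/s per channel; a
# memory-bound FV solver core wanting roughly 5-7 GB/s is an assumption.
GBPS_PER_CHANNEL = 25.6
CHANNELS = 8

for cores in (32, 48, 64):
    per_core = GBPS_PER_CHANNEL * CHANNELS / cores
    print(f"{cores} cores on {CHANNELS} channels: {per_core:.1f} GB/s per core")
# 32 cores -> 6.4 GB/s per core (the classic sweet spot)
# 64 cores -> 3.2 GB/s per core, i.e. cores start to starve unless the
#             V-Cache keeps enough of the working set off the DDR bus.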

Thanks
the_phew likes this.

Old   July 11, 2022, 16:26
Default
  #8
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
Quote:
Originally Posted by digitalmg View Post
Hi,
Do you think the new 3D V-Cache could affect the maximum practical core count per node?
There was a rule of thumb that (core count) / (number of memory channels) should not exceed 4, e.g. no more than 32 cores per socket for an 8-channel-memory CPU.
Could we now use 64 cores on 8 memory channels while still getting near-linear speedup?

Thanks
I'm resurrecting this thread because this is a very good question that has huge implications for CFD clusters going forward. I'm also thinking about Intel's Sapphire Rapids+HBM: since HBM2e has ~4x the memory bandwidth per socket vs. 8 channels of DDR4, one might surmise that you could get linear scaling with ~4x the cores per socket of current-gen HPC processors.

Since current Xeons typically get linear scaling up to ~24 cores per socket for most CFD workloads, that would mean Xeons with HBM2e could scale linearly up to ~96 cores per socket. Unfortunately, Sapphire Rapids will top out at 60 cores per chip, so while Sapphire Rapids+HBM will have over 2x the memory bandwidth per socket of EPYC Genoa (even with its 12 channels of DDR5), Sapphire Rapids+HBM Xeons may not actually have enough CPU horsepower to make use of all that memory bandwidth for CFD workloads. A 2P 64-core Genoa (or especially Genoa-X) node could well perform very similarly to a 2P 60-core Sapphire Rapids-HBM node, even for simulations that fit within the 128 GB of HBM.
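Putting that surmise into numbers, as a minimal sketch; both inputs are the figures quoted above, so the result is only as good as they are:

Code:
# Bandwidth-limited core count if HBM2e really delivers ~4x the per-socket
# bandwidth of 8-channel DDR4, and linear scaling on DDR4 ends at ~24 cores.
CORES_LINEAR_DDR4 = 24      # approx. linear-scaling limit quoted above
HBM_BANDWIDTH_RATIO = 4.0   # ~4x per-socket bandwidth vs. 8ch DDR4 (quoted)

cores_linear_hbm = CORES_LINEAR_DDR4 * HBM_BANDWIDTH_RATIO
print(f"bandwidth-limited scaling up to ~{cores_linear_hbm:.0f} cores per socket")
# -> ~96 cores, while Sapphire Rapids tops out at 60 cores per chip, so the
#    cores, not the HBM, would be the limiting factor there.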

I'm anxious to see the 3rd-party Sapphire Rapids+HBM vs. Genoa-X CFD benchmarks in early 2023. Despite VERY different approaches, it wouldn't surprise me if the end result is quite similar. Throw GPU CFD into the mix, and things will get really interesting; with the next generation of HPC CPUs, the memory bandwidth gulf between CPUs and GPUs will shrink dramatically, so GPU vs. CPU will most likely come down to things like licensing costs, hardware costs, simulation memory requirements, power/space efficiency, etc.

Old   July 11, 2022, 17:07
Default
  #9
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
since HBM2e has ~4x the memory bandwidth per socket vs. 8 channels of DDR4, one might surmise that you could get linear scaling with ~4x the cores per socket of current-gen HPC processors.
As far as my understanding goes, this is the best-case scenario under ideal conditions, namely all data fitting into HBM.
With real-world applications, some of the data has to come from slower memory at some point, so performance/scaling will be somewhat lower. How much exactly is the exciting part, and it will probably also depend on the application and model size.
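One way to picture that, as an illustrative sketch only; the bandwidth figures and traffic splits below are assumptions, not measurements:

Code:
# Effective bandwidth when a fraction of the memory traffic misses HBM and
# has to come from DDR5. All numbers are illustrative assumptions.
HBM_GBPS = 800.0    # assumed per-socket HBM bandwidth (~4x 8ch DDR4)
DDR_GBPS = 300.0    # assumed per-socket 8-channel DDR5 bandwidth

def effective_bandwidth(hbm_fraction):
    """Time per byte is a weighted sum of the two tiers (harmonic mix)."""
    time_per_gb = hbm_fraction / HBM_GBPS + (1.0 - hbm_fraction) / DDR_GBPS
    return 1.0 / time_per_gb

for f in (1.0, 0.9, 0.7, 0.5):
    print(f"{f:4.0%} of traffic from HBM -> ~{effective_bandwidth(f):3.0f} GB/s effective")
# Even a modest DDR share pulls the effective figure well below the HBM peak.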

Quote:
Despite VERY different approaches, it wouldn't surprise me if the end result is quite similar
It's a showdown between larger but slower HBM on one side, and smaller but faster last-level caches (which should also have lower latency) on the other, with the caches backed by higher memory bandwidth as the next tier.
My guess is that there will be noticeable differences depending on the application, and on whether a code gets architecture-specific optimizations.
Anyway, we can look forward to quite substantial performance increases over current-gen architectures with both approaches.
I am more worried about pricing and availability. Neither of these will be cheap, there is going to be product segmentation, and hyperscalers will take the first bites.
the_phew likes this.

Last edited by flotus1; July 12, 2022 at 03:47.

Old   July 12, 2022, 09:27
Default
  #10
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
My guess is that there will be noticeable differences depending on the application, and on whether a code gets architecture-specific optimizations.
Indeed; for example, the Ryzen 5800X3D (with its 'large for a desktop CPU' L3 cache) fails to beat the Ryzen 5800X (with 1/3rd the L3 cache but higher clocks) for most productivity (non-game) workloads under Windows 10/11. But in Linux it's a completely different story, with sometimes-incredible speedup for CAD/FEA/CFD workloads thanks to the extra cache. Windows clearly doesn't know what to do with that extra cache for any application that isn't a game.

Quote:
Originally Posted by flotus1 View Post
I am more worried about pricing and availability. Neither of these will be cheap, there is going to be product segmentation, and hyperscalers will take the first bites.
Yeah, as much as we'd all love to pick the perfect hardware for our CFD workloads, in reality we'll be constrained by what we can actually get our hands on. It took 4-6 months after the 'launches' of EPYC Milan and the Xeon 'Cascade Lake Refresh' before regular Joes could actually buy them at retail. Milan-X processors are still hard to come by for anywhere near MSRP, Threadripper Pro 5000WX is STILL unobtainium four months after 'launch', yet AMD says Genoa is 'launching' imminently. It's gotten to the point where the hyperscalers are always a CPU generation ahead of the rest of the market.

Old   July 13, 2022, 08:51
Default
  #11
New Member
 
Daniel
Join Date: Jun 2010
Posts: 12
Rep Power: 15
DVSoares is on a distinguished road
Jumping in just to highlight two things:
- The 5800X3D result in the benchmarking thread here is the fastest single-core run I could find; depending on what one plans to use it for, that is a beast.
- Yet another Phoronix test with AMD's latest large-cache server chip (the 7773X) was recently published, https://www.phoronix.com/scan.php?pa...3x-redux&num=3, and deserves a look (with the usual grain of salt).
Cheers!
