Cache usage in CFD, and AMD 3D V-cache

Old   November 24, 2021, 16:02
Default Cache usage in CFD, and AMD 3D V-cache
  #1
New Member
 
Harris Snyder
Join Date: Aug 2018
Posts: 24
Rep Power: 6
hsnyder is on a distinguished road
I was under the impression that CFD workloads, at least finite volume methods, didn't benefit too much from cache size, but in AMD's recent Milan-X announcement, they're claiming outrageous performance improvements in OpenFOAM and Fluent from the introduction of 3D V-cache. So either I'm wrong and having lots of cache really matters, or they're using a small enough case that some substantial fraction of the decomposed domain fits into the 700+MB of cache they have. I'm curious to hear from anyone who might have deeper knowledge on this subject.

Old   November 25, 2021, 04:24
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,168
Rep Power: 43
flotus1 has a spectacular aura about
The CFD benchmarks I have seen so far seem to be honest in that regard https://www.tomshardware.com/news/mi...n-x-benchmarks
The cases are far too big to fit entirely into the cache, and results are shown for different numbers of nodes. You can also see where the advantage of Milan-X starts to break down: the "Combustor_830m" case for Fluent shows only a minor performance increase, which then grows at higher node counts.
Cache size always mattered a bit with unstructured CFD, but so far, last level caches were an order of magnitude smaller. Now we suddenly get cache sizes in the GB range for a dual-socket system, instead of MB. Intel will go a similar route in the future, presumably with slower but even larger HBM inside the CPU package.
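To make the node-count argument concrete, here is a rough back-of-the-envelope sketch of how much of each node's share of a case could sit in last-level cache. Everything numeric in it is an assumption for illustration: the cell count is guessed from the case name, the bytes-per-cell footprint is a placeholder, and the cache size is simply twice the "700+MB" per socket mentioned above. It is not a model of how Fluent or OpenFOAM actually use the cache.

```python
def cached_fraction(total_cells, bytes_per_cell, nodes, cache_bytes_per_node):
    """Fraction of one node's share of the working set that fits in its L3."""
    working_set_per_node = total_cells * bytes_per_cell / nodes
    return min(1.0, cache_bytes_per_node / working_set_per_node)


# All values below are assumptions for illustration only.
CELLS = 830e6                # assumed cell count, guessed from "Combustor_830m"
BYTES_PER_CELL = 1000        # assumed solver memory footprint per cell
CACHE_PER_NODE = 2 * 768e6   # assumed L3 per dual-socket node ("700+MB" per socket)

for nodes in (1, 2, 4, 8, 16):
    frac = cached_fraction(CELLS, BYTES_PER_CELL, nodes, CACHE_PER_NODE)
    print(f"{nodes:2d} nodes: {frac:.2%} of the per-node working set fits in L3")
```

The cached share doubles every time the node count doubles, which is at least consistent with the largest case gaining more from the extra cache as nodes are added.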

Last edited by flotus1; November 25, 2021 at 08:16.

Old   November 25, 2021, 11:11
Default
  #3
New Member
 
Harris Snyder
Join Date: Aug 2018
Posts: 24
Rep Power: 6
hsnyder is on a distinguished road
Interesting. What property of the combustor case is responsible for this difference? Just the outright size?

Old   November 25, 2021, 11:28
Default
  #4
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,168
Rep Power: 43
flotus1 has a spectacular aura about
I would think so, yes. Especially since it starts speeding up again at higher node counts.

Old   February 18, 2022, 08:44
Cool
  #5
New Member
 
Andrew
Join Date: Apr 2012
Posts: 13
Rep Power: 12
Malinator is on a distinguished road
Well, speaking of caches: take it with a grain of salt, of course, but Intel recently bragged about a serious performance uplift that on-package fast memory (HBM) gives its upcoming server products.

Have a look here: it claims a 2.8x performance increase for Sapphire Rapids with HBM versus (some) current-gen Xeon or EPYC Milan. As usual, it isn't clear which exact CPU was used and in what configuration; most likely one that gives the same performance as the Xeon, so probably not top of the line. More interestingly, it is 1.75x faster than Sapphire Rapids without HBM. The 28-million-cell grid in that case is not huge, but it is quite relevant for engineering use.

Old   February 19, 2022, 14:06
Default
  #6
New Member
 
Join Date: Aug 2021
Posts: 18
Rep Power: 3
Cinek_Poland is on a distinguished road
What is the effect of the L3 cache size on the performance of non-CFD simulation programs such as metal forming?

Does the simulation time of mechanical problems, such as metal forming (implicit or explicit), decrease with increasing L3 cache capacity?

https://www.youtube.com/watch?v=i0MW66Mly8E

The performance topic starts at around the 30-minute mark of the Abaqus 2022 presentation, with a few examples of processor performance.

Old   March 22, 2022, 02:52
Default
  #7
New Member
 
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 8
digitalmg is on a distinguished road
Hi,
Do you think the new 3D V-Cache could affect the maximum sensible core count per node?
There was a rule of thumb that (core count) / (number of memory channels) should not be greater than 4, e.g. no more than 32 cores per socket for a CPU with 8 memory channels.
Now, could we use 64 cores on 8 memory channels while still getting linear scale-up?
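
For reference, the rule of thumb above can be written down as a tiny check. This is only a sketch of the heuristic as stated here; the function names are made up for illustration, and the ratio of 4 is the rule-of-thumb value, not a measured limit. It deliberately ignores cache size, which is exactly the open question.

```python
def max_cores_per_socket(memory_channels, cores_per_channel=4):
    """Largest core count the rule of thumb allows for a given channel count."""
    return memory_channels * cores_per_channel


def within_rule_of_thumb(cores, memory_channels, cores_per_channel=4):
    """True if (core count) / (memory channels) does not exceed the ratio."""
    return cores / memory_channels <= cores_per_channel


print(max_cores_per_socket(8))        # 32
print(within_rule_of_thumb(32, 8))    # True
print(within_rule_of_thumb(64, 8))    # False
```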

Thanks

Old   July 11, 2022, 17:26
Default
  #8
Member
 
Matt
Join Date: May 2011
Posts: 32
Rep Power: 13
the_phew is on a distinguished road
Quote:
Originally Posted by digitalmg View Post
Hi,
Do you think the new 3D V-Cache could affect the maximum sensible core count per node?
There was a rule of thumb that (core count) / (number of memory channels) should not be greater than 4, e.g. no more than 32 cores per socket for a CPU with 8 memory channels.
Now, could we use 64 cores on 8 memory channels while still getting linear scale-up?

Thanks
I'm resurrecting this thread because this is a very good question that has huge implications for CFD clusters going forward. I'm also thinking about Intel's Sapphire Rapids+HBM: since HBM2e has ~4x the memory bandwidth per socket vs. 8 channels of DDR4, one might surmise that you could get linear scaling with ~4x the cores per socket of current-gen HPC processors.

Since current Xeons typically get linear scaling up to ~24 cores per socket for most CFD workloads, that would mean Xeons with HBM2e could scale linearly up to ~96 cores per socket. Unfortunately, Sapphire Rapids will top out at 60 cores per chip, so while Sapphire Rapids+HBM will have over 2x the memory bandwidth per socket of EPYC Genoa (even with its 12 channels of DDR5), Sapphire Rapids+HBM Xeons may not actually have enough CPU horsepower to make use of all that memory bandwidth for CFD workloads. A 2P 64-core Genoa (or especially Genoa-X) node could well perform very similarly to a 2P 60-core Sapphire Rapids-HBM node, even for simulations that fit within the 128 GB of HBM.
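
The arithmetic behind that estimate, as a small sketch. It assumes the linear-scaling core count grows in direct proportion to memory bandwidth, which is the surmise above and not an established result; the inputs (24 cores, ~4x bandwidth, 60 cores available) are the rough figures from this post, and the function names are illustrative.

```python
def linear_scaling_limit(baseline_cores, bandwidth_ratio):
    """Cores that could scale linearly if the limit tracked memory bandwidth."""
    return baseline_cores * bandwidth_ratio


def usable_cores(baseline_cores, bandwidth_ratio, cores_available):
    """Cores actually usable: capped by what the chip physically offers."""
    return min(cores_available, linear_scaling_limit(baseline_cores, bandwidth_ratio))


limit = linear_scaling_limit(24, 4)   # ~24 cores today, ~4x bandwidth with HBM2e
usable = usable_cores(24, 4, 60)      # Sapphire Rapids tops out at 60 cores
print(f"bandwidth-limited ceiling: ~{limit} cores per socket")
print(f"cores available: {usable} ({usable / limit:.1%} of that ceiling)")
```

Under this simple proportionality assumption, the core count rather than the memory bandwidth would be the limit on such a chip.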

I'm anxious to see the 3rd-party Sapphire Rapids+HBM vs. Genoa-X CFD benchmarks in early 2023. Despite VERY different approaches, it wouldn't surprise me if the end result is quite similar. Throw GPU CFD into the mix, and things will get really interesting; with the next generation of HPC CPUs, the memory bandwidth gulf between CPUs and GPUs will shrink dramatically, so GPU vs. CPU will most likely come down to things like licensing costs, hardware costs, simulation memory requirements, power/space efficiency, etc.

Old   July 11, 2022, 18:07
Default
  #9
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,168
Rep Power: 43
flotus1 has a spectacular aura about
Quote:
since HBM2e has ~4x the memory bandwidth per socket vs. 8 channels of DDR4, one might surmise that you could get linear scaling with ~4x the cores per socket of current-gen HPC processors.
As far as my understanding goes, this is the best-case scenario under ideal conditions, namely all data fitting into HBM.
With real-world applications, some of the data has to come from slower memory at some point, so performance and scaling will be somewhat lower. How much exactly is the exciting part, and it will probably also depend on the application and model size.
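
A simple way to picture this is a two-tier bandwidth model. This is a sketch under strong assumptions: transfers from the two tiers do not overlap, latency is ignored, and bandwidths are in relative units with DDR set to 1 and HBM set to the ~4x figure quoted above. The function name and the traffic fractions are made up for illustration.

```python
def effective_bandwidth(hbm_fraction, hbm_bw, ddr_bw):
    """Average bandwidth when a fraction of the traffic is served from HBM.

    Time per byte is the traffic-weighted sum of the time per byte of each
    tier, so the result is a weighted harmonic mean of the two bandwidths.
    """
    if not 0.0 <= hbm_fraction <= 1.0:
        raise ValueError("hbm_fraction must be between 0 and 1")
    time_per_byte = hbm_fraction / hbm_bw + (1.0 - hbm_fraction) / ddr_bw
    return 1.0 / time_per_byte


DDR_BW = 1.0   # relative units (assumption)
HBM_BW = 4.0   # the ~4x figure quoted above (assumption)

for fraction in (1.0, 0.9, 0.75, 0.5):
    bw = effective_bandwidth(fraction, HBM_BW, DDR_BW)
    print(f"{fraction:.0%} of traffic from HBM: {bw:.2f}x DDR bandwidth")
```

In this model the slow tier dominates quickly, because time spent waiting on the slower memory adds up faster than the share of data it serves.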

Quote:
Despite VERY different approaches, it wouldn't surprise me if the end result is quite similar
It's a showdown between larger but slower HBM on one side, and smaller but faster last-level caches, which should also have lower latency, on the other. The latter will also be backed by higher memory bandwidth as the next tier.
My guess is that there will be noticeable differences depending on applications. And whether a code gets some architecture-specific optimizations.
Anyway, we can look forward to quite substantial performance increases compared to current-gen architectures with both approaches.
I am more worried about pricing and availability. Neither of these will be cheap, there is going to be product segmentation, and hyperscalers will take the first bites.

Last edited by flotus1; July 12, 2022 at 04:47.

Old   July 12, 2022, 10:27
Default
  #10
Member
 
Matt
Join Date: May 2011
Posts: 32
Rep Power: 13
the_phew is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
My guess is that there will be noticeable differences depending on applications. And whether a code gets some architecture-specific optimizations.
Indeed; for example, the Ryzen 5800X3D (with its 'large for a desktop CPU' L3 cache) fails to beat the Ryzen 5800X (with 1/3rd the L3 cache but higher clocks) for most productivity (non-game) workloads under Windows 10/11. But in Linux it's a completely different story, with sometimes-incredible speedup for CAD/FEA/CFD workloads thanks to the extra cache. Windows clearly doesn't know what to do with that extra cache for any application that isn't a game.

Quote:
Originally Posted by flotus1 View Post
I am more worried about pricing and availability. Neither of these will be cheap, there is going to be product segmentation, and hyperscalers will take the first bites.
Yeah, as much as we'd all love to pick the perfect hardware for our CFD workloads, in reality we'll be constrained by what we can actually get our hands on. It was 4-6 months between the 'launches' of EPYC Milan and Xeon 'Cascade Lake Refresh' and when regular Joes could actually buy them at retail. Milan-X processors are still hard to come by for anywhere near MSRP, Threadripper Pro 5000WX is STILL unobtainium four months after 'launch', yet AMD says Genoa is 'launching' imminently. It's gotten to the point where the hyperscalers are always a CPU generation ahead of the rest of the market.

Old   July 13, 2022, 09:51
Default
  #11
New Member
 
Daniel
Join Date: Jun 2010
Posts: 11
Rep Power: 14
DVSoares is on a distinguished road
Jumping in just to highlight two things:
- The 5800X3D result in the benchmarking thread here is the fastest single-core run I could find. Depending on the work one plans to do, that is a beast.
- Yet another Phoronix test of the latest AMD huge-cache server chip (7773X) was recently published https://www.phoronix.com/scan.php?pa...3x-redux&num=3 and may deserve a look (take it with a grain of salt).
Cheers!
