CFD Online Discussion Forums - General recommendations for CFD hardware [WIP]

*WORK IN PROGRESS*

The goal of this thread is to cover most of the basics that come up in almost every CFD hardware recommendation topic. There will be some examples with hardware available as of early 2021, but these are mostly to illustrate some of the points that are being made. It is specifically NOT a recommendation for any of the hardware mentioned, as new and sometimes better hardware gets released all the time.
This thread should give you an idea what to look for in a CFD workstation or cluster node. The general concepts presented should allow you to make an informed decision, even with new hardware not mentioned here. And to be frank: it keeps me from sounding like a broken record, by simply linking this thread instead of explaining the same ideas over and over again. Laziness brought me here ;)
Without further ado:

0. Checklist before posting
When posting a question about buying a new computer, please try to answer as many of these questions as you can in the first post:

Which software do you intend to use?
Are you limited by license constraints? I.e. does your software license only allow you to run on N threads?
What type of simulations do you want to run? And what's the maximum cell count?
If there is a budget, how high is it?
What kind of setting are you in? Hobbyist? Student? Academic research? Engineer?
Where can you source your new computer? Buying a complete package from a large OEM? Assemble it yourself from parts? Are used parts an option?
Which part of the world are you from? It's cool if you don't want to tell, but since prices and availability vary depending on the region, this can sometimes be relevant. Particularly if it's not North America or Europe.
Anything else that people should know to help you better?

Links:
Buyers guide for AMD Epyc Genoa: https://www.cfd-online.com/Forums/ha...guide-cfd.html

1. CPU - solver performance
The most central piece of every computer indeed. Or is it?
The initial reflex might be to just buy the newest processor, with the highest core count and frequency. While that might get you the highest solver performance, you are probably over-spending. Or maybe you have a limited budget anyway...
Most CFD solvers tend to have a low computational intensity. This is defined as the number of floating point operations, divided by the amount of data transferred from and to RAM. Which means that in order to keep a large number of cores fed with enough data, the CPU also needs enough memory bandwidth. Otherwise, you end up with a memory bandwidth bottleneck, where running more and more threads does not increase performance. Memory bandwidth is the product of memory transfer rate (e.g. DDR4-3200) and the number of memory channels.
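To put rough numbers on that last sentence, here is a minimal sketch of the theoretical peak bandwidth. It assumes the usual 64-bit (8-byte) data bus per memory channel. The function name is made up for illustration, and sustained real-world bandwidth is lower than this peak.

```python
def peak_bandwidth_gbs(transfer_rate_mts: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfer_rate_mts: the number in the DDR name, e.g. 3200 for DDR4-3200
    channels:          populated memory channels
    bus_bytes:         bytes per transfer per channel (assumed 8, i.e. 64 bit)
    """
    return transfer_rate_mts * 1e6 * bus_bytes * channels / 1e9


for channels in (2, 4, 8):
    bandwidth = peak_bandwidth_gbs(3200, channels)
    print(f"DDR4-3200, {channels} channels: {bandwidth:.1f} GB/s")
```

Doubling the number of channels doubles the peak figure, which is why the channel count matters as much as the speed bin.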
Time for some actual numbers, taken from this thread: https://www.cfd-online.com/Forums/ha...-hardware.html
[Attachment 82913: chart of normalized solver performance (y-axis) versus number of threads (x-axis)]
Here we have a few popular choices for CPUs, ranging from a latest-gen mainstream CPU (AMD Ryzen 5600x) up to server-grade hardware (2x AMD Epyc 7302). The same benchmark was run with an increasing number of threads, shown on the x-axis. The y-axis indicates normalized solver performance. Normalization was done with the fastest single-core result from the Ryzen 5600x. Here are some of the conclusions we can draw from this chart, for various classes of CPUs:

Mainstream CPUs with 2 memory channels (Ryzen 5600x): highest single-core performance, but quickly falls behind the more threads are used, because the memory subsystem can't keep up. At 6 threads, it has been overtaken by every other entry in the chart. We can further extrapolate that there will be some more scaling up to 8 threads, but not beyond that. Which means that variants of these CPUs with 12 or even 16 cores are a waste of money for our purpose.
Entry-level HEDT parts with 4 memory channels (I7-9800x): much better balance of floating point and memory performance. Starts out slower, but ultimately beats the much newer and more expensive mainstream CPU at 6 or more threads.
High-end HEDT parts with 4 memory channels (TR-3960x): A textbook example of a memory bandwidth bottleneck. There are just way too many cores for only 4 memory channels.
Older dual-socket server CPUs with 2x4 memory channels (Xeon E5-2673v3): the budget-friendly choice for high solver performance. Thanks to 2x4 memory channels -albeit at a lower memory frequency- this type of hardware can still compete with newer and more expensive hardware.
Top-of-the-line server CPUs with 2x8 memory channels (Epyc 7302): gets beaten by some of the other choices for low thread counts. But the abundance of memory bandwidth allows linear scaling up to 16 cores, resulting in superior performance for 8 threads and above. Also note that scaling is no longer linear above 16 threads. This is partially caused by the unconventional chiplet approach, but also hints at an important conclusion: while these CPUs are available with up to 64 cores, these high core count models are a waste of money for CFD. They would run into the same memory bandwidth bottleneck that the TR-3960x encounters.

Conclusion: core count and number of memory channels should be balanced. The general rule of thumb is 2-4 CPU cores per memory channel. Aim for the lower end if your solver has an expensive per-core licensing scheme. After all, you want to get the most out of these expensive licenses. For software with low or no licensing cost, you can get a few more cores. They will be less effective, but solver times will still be lower overall.
Also: 2 CPUs are usually better than one, because the memory controller resides within the CPU package. By choosing e.g. two 32-core CPUs instead of a single 64-core CPU, you get effectively twice the memory bandwidth.
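The rule of thumb can be applied mechanically. The following is only a sketch: the class and function names are invented for illustration, the core counts come from public spec sheets rather than from the benchmark thread, and the last two entries are generic stand-ins for the one-versus-two-CPU comparison above.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    sockets: int
    cores_per_socket: int
    channels_per_socket: int

    @property
    def cores_per_channel(self) -> float:
        # The ratio is the same per socket and for the whole machine.
        return self.cores_per_socket / self.channels_per_socket


def verdict(system: System, low: float = 2.0, high: float = 4.0) -> str:
    """Classify a system against the 2-4 cores per channel rule of thumb."""
    ratio = system.cores_per_channel
    if ratio > high:
        return "likely bandwidth-starved"
    if ratio < low:
        return "bandwidth to spare"
    return "balanced"


systems = [
    System("Ryzen 5600x", 1, 6, 2),
    System("I7-9800x", 1, 8, 4),
    System("TR-3960x", 1, 24, 4),
    System("2x Xeon E5-2673v3", 2, 12, 4),
    System("2x Epyc 7302", 2, 16, 8),
    System("1x 64-core, 8 channels", 1, 64, 8),
    System("2x 32-core, 8 channels each", 2, 32, 8),
]

for s in systems:
    total_cores = s.sockets * s.cores_per_socket
    print(f"{s.name}: {total_cores} cores, "
          f"{s.cores_per_channel:.1f} cores/channel -> {verdict(s)}")
```

The ratio ignores memory speed, cache sizes and architecture, so treat the verdict as a first filter and not as a replacement for benchmarks.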

2. System memory/RAM
While the choice of memory is closely tied to the CPU, two questions usually come up when discussing memory loadout: capacity and memory type.
I will try my best to keep this as short as possible. If you want to know more, this is an excellent starting point: https://frankdenneman.nl/2015/02/18/...y-blog-series/

2a. Memory capacity
How much RAM you need mostly depends on two factors: maximum model size (i.e. how many cells the mesh consists of) and the solver type.
General purpose CFD software like Ansys Fluent requires in the range of 1-4 GB of RAM per million cells. SIMPLE solver in single precision marks the lower end of that range, coupled solver in double precision sits at the higher end.
This should give you a rough idea how much memory is enough. There are always exceptions of course. If you are not sure about your specific application, you can run one or two smaller cases on hardware you already have, and extrapolate to the cell counts you want to run with your new machine.
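Both approaches fit in a few lines. This is a rough sketch: the 1-4 GB per million cells range is the one quoted above, while the linear extrapolation and the function names are assumptions made for illustration. The measured values in the example are made up.

```python
def ram_range_gb(million_cells: float,
                 gb_per_million_low: float = 1.0,
                 gb_per_million_high: float = 4.0) -> tuple:
    """Lower and upper RAM estimate for a given mesh size."""
    return (million_cells * gb_per_million_low,
            million_cells * gb_per_million_high)


def extrapolate_ram_gb(measured_gb: float,
                       measured_million_cells: float,
                       target_million_cells: float) -> float:
    """Scale a measured memory footprint to a larger mesh.

    Assumes memory grows linearly with cell count, which is only an
    approximation (fixed overhead and partitioning are ignored).
    """
    return measured_gb / measured_million_cells * target_million_cells


low, high = ram_range_gb(20)
print(f"20 million cells: roughly {low:.0f}-{high:.0f} GB")

# Hypothetical measurement: a 2 million cell test case used 6 GB.
print(f"Extrapolated to 20 million cells: {extrapolate_ram_gb(6.0, 2.0, 20.0):.0f} GB")
```

Leave some headroom on top of either number for the operating system, pre- and post-processing.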
And yes, you actually need enough memory to run your models properly. Unlike in some FEA solvers, out-of-core execution (persistent storage used to extend RAM) is not really a thing in CFD. Even the fastest SSDs are an order of magnitude slower than RAM. Avoid it at all costs.
Big caveat here: there is a lower limit for the amount of memory, dictated by the CPU and memory controller. Let's say you get a CPU with an 8-channel memory controller like an AMD Epyc 7302. To make use of these 8 memory channels, you need at least 8 DIMMs. The smallest compatible DIMMs -more on that later- come in 8GB capacity. Which means that 64GB, populated as 8x8GB, is the lower limit for such a CPU. Double that if you opt for two CPUs.
Big OEMs and system integrators can be oblivious to this, so check your quote very carefully before pulling the trigger. You really don't want your new 15000$ CFD workstation with two high-end processors choked by single-channel memory.
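The lower limit and the quote check can be sketched as follows. The function names are hypothetical, the 8GB smallest-DIMM size is the example value from above, and the check assumes the DIMMs end up in the correct slots.

```python
def minimum_ram_gb(sockets: int, channels_per_socket: int,
                   smallest_dimm_gb: int = 8) -> int:
    """Smallest capacity that still puts one DIMM on every memory channel."""
    return sockets * channels_per_socket * smallest_dimm_gb


def populates_all_channels(dimm_count: int, sockets: int,
                           channels_per_socket: int) -> bool:
    """True if the DIMM count allows an even population of every channel."""
    total_channels = sockets * channels_per_socket
    return dimm_count >= total_channels and dimm_count % total_channels == 0


print(minimum_ram_gb(1, 8))  # single 8-channel CPU
print(minimum_ram_gb(2, 8))  # two 8-channel CPUs

# A quote offering 4x32GB for a dual-socket, 8-channel-per-CPU machine:
print(populates_all_channels(4, sockets=2, channels_per_socket=8))
```

A quote that fails this check has enough capacity on paper but leaves memory channels empty.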

2b. Memory type
The amount of options and nuances here might be overwhelming. But if your goal is just a working system, as opposed to breaking world records, a few simple rules are enough. Your CPU choice dictates which memory you need.

Server CPUs like AMD Epyc 7xxx or Intel Xeon Platinum/Gold/W should be paired with registered ECC memory (also called reg ECC or ECC RDIMM). There might be some rare occasions where unbuffered (UDIMM) might work, but let's leave that to adventurous folks. Memory transfer rate (DDR4-2666, DDR4-3200 etc.) is also determined by the CPU. The product page will list the maximum supported transfer rate, so stick to that.
If you need an extraordinarily large amount of memory, which cannot be achieved by filling all DIMM slots with the largest RDIMM available, you can switch to load-reduced LRDIMM.
Virtually all other CPUs need unbuffered (UDIMM) memory. Exceptions confirm the rule, but are usually compatible with both types. Again, the product page for your CPU of choice lists the supported transfer rate, which is your first clue which speed bin is right for you. But in contrast to server CPUs, this spec is usually the minimum guaranteed frequency. As we saw in chapter 1, memory bandwidth is a key factor to high performance. So if you happened to buy a CPU and a motherboard that support memory overclocking, getting memory beyond the minimum guaranteed frequency is an easy way to squeeze some more performance out of your CFD workstation.
Of course, it must be mentioned that this is technically overclocking, albeit a very easy one thanks to XMP profiles, with very little risk of damaging your hardware. So always check for system stability. And if millions of dollars or even people's lives depend on the correctness of your results, maybe stick to guaranteed specifications instead.

One question remains with non-server CPUs: error checking and correction (ECC). Only some combinations of CPUs and motherboards do support it officially. Some have unofficial support, i.e. the CPU manufacturer did not disable the feature intentionally, and leaves implementation up to the motherboard manufacturers. And others don't support it at all. Side-note: Unbuffered ECC memory works on most platforms that do support UDIMM. You just don't get the ECC feature.
Let's assume you bought a platform that supports ECC, either officially or through board partners. Ask yourself the question: how often can you afford to get into the office in the morning, just to realize that your simulation has failed for no apparent reason? If the answer is not at all, you probably want ECC. But then again, you probably also want redundant power supplies, a UPS to protect against short power outages, redundancy for your storage etc.
In practice, the decision for ECC memory with non-server CPUs should come first. Because it dictates which CPUs and motherboards you can get.

3. Graphics card/GPU
Graphics cards can serve two distinct purposes in a CFD workstation: render the image on the screen, and help with the computations.

3a. Graphics cards as a display/rendering device
Hardware requirements for this aren't particularly high. Even an otherwise high-end CFD workstation doesn't necessarily need a high-end graphics card.
The most important specification is the amount of memory on the graphics card. As of 2021, I consider 4GB as the absolute minimum, and recommend at least 8GB. Even more can be helpful if you need to render complex scenes for meshes in the order of 50 million cells or more. Without enough VRAM, one of two things will happen: either performance while interacting with the scene drops to unacceptable levels, or the program stops working entirely.
In contrast, if the GPU core of the graphics card just isn't very fast (i.e. you saved money by buying an entry-level or midrange card with enough VRAM), interacting with the model will just be slower than optimal. I do consider this an acceptable compromise when on a budget.
"Professional" or "Gaming" graphics cards: There is usually no point in spending more money on a graphics card from the professional lines.
They are made with the same chips as the consumer cards, which means similar performance and feature sets. The main differences are the drivers, which can make a performance difference in some CAD programs. Most professional applications come with a list of recommended or tested GPUs, and the only GPUs on these lists are from the professional lines. So if you need absolutely guaranteed compatibility for all features, and support if something doesn't work as intended, stick to the SKUs on the list. But these days, a GPU not being on the list of "compatible" devices doesn't necessarily mean things won't work as intended.
Integrated graphics: with this chapter being written during the great graphics card shortage of 2020/2021, a word on integrated graphics.
Some mainstream CPUs come with a GPU integrated into the CPU. While they cannot replace a decent graphics card, you can get surprisingly far with these, for the same reasons mentioned above: a graphics card doesn't need to be high-end, it mostly needs to have enough memory.
But be warned: since the integrated GPU gets its VRAM from system memory, and also shares memory bandwidth with the CPU cores, doing anything graphically demanding, at the same time as solving a CFD problem, will tank performance. And you won't have all system memory available.

3b. Graphics cards and GPUs to accelerate CFD computations
GPU acceleration is still a topic with many caveats and pitfalls. See also: https://www.cfd-online.com/Forums/ha...ys-fluent.html
In theory, GPUs have much higher raw floating point performance and memory bandwidth compared to CPUs, which is the whole appeal of GPU acceleration.
In practice, leveraging these capabilities for CFD is not trivial. To put it bluntly: with a limited hardware budget, GPU acceleration should not be a priority. Focus on CPU performance instead.
If you are still determined to leverage GPU acceleration for your CFD workstation, you need to do your own research. Important points to answer before buying a GPU for computing are:

Does your CFD package support GPU acceleration?
Do the solvers you intend to run benefit from GPU acceleration?
Are your models small enough to fit into GPU memory? Or vice versa, how much VRAM would you need to run your models?
Does GPU acceleration for your code work via CUDA (Nvidia only) or OpenCL (both AMD and Nvidia)?
Which GPUs are allowed for GPU acceleration? Some commercial software comes with whitelists for supported GPUs for acceleration, and refuses to work with other GPUs.
Single or double precision? All GPUs have tons of single precision floating point performance. But especially for Nvidia, only a few GPUs at the very top end also have noteworthy double precision floating point performance.

In addition to these points, my personal opinion on the current state of this matter: GPU acceleration for commercial CFD packages is artificially made viable with the licensing scheme. Using very few additional license tokens for adding GPUs, compared to adding more CPUs, skews the scale towards GPUs. Without this trick, GPU acceleration for commercial CFD codes would be even more of a niche than it is today. This also means that for CFD codes without license fees, making GPU acceleration viable is even harder.

That's all for now, more to come. But if I don't start somewhere, I'll never get this done.
If you have any suggestions how to structure this article, I really want to hear it. It is not intended as a deep-dive into all the nuances and edge-cases, but rather to cover 90% of the questions and misconceptions that come up regularly.
And of course, contributions or ideas for topics are welcome.
Note to moderators: it would be nice if I could keep editing privileges to this post for a longer period of time, as I don't know when I will get around to adding more stuff. And maybe when it's polished enough at some point, we could pin it to the top.

Changelog
22.02.2021: thread started with chapter 1 on CPU solver performance
23.02.2021: added chapter 2 on memory
25.02.2021: added chapter 0, a checklist for posting questions. And added this changelog :D
11.11.2021: added chapter 3 on GPUs, pinned the thread
09.11.2023: started a section with links to other useful threads

Reserved for advanced topics

Beware of x16 memory modules
It's a complicated topic, but the gist of it: x16 DIMMs come with higher latency and lower bandwidth compared to the regular x8 and x4 DIMMs.
This has not been a widespread issue with DDR4 memory. Only some of the cheapest laptop memory (the kind of stuff that OEMs use) was x16.
But with DDR5, x16 DIMMs seem to become more common, even for desktop memory. Stick to x8 or x4 instead.
Here is a pretty lengthy video that explains the difference: https://www.youtube.com/watch?v=w2bFzQTQ9aI&t=0s
It is for DDR4 in particular, but my current understanding is that similar performance differences exist for DDR5. Still waiting for more conclusive DDR5 benchmarks, but in the meantime I would just avoid x16 modules.

Memory ranks per channel
Why it matters: given the same transfer rate, two ranks per memory channel consistently outperform configurations with a single rank per channel.
So if you are limited to a certain maximum transfer rate, use one DIMM per channel, with dual-rank DIMMs.
A selection of benchmarks, the internet is full of them:
https://downloads.dell.com/manuals/a...rs89_en-us.pdf
https://www.igorslab.de/en/performan...yberpunk-2077/
https://www.anandtech.com/show/17269...urers-matter/2
Based on the results from anandtech, DDR5 seems to perform poorly with two single-rank DIMMs per channel. This was not the case with DDR4, where 2R-1DPC was pretty much the same as 1R-2DPC.
Whether these are teething issues of the new technology, or an inherent feature of DDR5, time will tell. Until then, the same rule applies: One dual-rank DIMM per channel is your best bet.
Side-note: some platforms lower the maximum supported transfer rate with increasing number of ranks per channel. You can sometimes ignore this limit and force transfer rate back to a higher value. But be aware that the more ranks per channel you have, the lower your success rate.

I agree that the broad principles of hardware selection are easily summarized. Perhaps two 1D tables would be useful.

The first table would be example hardware selection where the number of core licenses is the controlling factor (e.g. Fluent). So an illustrative HW selection for 4, 8, 16 and 32 licenses.

The second table would be illustrative HW selection where there are no core licenses (e.g. OpenFOAM) and HW cost is the controlling factor. So examples of 1K, 4K, 8K and 16K euros.

The pace of change would only require this list to be updated every 2 years.

Pretty much sums it up. There may still be questions of which computer to buy, but then you can just answer:

Buy this one; XXX.

Why? Because.

Quote:

Originally Posted by danbence (Post 797045)

Perhaps two 1D tables would be useful.
The first table would be example hardware selection where the number of core licenses is the controlling factor (e.g. Fluent). So an illustrative HW selection for 4, 8, 16 and 32 licenses.
The second table would be illustrative HW selection where there are no core licenses (e.g. OpenFOAM) and HW cost is the controlling factor. So examples of 1K, 4K, 8K and 16K euros.
The pace of change would only require this list to be updated every 2 years.

I see where you are coming from. But I am not too thrilled to compile such lists.
The most immediate reason being that they will be largely outdated in a few months, when Epyc Milan hits the retail market. But I have a few other doubts about their usefulness.

For part 1 with fixed core counts due to licenses, you theoretically want the maximum performance that money can buy - within reason. Licenses are expensive, and so is the time of a CFD engineer, and development time in general. Logically, the hardware costs should not matter a whole lot. If applicable, you can even throw in a few GPUs, because that's mostly what GPU acceleration with commercial CFD codes is all about: skewing the licensing scheme in order to make it a viable solution. So that's another variable that can't be covered in a simple table. By the way, I do intend to add a chapter on graphics cards and GPU acceleration. But in practice, at least from my personal experience, company money is way tighter than it should be, even if skimping out on hardware makes no sense financially.

And for part 2...well, prices and availability vary greatly depending on the region. Fine, let's limit it to North America and Europe. There is still a huge difference between spending 4k for a workstation from Dell, HP, Lenovo and the likes, and buying 4k worth of parts and assembling it yourself. And let's not start about used CPUs, where the real price/performance is hiding. And what's in the budget for each price bracket? Storage? That's highly dependent on the use-case. Same as the amount of RAM.

So I don't think I will compile such lists. If someone else wants to give it a go in another thread, I will certainly put in a link.
And lastly: this thread is not supposed to replace opening a new thread and asking for advice. Then, taking into account the answers from the checklist, people can be nudged towards a solution that best fits their needs. Just without explaining each time why 16-core Ryzen CPUs are a waste of money, and why a cheap GPU (or even an expensive one) won't magically make simulations run 5x faster.

Does your chart represent constant CPU clock speed, or is it possible that the clock speed is boosted at lower core counts? That might explain some of the flattening of the curves.

No, clock speeds were not artificially limited in these tests.
While higher boost frequency with low core counts definitely does account for some of the less-than-ideal scaling, the contribution is rather small. Some of these CPUs like the Epyc 7302 have a very flat boosting behavior anyway.

Ok, that is interesting. I see the EPYC line at low core counts is very straight, so it makes sense with the flat boosting you mention.

I'm quite interested in systems in the sub-8 core range due to licensing cost for commercial codes. Having read up on the importance of memory bandwidth, I'd like to see an example of a bandwidth limited system. For example an Epyc 7262 with 8 vs 4 ram slots populated, i.e. identical cpu with half the memory bandwidth. What would you expect that to do to solver performance?

I think I'll start an 8-core thread as it feels like the gains around this license point are substantial but overlooked and under-reported.

For this thread I wonder if you could comment on whether memory bandwidth has a minimum requirement, beyond which extra bandwidth is redundant?

Also, how significant are memory speed and CPU clock speed? There are scenarios where a higher clocked CPU is available with fewer memory channels, or more memory channels are available with lower memory speed. Is there a rule of thumb for memory channels / memory speed and CPU frequency?

Quote:

Originally Posted by dominicafonso (Post 799353)

Also, how significant are memory speed and CPU clock speed? There are scenarios where a higher clocked CPU is available with fewer memory channels, or more memory channels are available with lower memory speed. Is there a rule of thumb for memory channels / memory speed and CPU frequency?

Memory speed is connected to memory bandwidth which is the most important parameter.

Neither the number of memory channels nor the memory speed alone is the deciding factor. What matters is the resulting memory bandwidth.

A quad-channel CPU (e.g. Xeon 2690) with DDR3-1600 memory will have roughly the same memory bandwidth as a dual-channel CPU (e.g. Ryzen 3600) with DDR4-3200 memory.
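A quick sanity check of that comparison, as a sketch only. It reads the two memory speeds as DDR3-1600 and DDR4-3200 transfer rates and assumes 8 bytes per transfer per channel for both generations. The helper name is made up.

```python
BYTES_PER_TRANSFER = 8  # assumed 64-bit data bus per channel


def peak_gbs(channels: int, transfer_rate_mts: int) -> float:
    """Theoretical peak bandwidth in GB/s for a given channel count and speed."""
    return channels * transfer_rate_mts * BYTES_PER_TRANSFER / 1000


configs = {
    "quad-channel DDR3-1600": (4, 1600),
    "dual-channel DDR4-3200": (2, 3200),
}

for name, (channels, rate) in configs.items():
    print(f"{name}: {peak_gbs(channels, rate):.1f} GB/s theoretical peak")
```

On paper both configurations come out identical. Measured bandwidth will still differ somewhat between platforms, hence "roughly".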

For non-enterprise CPUs you usually have the option to tweak settings and overclock memory. Usually the memory overclocking gain is not linear, but it may still be well worth the investment.

For a maximum 8 core license I would definitely look at the HEDT segment or even the consumer PC segment (Ryzen 5000 series) if you go for high speed memory. There are many better options performance-wise but if you are interested in price performance then that may be a good place to start.

We're all reading the same specs and benchmarks, how is it that there is no system configurator that takes available core licenses, required minimum ram and price point as inputs and returns the ideal parts? How hard should it be? Relevant component data: cpu clock speed, memory bandwidth, perhaps some kind of cpu rating to reflect ipc and other generational changes outside of pure clock speed, anything else?

Is the data from the OpenFOAM benchmark thread collated anywhere?

Quote:

Originally Posted by flotus1 (Post 796969)

Conclusion: core count and number of memory channels should be balanced.

What are memory channels? Can someone explain this?

Is it the number of slots for RAM sticks?

Is the author saying that it is better to have 4x16GB sticks than 2x32GB sticks?

You can find the number of memory channels in the CPU specification. Most motherboards (but not all) aim to have all the memory channels connected. This means at least one memory slot per memory channel. A lot of the Chinese consumer motherboards support only 2 channels of memory even when the processor is a Xeon with 4 channels. Don't be fooled by the 4 DIMM slots or the 4 in the designation of the board.

The number of slots per channel ranges between 1 and 4. More slots allow for more total capacity when DIMMs of the largest supported size are used. However, with more slots in use, the maximum memory transfer rate is often lower.

For a CPU with 4 memory channels, 4x16GB sticks is better than 2x32GB sticks, as long as the DIMMs are placed in the correct slots to take advantage of all channels.

For a CPU with just 2 memory channels, 2x32GB sticks is better in most cases.
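The reasoning behind both answers can be sketched like this. The function names are hypothetical, and the sketch assumes DIMMs are distributed one per channel first, i.e. placed in the correct slots.

```python
import math


def bandwidth_fraction(dimm_count: int, channels: int) -> float:
    """Fraction of the available memory channels that are actually populated."""
    return min(dimm_count, channels) / channels


def dimms_per_channel(dimm_count: int, channels: int) -> int:
    """Highest number of DIMMs that end up sharing one channel."""
    return math.ceil(dimm_count / channels)


for channels in (4, 2):
    for dimm_count, size_gb in ((4, 16), (2, 32)):
        print(f"{channels}-channel CPU, {dimm_count}x{size_gb}GB: "
              f"{bandwidth_fraction(dimm_count, channels):.0%} of channels used, "
              f"{dimms_per_channel(dimm_count, channels)} DIMM(s) per channel")
```

On the 4-channel CPU, 2x32GB leaves half of the channels empty. On the 2-channel CPU both options use all channels, but 4x16GB puts two DIMMs on each channel, which is the case where the maximum transfer rate is often lower.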

Quote:

Originally Posted by wkernkamp (Post 824418)

Do all HP Z840 workstations have 16 memory slots?
Is the goal to have all slots filled with memory modules/sticks?

Yes, all HP Z840 workstations have 16 DIMM slots.
The goal should be to fill at least 8 of them with identical memory modules in the right order, which enables all memory channels on both CPUs. The CPUs for this generation have 4 memory channels each.
All 16 slots filled is fine too.

Quote:

Originally Posted by flotus1 (Post 824432)

If the PC has 16 slots and I want 256GB, is it better to install 8x32GB or 16x16GB?
If I want to add more RAM one day, can I mix modules of different sizes, or must they all have the same capacity?

Specifically talking about this HP Z840 workstation, and 2 CPUs with 4 memory channels each: it doesn't matter a whole lot whether you fill 8 or 16 DIMM slots. There can be a minor performance impact, but more on that later.

Mixing different capacity memory modules usually works, provided they are the same type.
BUT: this always comes with a significant performance impact. I would not recommend it. Same for mixing different DIMMs with the identical capacity, but different internals.

You can stop reading here if you don't want to get bogged down by minor details. But for the sake of answering your question conclusively:
Internally, memory modules are organized into "ranks". There are modules with 1, 2, 4 or even 8 ranks, though the latter is reserved for LRDIMM. For registered memory, the number of ranks is right on the label: "2Rx4" for example denotes a module with 2 ranks.
In order to get the maximum amount of sequential memory throughput -which we want for CFD- at least 2 ranks are required on each memory channel. We are talking about a real-world performance difference in the order of 10%.
How you get 2 ranks on each memory channel is up to you. Again, the HP Z840 has 2 DIMM slots for each memory channel. So you can either fill both slots with single-rank modules, or one of the two slots with a dual-rank module.
Caveat: older hardware generations, and OEM stuff like this in particular, can enforce a lower memory transfer rate with more than 1 rank per channel. The more ranks, the lower the transfer rate. Thus negating most of the benefit from populating more ranks in the first place. You might be able to enforce a higher memory transfer rate in the BIOS, but there is no guarantee the required options are even available, or work as intended.
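For those who want to check a parts list, here is a small sketch that reads the rank label and counts ranks per channel for the HP Z840 layout described above (2 CPUs with 4 channels each). The function names are invented for illustration, and the sketch assumes identical modules spread evenly over all channels.

```python
import re


def parse_label(label: str) -> tuple:
    """Split a label such as '2Rx4' into (ranks, DRAM device width)."""
    match = re.fullmatch(r"(\d+)Rx(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized rank label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def ranks_per_channel(dimm_count: int, label: str, total_channels: int) -> int:
    """Ranks on each channel for an even population with identical DIMMs."""
    if dimm_count % total_channels != 0:
        raise ValueError("DIMM count must be a multiple of the channel count")
    ranks, _width = parse_label(label)
    return (dimm_count // total_channels) * ranks


Z840_CHANNELS = 2 * 4  # two CPUs, four memory channels each

for count, label in [(8, "1Rx4"), (8, "2Rx4"), (16, "1Rx4"), (16, "2Rx4")]:
    ranks = ranks_per_channel(count, label, Z840_CHANNELS)
    note = "ok" if ranks >= 2 else "below the 2 ranks per channel target"
    print(f"{count}x {label}: {ranks} rank(s) per channel ({note})")
```

Eight dual-rank modules and sixteen single-rank modules both reach 2 ranks per channel. The parsed width can also be used to spot x16 modules.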

@flotus1: Same question for a Dell T7910 with E5-2680 v4. I am changing from 16x16GB DDR4-2133 to DDR4-2400 and wonder whether to go with 8x32GB or 16x16GB for an OpenFOAM home lab?

Quote:

Originally Posted by flotus1 (Post 824468)

Your best bet is with 8x32GB. Make sure to buy dual-rank modules, i.e. 2Rx4. So you still get two ranks per channel.

Quote:

Originally Posted by flotus1 (Post 796974)

Still waiting for more conclusive DDR5 benchmarks, but in the meantime I would just avoid x16 modules.

Would you still stand by that statement today?

I was thinking about buying the following workstation, but after reading your statement I have some concerns about the RAM part. Do you think I should change the type of RAM?

-AMD EPYC 9354P Processor (3.25 GHz, 256 MB Cache, 32 Cores, 64 Threads, Turbo up to 3.80 GHz)
-384 GB DDR5-4800 RAM (12x RDIMM 32 GB PC5-38400 ECC Reg.)
-Server-Mainboard with AMD System-on-Chip
-1x PNY Nvidia T1000 with 4 GB GDDR6-RAM
-2x4TB M.2 NVMe SSD WD Black SN850X (1,200,000 IOPS, 2400 TBW, PCIe 4.0 x4)
-AMD Controller onboard
-2x 1 Gbit LAN onboard
-Server Tower Case (black/silver)
7,689€