
Workstation for Ansys Fluent

Old   October 20, 2022, 13:29
Default Workstation for Ansys Fluent
  #1
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
I am looking for a workstation with the following conditions:

- Software: Ansys Fluent
- License constraints: none
- Type of simulations: LES, single-phase incompressible flow, SIMPLEC solver
- Cell count: between 25 and 50 million
- Budget: around 10.000 €
- GPU: only for post-processing; no plans to use it for GPU-accelerated simulations
- Location: Europe
- Plan for the workstation: new parts, and if possible assembled on delivery

I am looking for maximum performance at the maximum number of cores; that is, I am not planning to use the workstation for cases with lower cell counts, RANS, and so on.

I have done some research and noticed the importance of balancing CPU speed and memory, but I do not feel confident enough to make a decision. There are specific recommendations in older posts, but the recommended processors are old as well. In any case, I have some specific questions:

- What number of cores should I target? Based on what I have read, I estimate this value to be 40-48.
- How much RAM should I get? I estimate between 128 GB and 384 GB.
- Which CPU family should I target? Is a dual Intel Xeon Gold 6342 a good choice? It is from the same family as other processors recommended here.
- For the GPU, an NVIDIA RTX A4000 seems to be the general recommendation. Is this okay?

I have asked several computer shops, but the requirements of FVM solvers do not seem to be well known there. In particular, I have received offers built around processors such as AMD Threadripper, which are not well regarded here.

Thank you
Beans8 is offline   Reply With Quote

Old   October 20, 2022, 15:54
Default
  #2
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Quote:
- What number of cores should I target? Based on what I have read, I estimate this value to be 40-48.
Since you are not limited by licenses, as many as you can get.
But 48-64 would be a reasonable target for dual-socket systems

Quote:
- How much RAM should I get? I estimate between 128 GB and 384 GB.
256GB, populated as 16x16GB DDR4-3200 reg ECC
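As a rough sanity check on capacity (the GB-per-million-cells figure below is a commonly quoted rule of thumb, not something measured in this thread):

Code:
# Rough RAM sizing for the stated case sizes
# Assumption: ~1-2 GB per million cells for single-phase Fluent cases;
# actual usage depends on models, precision and partitioning.
cells_max        = 50e6
gb_per_mio_cells = 2.0                                   # assumed upper end
solver_ram_gb    = cells_max / 1e6 * gb_per_mio_cells    # ~100 GB for the solver alone
# 256 GB (16x16GB) covers this comfortably, with headroom for meshing, OS and post-processing.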

Quote:
- Which CPU family should I target? Is a dual Intel Xeon Gold 6342 a good choice? It is from the same family as other processors recommended here.
Not a bad choice.
Though if you can get them from your suppliers, AMD Epyc Milan CPUs would give you a little higher performance.
In order of how fast they are: Epyc 7413 (24-core), Epyc 7513 (32-core), Epyc 7543 (32-core, twice the L3 cache)

Quote:
- For the GPU, an NVIDIA RTX A4000 seems to be the general recommendation. Is this okay?
That might be a bit on the expensive side, without much benefit for you. Any semi-modern graphics card with at least 8GB will do. For example the RTX A2000 12GB. Careful, there is also a 6GB version of this card. Older cards are also fine if your supplier sells them significantly cheaper than the A2000, as long as they have enough memory.
flotus1 is offline   Reply With Quote

Old   October 20, 2022, 16:00
Default
  #3
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Since you are not limited by licenses, as many as you can get.
But 48-64 would be a reasonable target for dual-socket systems


256GB, populated as 16x16GB DDR4-3200 reg ECC


Not a bad choice.
Though if you can get them from your suppliers, AMD Epyc Milan CPUs would give you a little higher performance.
In order of how fast they are: Epyc 7413 (24-core), Epyc 7513 (32-core), Epyc 7543 (32-core, twice the L3 cache)


That might be a bit on the expensive side, without much benefit for you. Any modern graphics card with at least 8GB will do. For example the RTX A2000 12GB. Careful, there is also a 6GB version of this card. Older cards are also fine if your supplier sells them significantly cheaper than the A2000, as long as they have enough memory.
Thank you very much for your answers. With that configuration, what parallel performance can I expect with the maximum number of cores?
Beans8 is offline   Reply With Quote

Old   October 20, 2022, 16:12
Default
  #4
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
I'd ask what you mean by "parallel performance"
-Scaling, i.e. how fast this runs with 64 threads compared to 1 thread
-Total performance: how long it takes to solve your models

But unfortunately, I can't answer either of these questions. It depends...
The important thing is that this is the best you can get for around 10000€.
flotus1 is offline   Reply With Quote

Old   October 20, 2022, 16:23
Default
  #5
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
I'd ask what you mean by "parallel performance"
-Scaling, i.e. how fast this runs with 64 threads compared to 1 thread
-Total performance: how long it takes to solve your models

But unfortunately, I can't answer either of these questions. It depends...
The important thing is that this is the best you can get for around 10000€.
I was referring to the scaling. Thank you in any case! Your previous answer gives me much more confidence.
Beans8 is offline   Reply With Quote

Old   October 21, 2022, 12:20
Default
  #6
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
For a single workstation, dual 7473X would probably be the best bang for the buck within your price range; just make sure every DIMM slot is populated. The GPU isn't too important if it is only used for post-processing; just get whatever is cheapest from the last couple of generations with enough VRAM.

Although since you claim no core count licensing limit, a cluster comprised of prior-gen 2P nodes (dual EPYC Rome or dual 2nd-gen Xeon Scalable) with 80-96 total cores would be pretty hard to beat. But you'd be limited to the used/surplus market and all the risks therein.
the_phew is offline   Reply With Quote

Old   October 21, 2022, 15:57
Default
  #7
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by the_phew View Post
For a single workstation, dual 7473X would probably be the best bang for the buck within your price range; just make sure every DIMM slot is populated. The GPU isn't too important if it is only used for post-processing; just get whatever is cheapest from the last couple of generations with enough VRAM.

Although since you claim no core count licensing limit, a cluster comprised of prior-gen 2P nodes (dual EPYC Rome or dual 2nd-gen Xeon Scalable) with 80-96 total cores would be pretty hard to beat. But you'd be limited to the used/surplus market and all the risks therein.
Thank you for the suggestion.

Based on your comments and other feedback I got, it seems clear that I should focus on AMD Epyc instead of Intel Xeon. These two parts I am already 100% sure about:

SSD: Kingston KC3000 PCIe 4.0 NVMe SSD M.2 - 1 TB (I have other servers to store data)
GPU: NVIDIA RTX A2000 12 GB

Regarding the processors, and considering that I may buy the components separately (a colleague would then build the workstation), I have some additional questions:

1) Your suggestions point to the EPYC 7003 series:

https://www.amd.com/en/processors/epyc-7003-series

EPYC 7543: 32 cores, 2.8 GHz, 256 MB L3
EPYC 7473X: 24 cores, 2.8 GHz, 768 MB L3

What about these other options? (Assuming that I can afford to buy them)

EPYC 7773X: 64 cores, 2.2 GHz, 768 MB L3
EPYC 7573X: 32 cores, 2.8 GHz, 768 MB L3

Maybe the answer is ''they perform better, of course, but they are more expensive'', but it is not clear to me what the differences are in terms of real performance. If paying 20% more provides 15% more performance, I would consider it a real option. In any case, does the upgrade from 256 MB to 768 MB of L3 cache make an important difference? Or does it depend on the processor?

2) For the RAM, I have found different motherboards for the 7003 series: for example, the Gigabyte MZ72, the MSI D4020, and the Supermicro H11SSL-i. However, the Gigabyte has 16 slots, while the other two have only 8; that is, in the first case I would use 16x16 GB, while in the second I would use 8x32 GB. Is there any difference? Or is the performance the same as long as all of the channels are filled?

Thank you in advance
Beans8 is offline   Reply With Quote

Old   October 21, 2022, 16:14
Default
  #8
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
There are reasons why I did not include any Milan-X CPUs in my list of recommendations:
1) They do not fit the budget. The 24-core version retails for 4200€. There is no way to have a workstation built with 2 of those in the 10000€ range. Not even if you bought the parts and assembled it yourself.
2) They are the right tool when constrained by parallel licenses. Due to higher performance per core, they help you make the most out of the limited licenses you have. The first post explicitly stated that licenses are not an issue.

Quote:
What about these other options? (Assuming that I can afford to buy them)
EPYC 7773X: 64 cores, 2.2 GHz, 768 MB L3
EPYC 7573X: 32 cores, 2.8 GHz, 768 MB L3
~10-15% performance difference in Ansys Fluent between the 32-core and the 64-core variant. Which is why I recommended the 48-64 core range for a dual-socket workstation. You have a limited budget, and price/performance takes a nose-dive beyond 64 cores.

Quote:
2) For the RAM, I have found different motherboards for the 7003 series: for example, the Gigabyte MZ72, the MSI D4020, and the Supermicro H11SSL-i. However, the Gigabyte has 16 slots, while the other two have only 8; that is, in the first case I would use 16x16 GB, while in the second I would use 8x32 GB. Is there any difference? Or is the performance the same as long as all of the channels are filled?
MSI D4020 is a single-socket board. Not the right pick here for two CPUs. Same for the H11SSL-i. But that's nothing you need to worry about when buying a workstation fully assembled.
Either way, a list of motherboards that would work: https://geizhals.eu/?cat=mbsp3&xf=16...C7003%7E4921_2
I can recommend the Gigabyte MZ72-HB0 for a workstation.
All CPUs mentioned so far have 8 memory channels. You need to populate them all for maximum performance, i.e. 16 DIMMs for 2 CPUs. Which is why I specified that the 256GB of RAM need to be populated as 16x16GB. Leaving half the memory channels empty can be a performance hit similar to using 1 CPU instead of 2. It gets even worse if you ignore the memory population guidelines in the motherboard manual.
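A back-of-envelope illustration of what is at stake (theoretical peak figures for DDR4-3200; real-world bandwidth is lower, but the ratio holds):

Code:
# Theoretical peak memory bandwidth for a dual-socket Epyc Milan system
channels_per_cpu = 8
data_rate        = 3.2e9          # DDR4-3200 -> 3200 MT/s
bytes_per_beat   = 8              # 64-bit wide channel
bw_per_cpu = channels_per_cpu * data_rate * bytes_per_beat   # ~205 GB/s
bw_16dimms = 2 * bw_per_cpu                                  # ~410 GB/s with all 16 slots filled
bw_8dimms  = bw_16dimms / 2                                  # ~205 GB/s with half the channels empty
# A memory-bound solver loses roughly that factor of two in throughput with only 8 DIMMs.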
flotus1 is offline   Reply With Quote

Old   October 21, 2022, 16:45
Default
  #9
Member
 
Matt
Join Date: May 2011
Posts: 43
Rep Power: 14
the_phew is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
There are reasons why I did not include any Milan-X CPUs in my list of recommendations:
1) They do not fit the budget. The 24-core version retails for 4200€. There is no way to have a workstation built with 2 of those in the 10000€ range. Not even if you bought the parts and assembled it yourself.
2) They are the right tool when constrained by parallel licenses. Due to higher performance per core, they help you make the most out of the limited licenses you have. The first post explicitly stated that licenses are not an issue.
I don't know about Europe, but Milan-X CPUs seem to be plummeting in price here in the U.S. with EPYC Genoa imminent (already released for exascale customers, I imagine): http://www.acmemicro.com/Product/184...TDP-1P-2P-TRAY
Listed as backordered, but they still list a 3-day lead time, so who knows. I have never ordered from this e-tailer.

Heck, you could almost swing 2x32 cores of Milan-X:
https://www.ebay.com/itm/17534318207...Bk9SR-yr-sT_YA

Depending on grid size, 3D V-Cache seems to offer more speedup than adding cores once you are past about 3 cores per memory channel.
the_phew is offline   Reply With Quote

Old   October 21, 2022, 17:12
Default
  #10
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
There are reasons why I did not include any Milan-X CPUs in my list of recommendations:
1) They do not fit the budget. The 24-core version retails for 4200€. There is no way to have a workstation built with 2 of those in the 10000€ range. Not even if you bought the parts and assembled it yourself.
2) They are the right tool when constrained by parallel licenses. Due to higher performance per core, they help you make the most out of the limited licenses you have. The first post explicitly stated that licenses are not an issue.


~10-15% performance difference in Ansys Fluent between the 32-core and the 64-core variant. Which is why I recommended the 48-64 core range for a dual-socket workstation. You have a limited budget, and price/performance takes a nose-dive beyond 64 cores.


MSI D4020 is a single-socket board. Not the right pick here for two CPUs. Same for the H11SSL-i. But that's nothing you need to worry about when buying a workstation fully assembled.
Either way, a list of motherboards that would work: https://geizhals.eu/?cat=mbsp3&xf=16...C7003%7E4921_2
I can recommend the Gigabyte MZ72-HB0 for a workstation.
All CPUs mentioned so far have 8 memory channels. You need to populate them all for maximum performance, i.e. 16 DIMMs for 2 CPUs. Which is why I specified that the 256GB of RAM need to be populated as 16x16GB. Leaving half the memory channels empty can be a performance hit similar to using 1 CPU instead of 2. It gets even worse if you ignore the memory population guidelines in the motherboard manual.

Thank you very much for your comments.


I was thinking of the 1x64 option as an alternative to 2x32 (or 2x28). My reasoning was: ''Even if the clock is lower in the 1x64 case, maybe the communication is better within a single CPU than between two CPUs.'' But I understand that this is not the case.


Between the EPYC 7473X and the EPYC 7573X, can I expect a performance increase of about 25% (just based on the number of cores)?

Thank you
Beans8 is offline   Reply With Quote

Old   October 22, 2022, 05:17
Default
  #11
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Don't know any exact numbers, there are very few detailed benchmarks for the Milan-X CPUs. And the impact of the increased L3 cache is very dependent on the workload.

But...the 7473X and 7573X now retail for about the same amount of money in my part of the world. Around 4200€. So if you are going to buy parts, definitely get the 7573X. Considering the cost of the whole computer, it is probably still worth it even when buying from some SI.
If you can stretch your budget far enough, definitely get the 7573X. It is the best CPU for CFD after all. At least until the end of the year when the next generation launches.
flotus1 is offline   Reply With Quote

Old   October 23, 2022, 06:27
Default
  #12
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Don't know any exact numbers, there are very few detailed benchmarks for the Milan-X CPUs. And the impact of the increased L3 cache is very dependent on the workload.

But...the 7473X and 7573X now retail for about the same amount of money in my part of the world. Around 4200€. So if you are going to buy parts, definitely get the 7573X. Considering the cost of the whole computer, it is probably still worth it even when buying from some SI.
If you can stretch your budget far enough, definitely get the 7573X. It is the best CPU for CFD after all. At least until the end of the year when the next generation launches.
Thank you for your recommendation. In principle I was thinking of buying the workstation assembled from a supplier, but I have noticed that the price difference between that option and assembling it myself is huge. It seems that this is in part due to the large differences in processor price between suppliers (the price of the 7573X ranges from 4200 € to 7000 €).

I think that I can afford to pay a little more than 10.000 € if this allows me to get a decent workstation. I have done some additional research, and I have some questions:


MOTHERBOARD
If I understand correctly, there are two variants of the Gigabyte MZ72-HB0: the 1.x and the 3.0/4.0. Based on the information provided on the website, the former was built for the 7002 series, while the latter was built for the 7003 series. Am I correct? If so, should I consider only the 3.0/4.0 variant? I mention this because all of the stores I have found in Europe list the 1.x version. I can contact more companies in any case.

As an alternative, the SUPERMICRO H12DSi-NT6 costs more or less the same and it can be used for both 7002 and 7003 series. Is this a good option?

RAM
For the specification 16 GB DDR4-3200, there are different models whose prices range from 50 € to 100 €. Is there any additional parameter I should take into account when buying the RAM? The voltage varies depending on the model, but I am not sure whether this makes a difference.


CASE
The Fractal Torrent seems to fit the Supermicro and Gigabyte motherboards, and it includes five fans. I have read some reviews, and it seems to be a good option in terms of cooling. What is your opinion of this case?


POWER SUPPLY
Based on the calculators available on the internet, I have deduced that I need a power supply of 1000 W. Would something like the CORSAIR 1000RMe work?
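Roughly, this is the arithmetic behind that figure (the TDP and per-part numbers are my own assumptions, not from the calculators):

Code:
# Rough power budget at full load (assumed figures)
cpus  = 2 * 280        # two Epyc Milan(-X) CPUs at 280 W TDP each
dimms = 16 * 4         # ~4 W per RDIMM
gpu   = 70             # RTX A2000 board power
rest  = 60             # SSD, fans, motherboard, margin
total = cpus + dimms + gpu + rest   # ~750 W
# A quality 1000 W unit leaves ~25% headroom and stays near its efficiency sweet spot.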


I think that with all of these parts I would be ready for assembly (a colleague can do that, but I have to provide all of the components).

Thank you
Beans8 is offline   Reply With Quote

Old   October 23, 2022, 07:20
Default
  #13
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura aboutflotus1 has a spectacular aura about
Motherboard
I'm not 100% on this, but the older revisions 1 and 2 might support Milan CPUs after a bios update. That would be a question for Gigabyte support.
But yeah, I would probably want revision 3-4. Just contact the seller in advance and ask them about this. The information they put on their website might just be outdated or a placeholder. Most sellers don't specify revisions at all, so you have to ask anyway.

What I liked about the Gigabyte board in particular:
1) Sufficient cooling for VRMs to work in a workstation case without much hassle. Supermicro are server boards first, relying on server-grade airflow (=loud) to cool the components.
2) Built-in fan control. Supermicro boards are always a pain in the ass to use for workstations. The default fan thresholds will identify normal fans as faulty. And dialing in a fan curve requires some serious effort with 3rd party tools.
Gigabyte's solution is just much more elegant, easier to use, and works better.

RAM
You need registered ECC memory. There aren't many degrees of freedom here. Maybe you were looking at UDIMM instead?
https://geizhals.de/?cat=ramddr3&xf=..._RDIMM+mit+ECC

CASE
The Fractal Torrent is not a bad case for air cooling and will likely work fine on account of brute force.
However, you will probably end up using Noctua CPU coolers on these SP3 sockets. They blow bottom-to-top. The top, where you would want to exhaust the heat from CPUs+RAM, is closed off in the Fractal Torrent.
The ideal case for air cooling dual-socket Epyc is the Phanteks Enthoo Pro 2. It has plenty of room for fans at the bottom and the top. It doesn't come with any fans installed though. Arctic F12 PWM (3x bottom) and Arctic F14 PWM (3x top) are a great low-cost option at 5€ a piece.

POWER SUPPLY
CORSAIR 1000RMe will work.
flotus1 is offline   Reply With Quote

Old   October 23, 2022, 12:20
Default
  #14
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Motherboard
I'm not 100% on this, but the older revisions 1 and 2 might support Milan CPUs after a bios update. That would be a question for Gigabyte support.
But yeah, I would probably want revision 3-4. Just contact the seller in advance and ask them about this. The information they put on their website might just be outdated or a placeholder. Most sellers don't specify revisions at all, so you have to ask anyway.

What I liked about the Gigabyte board in particular:
1) Sufficient cooling for VRMs to work in a workstation case without much hassle. Supermicro are server boards first, relying on server-grade airflow (=loud) to cool the components.
2) Built-in fan control. Supermicro boards are always a pain in the ass to use for workstations. The default fan thresholds will identify normal fans as faulty. And dialing in a fan curve requires some serious effort with 3rd party tools.
Gigabyte's solution is just much more elegant, easier to use, and works better.

RAM
You need registered ECC memory. There aren't many degrees of freedom here. Maybe you were looking at UDIMM instead?
https://geizhals.de/?cat=ramddr3&xf=..._RDIMM+mit+ECC

CASE
The Fractal Torrent is not a bad case for air cooling and will likely work fine on account of brute force.
However, you will probably end up using Noctua CPU coolers on these SP3 sockets. They blow bottom-to-top. The top, where you would want to exhaust the heat from CPUs+RAM, is closed off in the Fractal Torrent.
The ideal case for air cooling dual-socket Epyc is the Phanteks Enthoo Pro 2. It has plenty of room for fans at the bottom and the top. It doesn't come with any fans installed though. Arctic F12 PWM (3x bottom) and Arctic F14 PWM (3x top) are a great low-cost option at 5€ a piece.

POWER SUPPLY
CORSAIR 1000RMe will work.
Thank you for your suggestions.

Regarding the RAM, I had missed the ECC requirement. I have decided to buy the Kingston DDR4-3200 ECC Reg KSM32RS8/16MFR, and I will follow your suggestions: I will try to buy revision 3-4 of the Gigabyte motherboard, plus the recommended case and fans.

I do not know how long it will take to build it and start using the workstation, but once I have it I will post a reply with the performance results in Ansys Fluent.

Thank you very much for your help, it has been very useful
Beans8 is offline   Reply With Quote

Old   October 23, 2022, 14:39
Default
  #15
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
By the way, even though I have read the post about performance in OpenFOAM and other posts similar to mine, I still do not grasp where the key to CFD performance lies. This is what I understand:

Loosely speaking, I understand that there are three types of memory: the SSD, the RAM, and the cache. The differences in speed between them (ordered from slower to faster) are about one order of magnitude each.

When solving the NS equations in parallel, the SSD only plays the role of saving data before and after the simulation, and the RAM stores intermediate values in the calculation, such as matrices.

Based on the above, I have some questions:

- What does the cache memory store?

- What causes the limited performance of CFD solvers at high core counts, in particular the poor scaling beyond roughly 16 cores (the exact value depends on the processor)? I was thinking that the reason was slow memory, but I read in another post that a larger L3 cache only increases performance at all core counts and does not improve scalability. So would an L3 cache on the order of the RAM size not improve scalability? If the answer is no, what (in an ideal world) would have to change to get 100% scalability at any core count?

- OpenFOAM and Fluent run (mainly) with implicit solvers. Is the scaling problem solved by explicit solvers, such as the ones used in high-order methods?

- I have read that running on more than around 2x32 cores does not improve performance (at least on a limited budget). So what is the approach followed in supercomputers? (Maybe the answer to this question is the same as the one to the second question.)

Thank you
Beans8 is offline   Reply With Quote

Old   October 23, 2022, 15:17
Default
  #16
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
CPU caches are...complicated.
One easy way to look at them: they are just another tier of even faster storage. Before fetching data from RAM (slow compared to CPU cache), the CPU checks if the values can be found in one of the caches for whatever reason.
Two of the reasons such a cache hit might occur:
1)The value has been copied into cache previously because it was recently used in a calculation
2)The CPU predicted that the value might be needed soon, and pre-fetched it from RAM into cache.
Such cache hits save the time it would take to fetch the values from memory when they are needed, which in terms of CPU cycles is an extremely long time.
Bigger caches mean a higher probability of cache hits, and thus reduced dependence on slow RAM.
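A small sketch of the effect (sizes and timings are illustrative only; the point is the contrast between data that stays resident in cache and data that has to be streamed from RAM):

Code:
import numpy as np
import time

small = np.ones(2_000_000)      # ~16 MB of float64: fits in a large L3 cache
large = np.ones(100_000_000)    # ~800 MB: must be streamed from RAM on every pass

def effective_gbps(arr, passes=20):
    t0 = time.perf_counter()
    for _ in range(passes):
        arr.sum()               # identical arithmetic, different data footprint
    dt = time.perf_counter() - t0
    return arr.nbytes * passes / dt / 1e9

print(effective_gbps(small))    # served mostly from cache after the first pass
print(effective_gbps(large))    # limited by RAM bandwidth; usually noticeably lower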

Less than ideal intra-node scaling in CFD is mostly caused by a memory bandwidth bottleneck. The CPU cores can chew through the computations faster than the memory can provide the data required for the computations.
In addition, memory latency increases when the memory interface is utilized close to maximum bandwidth. Making memory access even slower.
Search terms if you want to know more: "roofline model" "arithmetic intensity" "loaded memory latency"
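For the roofline model specifically, a back-of-envelope sketch (all numbers are illustrative assumptions, not measurements):

Code:
# Roofline estimate for one Epyc 7543-class socket (illustrative assumptions)
peak_flops = 32 * 2.8e9 * 16      # cores x clock x 16 DP FLOP/cycle (2x 256-bit FMA) ~ 1.4 TFLOP/s
mem_bw     = 8 * 3.2e9 * 8        # 8 channels x 3200 MT/s x 8 bytes ~ 205 GB/s
ai         = 0.25                 # FLOP per byte moved; FV CFD codes sit around this order of magnitude
attainable = min(peak_flops, mem_bw * ai)   # ~51 GFLOP/s: the bandwidth term wins by a wide margin
# This is why the cores run out of data long before the floating-point units are saturated.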

Supercomputers -i.e. clusters- just add more CPUs.
The limitations here are caused by so-called shared CPU resources. A CPU has a finite amount of them (last level cache, memory bandwidth...) and the CPU cores compete for utilization. They share them. Adding more CPU cores doesn't help when the shared CPU resources are already fully utilized.
That's where clusters come in. Each compute node in a cluster is an additional set of shared CPU resources. And only the CPU cores within that node need to compete for them.
Intra-node scaling, the type of scaling we discussed up to this point, is limited by the shared CPU resources.
Inter-node scaling is completely unrelated. Adding more compute nodes adds shared CPU resources at the same rate. Which is why CFD codes that stop scaling within a node at around 4 cores per memory channel can still scale linearly to thousands of cores when run on several compute nodes.
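A minimal sketch of that difference (numbers assumed for illustration):

Code:
# Intra-node: the shared resources are fixed, so per-core bandwidth shrinks
node_bw        = 410e9            # ~410 GB/s theoretical peak for one dual-socket Milan node
bw_per_core_64 = node_bw / 64     # each additional core only gets a smaller slice
# Inter-node: every added node brings its own memory channels, caches and cores
nodes    = 4
total_bw = nodes * node_bw        # ~1.6 TB/s across the cluster -> scaling can stay near-linear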

There are no "ideal" CPUs for memory bound workloads because they are a relatively small niche in computing. Development will always focus on maximum density for compute-bound applications first.
Caches are expensive. Both in terms of the area they occupy on a die, and in terms of power consumption. And a rule of thumb is: bigger caches are slower. AMD kind of got around this rule with "3D V-cache", but one of the reasons is that their regular L3 caches are relatively high latency to begin with.
Same for memory bandwidth. Just adding more channels isn't easy. Not only, but also due to the fact that a fairly large percentage of the ~4000 pins on current-gen server CPUs is allocated to memory. More channels means even more pins. Larger sockets, smaller pins, no more sockets at all because the CPU gets soldered to the board directly. All expensive solutions, or stuff customers aren't willing to accept yet.
And you can always use a cluster if you need more performance for memory-bound applications. Well not always, but at least for what we usually need.
flotus1 is offline   Reply With Quote

Old   October 23, 2022, 15:44
Default
  #17
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
CPU caches are...complicated.
One easy way to look at them: they are just another tier of even faster storage. Before fetching data from RAM (slow compared to CPU cache), the CPU checks if the values can be found in one of the caches for whatever reason.
Two of the reasons such a cache hit might occur:
1)The value has been copied into cache previously because it was recently used in a calculation
2)The CPU predicted that the value might be needed soon, and pre-fetched it from RAM into cache.
Such cache hits save the time it would take to fetch the values from memory when they are needed, which in terms of CPU cycles is an extremely long time.
Bigger caches mean a higher probability of cache hits, and thus reduced dependence on slow RAM.

Less than ideal intra-node scaling in CFD is mostly caused by a memory bandwidth bottleneck. The CPU cores can chew through the computations faster than the memory can provide the data required for the computations.
In addition, memory latency increases when the memory interface is utilized close to maximum bandwidth. Making memory access even slower.
Search terms if you want to know more: "roofline model" "arithmetic intensity" "loaded memory latency"

Supercomputers -i.e. clusters- just add more CPUs.
The limitations here are caused by so-called shared CPU resources. A CPU has a finite amount of them (last level cache, memory bandwidth...) and the CPU cores compete for utilization. They share them. Adding more CPU cores doesn't help when the shared CPU resources are already fully utilized.
That's where clusters come in. Each compute node in a cluster is an additional set of shared CPU resources. And only the CPU cores within that node need to compete for them.
Intra-node scaling, the type of scaling we discussed up to this point, is limited by the shared CPU resources.
Inter-node scaling is completely unrelated. Adding more compute nodes adds shared CPU resources at the same rate. Which is why CFD codes that stop scaling within a node at around 4 cores per memory channel can still scale linearly to thousands of cores when run on several compute nodes.

There are no "ideal" CPUs for memory bound workloads because they are a relatively small niche in computing. Development will always focus on maximum density for compute-bound applications first.
Caches are expensive. Both in terms of the area they occupy on a die, and in terms of power consumption. And a rule of thumb is: bigger caches are slower. AMD kind of got around this rule with "3D V-cache", but one of the reasons is that their regular L3 caches are relatively high latency to begin with.
Same for memory bandwidth. Just adding more channels isn't easy. Not only, but also due to the fact that a fairly large percentage of the ~4000 pins on current-gen server CPUs is allocated to memory. More channels means even more pins. Larger sockets, smaller pins, no more sockets at all because the CPU gets soldered to the board directly. All expensive solutions, or stuff customers aren't willing to accept yet.
And you can always use a cluster if you need more performance for memory-bound applications. Well not always, but at least for what we usually need.
Thank you very much for the detailed answer.

If the way clusters work (with multiple nodes) is so much more effective, would it not be possible to build a small cluster? Why are there no clusters of, let's say, 16 nodes with 4 cores each? Or is that solution much more expensive than a 2x32-core workstation?
Beans8 is offline   Reply With Quote

Old   October 23, 2022, 16:04
Default
  #18
Super Moderator
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Rep Power: 46
flotus1 has a spectacular aura about
Clusters consisting of nodes with relatively cheap commodity hardware are a thing, yes.
There are downsides of course:
1) The setup is not as trivial as for a single computer.
2) Multiple costs. You need motherboards, PSU and a case for each node. The components can be cheaper, but it still adds up
3) Node interconnects. Good old Ethernet can work for very small clusters. But at some point, inter-node scaling becomes limited by the node interconnect. That's when you need to look into Infiniband and the likes. This hardware is prohibitively expensive when bought new. If you put in the research effort, you can buy used adapters and switches fairly cheap on ebay.
4) Other limitations: what if your meshing process requires a large amount of memory in a single shared memory system. But all you have is 16 nodes with 16GB of memory each. You might get around this with one of the nodes having more memory than the others. But that's just another bit of added cost for the cluster

Long story short: it can be cheaper and/or faster than a single node. But you have to put in a lot of research effort first. And you give up some "quality of life features". I would only recommend that for "enthusiasts". People who enjoy tinkering with hardware and software, and don't consider it a waste of time that could be spent with more productive tasks.
flotus1 is offline   Reply With Quote

Old   October 23, 2022, 16:08
Default
  #19
New Member
 
Join Date: Oct 2022
Posts: 24
Rep Power: 3
Beans8 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Clusters consisting of nodes with relatively cheap commodity hardware are a thing, yes.
There are downsides of course:
1) The setup is not as trivial as for a single computer.
2) Multiple costs. You need motherboards, PSU and a case for each node. The components can be cheaper, but it still adds up
3) Node interconnects. Good old Ethernet can work for very small clusters. But at some point, inter-node scaling becomes limited by the node interconnect. That's when you need to look into Infiniband and the likes. This hardware is prohibitively expensive when bought new. If you put in the research effort, you can buy used adapters and switches fairly cheap on ebay.
4) Other limitations: what if your meshing process requires a large amount of memory in a single shared memory system. But all you have is 16 nodes with 16GB of memory each. You might get around this with one of the nodes having more memory than the others. But that's just another bit of added cost for the cluster

Long story short: it can be cheaper and/or faster than a single node. But you have to put in a lot of research effort first. And you give up some "quality of life features". I would only recommend that for "enthusiasts". People who enjoy tinkering with hardware and software, and don't consider it a waste of time that could be spent with more productive tasks.
Everything clear. Thank you!
Beans8 is offline   Reply With Quote

Old   October 27, 2022, 15:16
Default
  #20
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23
evcelica is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
Long story short: it can be cheaper and/or faster than a single node. But you have to put in a lot of research effort first. And you give up some "quality of life features". I would only recommend that for "enthusiasts". People who enjoy tinkering with hardware and software, and don't consider it a waste of time that could be spent with more productive tasks.
I couldn't agree more with this. I spent so much time configuring and maintaining the Infiniband clusters I've built in the past, mainly because I thought it was fun at the time. But now I just use a single large workstation, as the extra effort just wasn't worth it in my case.
evcelica is offline   Reply With Quote
