CFD Online Discussion Forums

-   Hardware (https://www.cfd-online.com/Forums/hardware/)
-   -   Socket 2011-3 processors - an overview (https://www.cfd-online.com/Forums/hardware/184821-socket-2011-3-processors-overwiew.html)

flotus1 March 12, 2017 11:11

Socket 2011-3 processors - an overview
 
Lineup
It has come to my attention that there is no single source for a quick overview of the current lineup of Intel processors for socket 2011-3, specifically the Broadwell-E and Broadwell-EP CPUs. And to be honest, I was not aware of every option that might be interesting from a CFD point of view. So here is my attempt to put as much relevant information as possible into a single table.
Some CPUs are missing because they are not freely available (e.g. exclusive to some OEMs) or rather irrelevant for CFD. The "all core turbo" frequency in the last column is, to the best of my knowledge, the frequency for execution of AVX code.

http://i.imgur.com/dUvLsu8.png



An attempt to rate the performance
Since all the processors above share the same architecture, it is possible to rate their relative performance based on their specifications. A very simple model for this is Amdahl's law. It makes the assumption that using N cores instead of one does not make the computation N times faster; instead, you see diminishing returns as more cores are added. This can be caused, for example, by portions of the code that were not parallelized, or by running out of memory bandwidth to keep the additional cores fed with data. The model is not perfect, but adding more complexity would not make things clearer.
There are two more factors we have to take into account: memory speed and cache per core, since these are not equal across the lineup. So we make the assumption that lower memory speed translates into less CFD computing power with an efficiency of 75%. For example: 11% lower memory speed results in a penalty of 8.3%. More cache per core is given 20% efficiency, so 50% more cache per core gives a bonus of 10%.
The result is a number that can be used to estimate the relative performance of these CPUs.
Note that for the dual-socket CPUs, twice the number of cores was used for Amdahl's law.
While we are at it, we can try to rate the price/performance ratio as well. We simply take the performance number and divide it by the cost of one or two CPUs plus the "typical" cost for the rest of the workstation. I used $2000 for a dual-socket workstation and $1300 for a single-socket system.
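For readers who want to reproduce the numbers, here is a minimal Python sketch of the rating model described above, as I read it; the reference values, prices, and the 96% scaling efficiency in the examples are illustrative assumptions, and the exact normalization used in the tables may differ:

```python
def amdahl_speedup(n_cores, efficiency):
    """Amdahl's law: speedup on n_cores with parallel fraction 'efficiency'."""
    p = efficiency
    return 1.0 / ((1.0 - p) + p / n_cores)

def performance_score(n_cores, clock_ghz, mem_speed, cache_per_core,
                      ref_mem_speed=2400.0, ref_cache_per_core=2.5,
                      efficiency=0.96):
    """Relative performance estimate following the model described above."""
    score = clock_ghz * amdahl_speedup(n_cores, efficiency)
    # Lower memory speed penalizes with 75% efficiency:
    # e.g. 11% less bandwidth -> 0.75 * 11% ~ 8.3% penalty.
    score *= 1.0 + 0.75 * (mem_speed / ref_mem_speed - 1.0)
    # More cache per core gives a bonus with 20% efficiency:
    # e.g. 50% more cache per core -> 10% bonus.
    score *= 1.0 + 0.20 * (cache_per_core / ref_cache_per_core - 1.0)
    return score

# Hypothetical examples (not taken from the tables):
single = performance_score(8, 3.2, 2400, 2.5)     # one 8-core CPU
dual = performance_score(24, 2.6, 2400, 2.5)      # 2 x 12 cores: total core count
price_perf_single = single / (1700 + 1300)        # CPU price + $1300 base system
price_perf_dual = dual / (2 * 1200 + 2000)        # 2 CPUs + $2000 base system
```

Note that, as in the tables, the dual-socket entry feeds the total core count of both CPUs into Amdahl's law.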
Without further ado, here are the results for five different scaling efficiencies. All numbers are normalized to the lowest value:

http://i.imgur.com/Mp0eLWX.png

Now all you need to know is how your software scales. An efficiency of 99% is highly unlikely for a CFD code, at least when using a very large number of cores; this is what is usually referred to as the "memory bandwidth bottleneck". It is why the CPUs with very high core counts are usually not beneficial for CFD, or at least not the best use of your money. You should probably focus on the range 95%-98%.
To back up this claim we can take a look at this whitepaper from HP: http://www.peraglobal.com/upload/con...2555_13361.pdf
As a mean value for several Fluent benchmarks, they report a speedup of ~16 for a single node with 32 cores. This translates to a scaling efficiency of ~96.8%.
The CFX benchmarks show a speedup of ~18 on a single node with 24 cores, which corresponds to a scaling efficiency of ~98.6% in the model of Amdahl's law.
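These efficiencies follow from solving Amdahl's law for the parallel fraction given a measured speedup S on N cores; a short Python sketch:

```python
def parallel_fraction(speedup, n_cores):
    """Invert Amdahl's law: the parallel fraction p that yields the
    given speedup on n_cores, i.e. p = (1 - 1/S) * N / (N - 1)."""
    return (1.0 - 1.0 / speedup) * n_cores / (n_cores - 1.0)

# Fluent example from the HP whitepaper: speedup ~16 on 32 cores
print(round(parallel_fraction(16.0, 32), 3))  # -> 0.968
```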



Questions and answers:
Q: So which one is the best processor for CFD?
A: It depends. The answer will usually be one of the processors with medium core counts and a high frequency. But you also have to factor in licensing issues. If all you have is parallel licenses for 8 cores, buying a 24-core workstation is usually not the best use of your budget.
Q: Why are some processors missing?
A: I focused on processors you can buy off the shelf. The quad-socket processors are also missing because you will have a hard time finding suitable motherboards, cases and power supplies as a normal customer. And at least from my limited experience with quad-socket, scaling is not always as good as you would expect.
Q: Why are the i7 processors even in this list? I heard they are less reliable than Xeon processors.
A: This is not true. They are the same processors with a few features deactivated; for example, they do not officially support ECC memory. On the other hand, they are unlocked and can use faster memory, which makes them an interesting alternative, especially for CFD.
Q: Isn't this about a year too late? These processors have been around for a while now.
A: Maybe, sorry about that. But they will remain your only option for at least a few more months.

Disclaimer: I take no responsibility for errors in the tables above. Please let me know if you find any.
Additional sources:
http://ark.intel.com/products/family...Family#@Server
http://ark.intel.com/products/family...ssors#@Desktop
https://www.microway.com/knowledge-c...ep-processors/
http://hexus.net/tech/news/cpu/91676...masked/?page=3

Blanco April 14, 2017 11:47

Hi,

thanks for the detailed analysis, it is really useful indeed!

I just have a question: I have been pondering the fact that Xeon processors are mounted on motherboards with 4 RAM channels per socket, and I think this plays an important role in determining the performance gain for particular CPU models.

Let's consider a 4-core CPU and a 3D CFD simulation as a reference; this is the "best" setup I can think of, since each core has its own memory channel. If I try to perform the same simulation with an 8-core CPU, then ideally I would expect to obtain the results in half the time (100% efficiency). As you mentioned, the real efficiency will be lower, also because each core now has to share its memory channel with another core (2 cores per memory channel). What happens if I try to use a 6-core CPU? I suppose things get worse because now I have an unbalanced system: 2 of the 6 cores have their own memory channel, while each pair of the remaining cores has to share a single memory channel. What do you think about scaling efficiency in this second case? I expect that the 2 "lucky" cores will wait for the other 4 if memory bandwidth is the limit, therefore the scaling efficiency will be much lower than expected.

I think this would significantly affect the scaling efficiency of any CPU whose core count is not divisible by 4, what do you think? Therefore we should somehow consider that the scaling efficiency of these CPUs is lower than what theory predicts.

Regards

flotus1 April 14, 2017 15:41

I think your concern is based on a misconception about how memory access works.
Quote:

2 of the 6 cores have their own memory channel, while each couple of the rest of the cores has to use a single memory channel
The CPU cores do not have their "own" memory channel, i.e. they do not handle memory access themselves. Instead, the CPU cores communicate with the integrated memory controller that in turn handles memory access via a queuing system. This additional layer ensures that memory bandwidth is distributed evenly among the cores. So having a number of cores that is not an even multiple of the number of memory channels is not really an issue.

If you want to learn more about the whole topic, this is a good place to start: http://frankdenneman.nl/2015/02/18/m...y-blog-series/

Blanco April 14, 2017 16:03

Thanks a lot for the explanation and the reference, that solves my doubts! Regards


MaryBau April 19, 2017 16:52

I am building a workstation to run CFD simulations with OpenFOAM and ~10 million cells. Because of my budget, I am limited to 8 or 10 cores. I was hesitating between the i7-6900K, E5-1660 v4, E5-2630 v4 and E5-2640 v4.

If I understand flotus1's analysis correctly, it seems that having 10 slower cores (E5-2x) is better in terms of performance and performance/$ than having 8 faster cores (i7 or E5-1x). Or is the comparison 20 slower cores vs. 8 faster cores?

Does it make sense to have one E5-2x on a single processor computer?

And a bit off topic, but will 64 GB (4x16GB) of RAM be enough for this type of simulations? Will it also be enough for pre/post-processing and visualization of simulations with ~50 million cells that I will run in another server?

Thanks,

Mary

flotus1 April 20, 2017 02:35

Quote:

If I understand flotus1 analysis correctly [...] Or is the comparison 20 slower cores vs. 8 faster cores?
This. Consequently, a single-socket system is usually the better choice for a small budget.
Quote:

Does it make sense to have one E5-2x on a single processor computer?
Unless for some reason you need a large number of cores for a workflow that is not memory-bound: no. You usually get more performance for your money with the single-socket CPUs. This is why I did not attempt to compare the dual-socket CPUs in single-CPU setups.
Quote:

And a bit off topic, but will 64 GB (4x16GB) of RAM be enough for this type of simulations?
That should be more than sufficient.
Quote:

Will it also be enough for pre/post-processing and visualization of simulations with ~50 million cells that I will run in another server?
It might be enough. Post-processing with ParaView should be no problem. I don't really know about the pre-processors for OpenFOAM. However, RAM is expensive these days. See if it works with 64GB. Otherwise you can still upgrade with 4 additional DIMMs.
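As a back-of-the-envelope check, a frequently quoted rule of thumb for OpenFOAM is on the order of 1 to 2 GB of RAM per million cells; the 1.5 GB default below is my assumption, and actual usage varies strongly with solver, turbulence model, and schemes:

```python
def ram_estimate_gb(million_cells, gb_per_million=1.5):
    """Very rough RAM estimate; gb_per_million is a rule-of-thumb value."""
    return million_cells * gb_per_million

print(ram_estimate_gb(10))  # 15.0 -> 64 GB leaves plenty of headroom
print(ram_estimate_gb(50))  # 75.0 -> more than 64 GB for a full 50M-cell case
```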

greenday May 9, 2017 05:01

Hi

I'm struggling to choose between the E5-2620 v3 (6 cores, 2.4 GHz) and the E5-2620 v4 (8 cores, 2.1 GHz).
They are almost the same price.
Would you give me some advice from a CFD point of view?

Thanks

flotus1 May 9, 2017 05:17

Pros and cons of the E5-2620 v4 compared to its predecessor:
+ more cores
+ supports faster memory
+ newer architecture (higher IPC compensates for the slightly lower clock speed)
+ more L3 cache
+ does not cost more
- can't think of any...

F1aerofan June 17, 2017 07:48

First of all, great overview flotus1!

I do have some questions though:
1. Within your explanation and the webpages discussing Amdahl's Law, there is no mention of the clock speed of the cores. Why is this?
2. As you say, these processors are easier to compare because of their similar architecture. But how would you go about quantifying performance between a E5-2637 v3 and a E5-2667 v4? And how would this performance be related to a FLUENT solver rating?

flotus1 June 17, 2017 08:35

Quote:

Originally Posted by F1aerofan (Post 653657)
1. Within your explanation or the webpages talking about Amdahl's Law, there is no talk about the clockspeed of the cores. Why is this?

Because it is implied: cores with higher clock speed are faster. Performance does not usually scale 100% with clock speed, especially when bandwidth limits kick in, but the performance estimates I gave do account for differences in clock speed.

Quote:

Originally Posted by F1aerofan (Post 653657)
2. As you say, these processors are easier to compare because of their similar architecture. But how would you go about quantifying performance between a E5-2637 v3 and a E5-2667 v4? And how would this performance be related to a FLUENT solver rating?

Did you really have to pick two CPUs with different amounts of cores :rolleyes:?
Well, I would be hesitant to estimate the differences quantitatively, but here is what I would consider: the IPC improved somewhere between 5% and 10% between the two generations. v3 CPUs only supported DDR4-2133, so another 10% penalty for the v3 CPU, at least for CFD workloads. Then I would look up the turbo clock speeds of both CPUs (E5-2637 v3: 3.6 GHz, E5-2667 v4: 3.5 GHz), so a slight bonus for Haswell-EP. And in the end one could feed these numbers, along with the core counts, into a more or less complicated model, for example Amdahl's law.
If you really need quantitative numbers that are accurate, better look at benchmarks.
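To make that concrete, here is a rough sketch combining the factors mentioned above; the IPC gain, memory penalty, and 96% scaling efficiency are illustrative assumptions, not measured values:

```python
def amdahl_speedup(n_cores, efficiency):
    """Amdahl's law: speedup on n_cores with parallel fraction 'efficiency'."""
    p = efficiency
    return 1.0 / ((1.0 - p) + p / n_cores)

def relative_estimate(clock_ghz, n_cores, ipc_factor, mem_factor,
                      efficiency=0.96):
    """Crude single-number performance estimate from clock speed, IPC,
    memory speed factor, and Amdahl scaling over the core count."""
    return clock_ghz * ipc_factor * mem_factor * amdahl_speedup(n_cores, efficiency)

# E5-2637 v3: 4 cores, 3.6 GHz turbo, baseline IPC, DDR4-2133 (~10% memory penalty)
v3 = relative_estimate(3.6, 4, ipc_factor=1.00, mem_factor=0.90)
# E5-2667 v4: 8 cores, 3.5 GHz turbo, ~7% higher IPC (assumed), DDR4-2400
v4 = relative_estimate(3.5, 8, ipc_factor=1.07, mem_factor=1.00)
ratio = v4 / v3  # treat this number with caution; benchmarks beat models
```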

Why do you ask? Are you considering buying used hardware for your new cluster?

F1aerofan June 17, 2017 09:47

Okay, thanks, I will give that a go then. :)

Quote:

Originally Posted by flotus1 (Post 653660)
If you really need quantitative numbers that are accurate, better look at benchmarks.

The trouble with the ANSYS Fluent benchmarks is that the most recent ones come from different OEMs, at different memory speeds, and only cover processors between 2.1 and 2.6 GHz with 32 or 36 cores per node.

Quote:

Originally Posted by flotus1 (Post 653660)
Why do you ask? Are you considering buying used hardware for your new cluster?

No, the v3 is in our current workstation; the v4 is being considered (other thread) for the new cluster. However, I am asked to make more detailed estimates of the performance gain for the proposed new setups. ;) But more on that maybe later.

