
Some ANSYS Benchmarks: 3 node i7 vs Dual and Quad Xeon

Old   January 13, 2015, 11:42
Default Some ANSYS Benchmarks: 3 node i7 vs Dual and Quad Xeon
  #1
Senior Member
 
Erik
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 1,167
Rep Power: 23
evcelica is on a distinguished road
Here are the results of some CFX benchmarks I have been doing and collecting for a while:


Model:
Geometry: 1m x 1m x 5m long duct
Mesh: 100 x 100 x 500 "cubes" all 1x1x1cm (5M cells)
Flow: Default water enters at 10 m/s and 300 K, exits the other side at 0 Pa. Walls are at 400 K.
High Resolution turbulence and advection schemes
Everything else default.
Double Precision: ON
20 iterations (you must reduce the convergence criteria or it will converge in fewer iterations).

The i7's are 3930K/4930K @ 4.2GHz, each with 64GB of 2133 MHz RAM. Connected with 20Gbps Infiniband.
The dual Xeons have 128 GB RAM, the quad Xeon has 256 GB, all 1600 MHz with memory channels balanced properly.
(Performance was atrocious before they were balanced; the quad-CPU Xeon was performing at only 46% of the speed it achieves now with balanced memory.)

I am comparing "CFD solver wall clock" times, not "Total wall clock times".
I added Acasas' results to the plot, thanks for sharing! I'll gladly add anyone else's results to the plot as well if they feel like running the benchmark.

CFX Benchmark.jpg

Last edited by evcelica; January 15, 2015 at 12:53.

Old   January 13, 2015, 12:07
Default
  #2
Member
 
acasas's Avatar
 
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 12
acasas is on a distinguished road
Thanks for this info.

Old   January 13, 2015, 15:37
Default
  #3
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Greetings to all!

@Erik: Many thanks for this data!

I see that you've restrained the core usage on the E5-4617 to only 4 cores per socket, when you had 6 cores available. I guess the performance wasn't worth registering?
And I'm assuming they're all first-generation Xeon E5 models.


A note before continuing: for the 4x E5-4617 configuration running 8 cores total, the performance seems consistent with using only 2 cores per socket, so each core had more memory bandwidth available and a higher maximum CPU frequency.


Let's see if I can do some mathematical estimations based only on the specs at ark.intel.com and then compare with the results you've gotten, taking into account only using 4 cores on the E5-4617:
  • 4x E5-4617:
    • Theoretical total CPU GHz: 4 * 4 * 3.15 = 50.4 GHz
      • Assumed that since it's a native 6-core part with a thermally capped frequency range of 2.9 to 3.4 GHz, it should be able to run 4 cores at at least 3.15 GHz.
    • Total memory bandwidth available per core: 51.2 / 4 = 12.8 GB/s
    • 15MB cache per socket, i.e. roughly 3.75 MB per core.
    • Lithography: 32nm
  • 2x E5-2643:
    • Theoretical total CPU GHz: 2 * 4 * 3.4 GHz = 27.2 GHz
    • Total memory bandwidth available per core: 51.2 / 4 = 12.8 GB/s
    • 10MB cache per socket, i.e. roughly 2.5 MB per core.
    • Lithography: 32nm
  • 2x E5-2680 - 8 cores total:
    • Theoretical total CPU GHz: 2 * 4 * 3.4 GHz = 27.2 GHz
    • Total memory bandwidth available per core: 51.2 / 4 = 12.8 GB/s
    • 20MB cache per socket, i.e. roughly 5 MB per core.
    • Lithography: 32nm
  • 2x E5-2680 - 16 cores total:
    • Theoretical total CPU GHz: 2 * 8 * 3.1 GHz = 49.6 GHz
    • Total memory bandwidth available per core: 51.2 / 8 = 6.4 GB/s
    • 20MB cache per socket, i.e. roughly 2.5 MB per core.
    • Lithography: 32nm
  • 3x i7-4930K:
    • Theoretical total CPU GHz: 3 * 4 * 4.2 GHz = 50.4 GHz
    • Total memory bandwidth available per core: 59.7 * (2133/1866) / 4 ~= 17 GB/s
    • 12MB cache per socket, i.e. roughly 3 MB per core.
    • Lithography: 22 nm
      • Theoretical scale up in performance to 32nm: 50.4 * 32 / 22 = 73.31 GHz
    • Infiniband: 20 Gbps / 8 (bit/byte) = 2.5 GB/s
  • 3x i7-4930K (assuming borderline OC (bOC)):
    • Theoretical total CPU GHz: 3 * 4 * 3.9 GHz = 46.8 GHz
    • Total memory bandwidth available per core: 59.7 / 4 ~= 14.925 GB/s
    • 12MB cache per socket, i.e. roughly 3 MB per core.
    • Lithography: 22 nm
So, if we sort only by theoretical CPU performance (model, GHz, GHz/ref):
  1. 3x i7-4930K (litho-boost): 73.31 GHz -> 2.70
  2. 3x i7-4930K: 50.4 GHz -> 1.85
  3. (3x i7-4930K bOC: 46.8 GHz -> 1.72)
  4. 4x E5-4617: 50.4 GHz -> 1.85
  5. 2x E5-2680 (16 cores): 49.6 GHz -> 1.82
  6. 2x E5-2680 (8 cores): 27.2 GHz -> 1.0
  7. 2x E5-2643: 27.2 GHz -> 1.0 (reference)
Now, if we take only into account memory performance (model, Bandwidth/core, cache/core, hypothesis/ref):
  1. 3x i7-4930K: 17 GB/s, 3 MB -> 1.33
  2. (3x i7-4930K bOC: 14.925 GB/s, 3 MB -> 1.17)
  3. 4x E5-4617: 12.8 GB/s, 3.75 MB -> 1.05
  4. 2x E5-2680 (16 cores): 6.4 GB/s, 2.5 MB -> 0.75
  5. 2x E5-2680 (8 cores): 12.8 GB/s, 5 MB -> 1.10
  6. 2x E5-2643: 12.8 GB/s, 2.5 MB -> 1.0 (reference)
The "hypothesis/ref" (hpr) factor is a rough mental estimate. These calculations still need work.

Now comes the really hard part, factoring in both details:
  1. 3x i7-4930K (litho-boost): 2.70 GHz/ref * 1.33 hpr -> ~3.6
  2. 3x i7-4930K: 1.85 GHz/ref * 1.33 hpr -> 2.46
  3. (3x i7-4930K bOC: 1.72 GHz/ref * 1.17 hpr -> 2.01)
  4. 4x E5-4617: 1.85 GHz/ref * 1.05 hpr -> 1.94
  5. 2x E5-2680 (16 cores): 1.82 GHz/ref * 0.75 hpr -> 1.365
  6. 2x E5-2680 (8 cores): 1.0 GHz/ref * 1.10 hpr -> 1.10
  7. 2x E5-2643: 1.0 (reference)
Now if we compare to the actual results you got (model, solver_time_ref/solver_time):
  1. 3x i7-4930K: 2.11
  2. 4x E5-4617: 1.92
  3. 2x E5-2680 (16 cores): 1.30
  4. 2x E5-2680 (8 cores): 1.11
  5. 2x E5-2643: 1.0 (reference)
Wow, these were some pretty nice estimates I did this time.
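
Just to make the recipe explicit, here is a rough Python sketch of the estimate above; the GHz figures are the assumed values from ark.intel.com quoted in the lists, and the "hpr" memory factors are the same rough mental guesses, not measurements:

Code:
# Rough sketch of the estimate above: a CPU-frequency ratio multiplied by a
# hand-waved memory factor ("hpr"), compared against the measured speed-ups.
# All numbers are the assumptions from the lists above, not new data.

configs = {
    # name: (sockets, cores used per socket, assumed GHz per core, hpr factor)
    "2x E5-2643":            (2, 4, 3.4,  1.00),  # reference
    "2x E5-2680 (8 cores)":  (2, 4, 3.4,  1.10),
    "2x E5-2680 (16 cores)": (2, 8, 3.1,  0.75),
    "4x E5-4617":            (4, 4, 3.15, 1.05),
    "3x i7-4930K":           (3, 4, 4.2,  1.33),
}

measured = {  # solver_time_ref / solver_time, from the first post
    "2x E5-2643": 1.00,
    "2x E5-2680 (8 cores)": 1.11,
    "2x E5-2680 (16 cores)": 1.30,
    "4x E5-4617": 1.92,
    "3x i7-4930K": 2.11,
}

ref_ghz = 2 * 4 * 3.4  # total GHz of the 2x E5-2643 reference

for name, (sockets, cores, ghz, hpr) in configs.items():
    estimate = (sockets * cores * ghz / ref_ghz) * hpr
    print(f"{name:24s} estimated {estimate:4.2f}   measured {measured[name]:4.2f}")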

The cluster is hard to estimate, because of the performance drop related to using an Infiniband interconnect... uhm, actually, the scale-up is pretty much linear with the Infiniband interconnect: the three-i7 cluster is 2.97 times faster than one i7. Then the problem is related to the 2 layers of overclocking, which don't provide a proper scale-up estimate.
Let me review the mathematics assuming OC at max stock performance... it's the bOC entries: OK, it looks like the overclocking only helps marginally to get additional performance, which is usually expected from overclocking for HPC.

Side note: the "lithography boost" is something I've seen many times, namely that the same CPU design can gain performance roughly in inverse proportion to the lithography reduction.


Erik, do you happen to have at least one run of the cluster (or just one i7) without the overclock, for a similar comparison? In other words, how much did each machine actually gain with the dual layer OC?

And by the way, how much is each solution spending in electricity for each respective test?

Best regards,
Bruno

Old   January 15, 2015, 05:01
Default
  #4
Member
 
acasas's Avatar
 
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 12
acasas is on a distinguished road
Hi, this is what I've got on those computers:

On the i7-3820 @ 3.6 GHz with DDR3 SDRAM PC3-12800 @ 800 MHz, with 4 real cores and 8 threads, and with affinity fully set, it took 1598 sec wall time.
On the dual Xeon E5-2650 v3, 20 real cores, no hyper-threading, overclocking on, with DDR4-2133 RAM (1066 MHz), it took 533 sec wall time.

The full thread is here:

http://www.cfd-online.com/Forums/har...tml#post527593

Old   February 5, 2015, 11:22
Default
  #5
Sly
New Member
 
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 11
Sly is on a distinguished road
Thank you guys, this is great information.
I do have a question regarding this approach. It seems that the general consensus is that the limiting factor for CFD computers is their memory bandwidth. Yet, the theoretical CPU GHz is always taken into account when estimating a given system's performance. If the memory bandwidth is maxed out, the CPU is basically idling for the better part of the time. Bruno is stating something to that effect:

Quote:
Originally Posted by wyldckat View Post
Let me review the mathematics assuming OC at max stock performance... it's the bOC entries: OK, it looks like the overclocking only helps marginally to get additional performance, which is usually expected from overclocking for HPC.
Why would overclocking yield marginal results but baseline frequency be an important factor in the estimate? I get that the overclocking of the i7-4930K is just an 8% frequency increase, but if we look at the E5-2680, it is a 30% increase (2.7 GHz baseline, 3.5 GHz OC). I know that these sorts of calculations are trying to simply evaluate a rather complex system, but would you be able to provide more information on this?

Old   February 7, 2015, 04:30
Default
  #6
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Greetings Sylvain,

Quote:
Originally Posted by Sly View Post
It seems that the general consensus is that the limiting factor for CFD computers is their memory bandwidth. Yet, the theoretical CPU GHz is always taken into account when estimating a given system's performance. If the memory bandwidth is maxed out, the CPU is basically idling for the better part of the time.
There are several analogies we could use here, but this is the one that comes to mind:
  1. Take for example a person with an IQ of 300, who can do mathematical calculations with his head much faster than anyone else.
  2. But that same person is not able to read very fast, for whatever reason... perhaps because the person forgot to put on glasses, so everything looks blurry.
  3. Therefore, even if the person is able to calculate the square root of 34786538765978346295 in less than 1 nanosecond, it will take 2 to 15 seconds to actually read the number without glasses.
Quote:
Originally Posted by Sly View Post
Why would overclocking yield marginal results but baseline frequency be an important factor in the estimate? I get that the overclocking of the i7-4930K is just an 8% frequency increase, but if we look at the E5-2680, it is a 30% increase (2.7 GHz baseline, 3.5 GHz OC).
The first detail here is that even if overclocking does increase the CPU frequency considerably, this usually only affects how fast it can process code. In addition, from some of the experience I've had with overclocking, it has felt that the frequency is not exactly stable, since the CPU is forced to run way above the nominal range, which can lead to some minor hiccups in processing code. Furthermore, I'm not absolutely certain that all features in the CPU will be overclocked at the same scale; it might be restraining itself as a safety measure.

On the other hand, the turbo feature in the E5-2680 and other similar processors is not an overclocking feature; it is indicated as a stable frequency at which the CPU can run with all features. Essentially, the stock vs turbo range gives us an idea that it's able to operate properly within this range.
In addition, how this turbo feature shifts into gear depends on the model of the CPU itself; for example, from what I've seen:
  • The v1 and v2 Xeon E5 processors will set the turbo to half the possible range (the average value of the range), when using all cores and with HyperThreading off.
  • Nonetheless, the latest v3 models have been documented as shifting to lower frequencies, because they now have more powerful AVX processing units; therefore, the frequency is lower because it generates a lot more heat than when using the non-AVX part of the CPU.

Mmm... another analogy comes to mind, regarding memory bandwidth and the number of memory channels... imagine this:
  1. We have 10 supercars that can run at a maximum speed of 300 km/h.
  2. Each car is carrying 100 books.
  3. Nonetheless, the road only has 2 lanes.
  4. This means that each lane is only able to deliver 5 cars in a single burst.
  5. At 4 meters of length per car, this means a 20-meter column of cars per lane, carrying 100*5 = 500 books per lane.
  6. But at this rate, at 300*1000/3600 = 83.3(3) m/s, it will take roughly 20/83.3 = 0.24 s to deliver the 500 books per lane, 1000 books total.
  7. The optimum would be to take only 4/83.3 = 0.048 s to deliver all 1000 books, by having 10 lanes.
  8. What happens if we keep 2 lanes and up the speed to 400 km/h? It will take 0.18 s to deliver the same 1000 books.
The same is applicable for the CPU frequency, and the result would be that the timings would be added:
  • Each car has 1 book scanner which can handle 1 book per millisecond.
  • Each car would take 0.100 s to scan the 100 books.
  • The grand total would be:
    • 300 km/h: 0.1 s + 0.24 s = 0.34 s
    • 400 km/h: 0.1 s + 0.18 s = 0.28 s
Even if you overclock the scanners to handle 120 books in 0.1 s, the performance increase won't be all that big.
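
If it helps, the whole analogy fits in a few lines of Python (just a toy model of the two terms above, transfer time plus compute time, using the same numbers):

Code:
# Toy model of the analogy: total time = transfer time + scan (compute) time.
cars, books_per_car, car_length_m, lanes = 10, 100, 4.0, 2
scan_rate = 1000  # books per second per car (1 book per millisecond)

def total_time(road_speed_kmh):
    speed_m_s = road_speed_kmh * 1000 / 3600               # km/h -> m/s
    transfer = (cars / lanes) * car_length_m / speed_m_s   # 20 m column per lane
    compute = books_per_car / scan_rate                    # 0.1 s of scanning per car
    return transfer + compute

print(total_time(300))  # ~0.34 s
print(total_time(400))  # ~0.28 s, faster "road" barely moves the total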

Oh, and what about the CPU's cache... an analogy could be the scanner's book preloader: it can scan one book while already holding the next one, which has already been retrieved from the trunk of the car.

Best regards,
Bruno

PS: Sorry, apparently I woke up in a creative writing mood.

--------------

edit: Have a look at this post as well: http://www.cfd-online.com/Forums/har...tml#post366672 post#17

Last edited by wyldckat; February 7, 2015 at 05:25. Reason: see "edit:"

Old   February 9, 2015, 17:42
Default
  #7
Sly
New Member
 
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 11
Sly is on a distinguished road
Thank you Bruno for your creative writing indeed.

If I understood correctly, you agree with my first statement with your analogy of the smartest person ever that nonetheless can’t put on glasses. CPU power has little impact when the memory bandwidth is maxed out. For your second analogy about memory bandwidth, what you’re saying is maximise the number of memory channels (road lanes) and maximise the memory frequency (road speed). What I don’t get is when you use the same analogy but for CPUs. You seem to say that all CPUs are the same with 0.1s. So, when assessing the hardware requirements for a new system, why is the baseline CPU frequency or achievable boost frequency taken into account? To support what I’m saying, I would like to point out the data provided on the first post of this thread.

The first thing that I noticed is the differences between the i7-4930K in 1/2/3 nodes configuration and 3 nodes configuration. Looking at the results for 3 and 4 cores, we see that they’re pretty much the same. This would suggest that the 4th core on the 1 node configuration is underused or a nuisance to the other cores. Same thing happens with 6 and 8 cores. There is only a 5% performance increase per added core when it should be around 17% assuming good scalability. And for 9 and 12 cores, the increase is 5% per core when it should be around 11%.

The other thing I noticed is that the performance difference between the i7-4930K (3 node configuration) and the E5-4617 is matching the memory frequency increase by 5%.
Memory frequency increase:
2133/1600 = 1.33
Performance increase 4 cores:
1.013/0.798 = 1.27
If we look at the data with 8 cores, the results are slightly different.
Performance increase 8 cores:
2.034/1.767 = 1.15
The performance increase is not as much as we could expect but the scalability of the E5-4617 between 4 and 8 cores is greater than 1.
(1.767/0.798)/(8/4) = 1.11
Here it would be nice to know if the core distribution was 4+0+0+0 or 1+1+1+1 for 4 cores (correspondingly 4+4+0+0 or 2+2+2+2 for 8 cores). This scalability beyond 1 would imply that either something was restraining the 4-core distribution or that the 8-core distribution has something more to work with. It could be a motherboard feature like NUMA, but this is getting beyond my knowledge. I think that the 1+1+1+1 distribution (and 2+2+2+2) was not used, because all the board's memory bandwidth would have been available from the start, along with all the motherboard features. That way, the best scalability that could have been achieved would have been 1.

So, this "the whole is better than the sum of its parts" hypothesis could explain why there is a smaller than expected performance increase from the 8-core E5-4617 to the 8-core i7-4930K. The reference, the E5-4617, has a feature that the i7-4930K doesn't have. Hence, the 15% performance increase against the expected 27% from the memory speed-up.

Now, for the E5-4617, something obviously happened between 8 and 12 cores. I cannot explain any of it, especially since the scalability between 12 and 16 cores is 1. A system that is memory bandwidth limited should behave like the E5-2680: the scalability is 0.73 between 8 and 12 cores and 0.79 between 12 and 16 cores. Can someone provide an explanation for this?
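
(For reference, here are the ratios above spelled out in a few lines of Python; the speed-up figures are the ones read off the plot in the first post.)

Code:
# Ratios quoted above; speed-up values are read from the plot in post #1.
i7_4c, i7_8c = 1.013, 2.034          # i7-4930K (3-node config), 4 and 8 cores
e5_4c, e5_8c = 0.798, 1.767          # 4x E5-4617, 4 and 8 cores

print(2133 / 1600)                   # ~1.33, memory frequency increase
print(i7_4c / e5_4c)                 # ~1.27, 4-core performance gap
print(i7_8c / e5_8c)                 # ~1.15, 8-core performance gap
print((e5_8c / e5_4c) / (8 / 4))     # ~1.11, E5-4617 scaling from 4 to 8 cores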

Bruno, I looked at the link you provided and here's what I realised. Please tell me if this is the proper answer to my initial question about the CPU frequency being taken into account in a system's performance estimate. A system's scalability will be very close to one if the memory bandwidth is not fully used. Once the memory bandwidth is fully used, the scalability will fall, but a system will still benefit significantly from a higher total CPU frequency. Hence, the total CPU frequency available in a given system should be a secondary criterion when choosing a system.

Old   February 10, 2015, 15:40
Default
  #8
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Quote:
Originally Posted by Sly View Post
What I don’t get is when you use the same analogy but for CPUs. You seem to say that all CPUs are the same with 0.1s. So, when assessing the hardware requirements for a new system, why is the baseline CPU frequency or achievable boost frequency taken into account?
Sorry, I ran out of steam that day and didn't expand on the thought process. The idea felt so clear to me that I had hoped it was easy to understand. Let me check the rest of your post first, to see if I don't repeat myself.

Quote:
Originally Posted by Sly View Post
To support what I’m saying, I would like to point out the data provided on the first post of this thread.

The first thing that I noticed is the differences between the i7-4930K in 1/2/3 nodes configuration and 3 nodes configuration. Looking at the results for 3 and 4 cores, we see that they’re pretty much the same.
Wow! I didn't notice the data with the 4 node configuration with 3 cores per node!! I gotta crunch some numbers...

Quote:
Originally Posted by Sly View Post
This would suggest that the 4th core on the 1 node configuration is underused or a nuisance to the other cores. Same thing happens with 6 and 8 cores. There is only a 5% performance increase per added core when it should be around 17% assuming good scalability. And for 9 and 12 cores, the increase is 5% per core when it should be around 11%.
OK, the idea is this:
  • These specific i7-4930K are overclocked to 4.2 GHz.
  • Each i7-4930K has 4 memory channels.
  • Each i7-4930K has 4 cores.
I think that Erik did not write down the correct values for the last entry of the 4 nodes configuration. The values are identical and the designation seems corrupted, because it's a 4 node configuration and not 3 nodes. Erik probably copy-pasted the last entry from the 3 node configuration and then forgot to update it. Therefore, we'll have to assume that the last entry for the 4 nodes is corrupted and cannot be used for the calculations.

Therefore, assuming it's 3 nodes with 3 cores each... mmm, I think the problem here is that the bottleneck is actually the cache per socket. It's 3 MB vs 4 MB per core. This means that by having 1 more MB per core, it reduces the number of times each core has to go and fetch another blob of data from RAM. This is usually known as "cache misses" and has a very big impact on performance; here's a very good explanation on the topic: http://stackoverflow.com/a/16699282 - which initially does lead to the same conclusion you're getting, but the writer also indicates that it's not that simple.
And having more memory access bandwidth available per core also helps.

This is actually not the first time I've seen this... but I'm not able to find the blog post I'm thinking about. The idea was that a 2-socket machine with 16 cores total (8+8) was slower than a 4-socket machine also with 16 cores (4+4+4+4), even though the total speed was almost the same.

Quote:
Originally Posted by Sly View Post
The other thing I noticed is that the performance difference between the i7-4930K (3 node configuration) and the E5-4617 is matching the memory frequency increase by 5%.
Memory frequency increase:
2133/1600 = 1.33
Performance increase 4 cores:
1.013/0.798 = 1.27
If we look at the data with 8 cores, the results are slightly different.
Performance increase 8 cores:
2.034/1.767 = 1.15
The performance increase is not as much as we could expect but the scalability of the E5-4617 between 4 and 8 cores is greater than 1.
(1.767/0.798)/(8/4) = 1.11
Here it would be nice to know if the core distribution was 4+0+0+0 or 1+1+1+1 for 4 cores (incidentally 4+4+0+0 or 2+2+2+2 for 8 cores). This scalability beyond 1 would imply that either something was restraining the 4 core distribution or that the 8 core distribution has something more to work with. It could be a motherboard feature like NUMA but this is getting beyond my knowledge. I think that the 1+1+1+1 distribution (and 2+2+2+2) was not used because all the board memory bandwidth would have been available from the start and also all the motherboard features. That way, the best scalability that could have been achieved would have been 1. So, this “the whole is better than the sum of its parts” hypothesis could explain why there is smaller than expected performance increase from the 8 core E5-4617 and the 8 core i7-4930K. The reference, the E5-4617, has a feature that the i7-4930K doesn’t have. Hence, the 15% performance increase against the expected 27% of the memory speed-up. Now, for the E5-4617, something obviously happened between 8 and 12 cores. I cannot explain any of it especially since the scalability between 12 and 16 cores is 1. A system that is memory bandwidth limited should behave like the E5-2680. The scalability is 0.73 between 8 and 12 cores and 0.79 between 12 and 16 cores. Can someone provide an explanation for this?
I'm a bit too sleepy right now, so I'm not sure if I managed to interpret it all correctly. But I think that the explanation to all of the details on this quote-block is simple: shared cache is the bottleneck here most of the time.
And very likely the configurations on the E5-4617 are populated per socket, which would explain the boost in performance, since it reduces the cache misses in half.

Quote:
Originally Posted by Sly View Post
Bruno, I looked at the link you provided and here's what I realised. Please tell me if this is the proper answer to my initial question about the CPU frequency being taken into account in a system's performance estimate. A system's scalability will be very close to one if the memory bandwidth is not fully used. Once the memory bandwidth is fully used, the scalability will fall, but a system will still benefit significantly from a higher total CPU frequency. Hence, the total CPU frequency available in a given system should be a secondary criterion when choosing a system.
Technically, it depends on the complexity of the CFD calculations at hand. If the CFD solvers were reusing the same memory blob more than once per cycle, the CPU frequency would matter more, since they would be using data still in the CPU's cache. Curiously, this is one of those details that will heavily restrain the success of a code implemented for a GPGPU configuration.

As for your question and based on all of this, the order of prevalence for CFD should be something like this:
  1. The more memory bandwidth per CPU, the better.
  2. The more cache per CPU, the better.
  3. The faster each core can run, the better.
  4. Core count is secondary... unless there are tons of cores per socket.
But then there is the money vs performance issue. I forgot to mention this explicitly on this thread, but here's a quote from the other one where I wrote about this:
Quote:
Originally Posted by wyldckat View Post
I guess it's quicker to give the link I'm thinking of: http://www.anandtech.com/show/8423/i...l-ep-cores-/19
there you might find that a system with 12 cores @ 2.5GHz that costs roughly 1000 USD gives a better bang-for-your-buck than 8 cores @ 3.9 GHz that cost 2000 USD (not sure of the exact values). But the 8 core system gives the optimum performance of RAM bandwidth and core efficiency, but the 12 core system costs a lot less and spends a lot less in electrical power consumption, while running only at 76% CPU compute performance of the 8 core system.
The link provided actually demonstrates a curious detail that isn't as easy to see in the data shown here, but that the E5-2680 sort of hints at: the more cores, the smaller the chunk of data processed per blob, but it can also help streamline the accesses to RAM. This is to say that the 18-core E5-2699 v3 on that link demonstrates the exact opposite of what you were trying to conclude.

In addition, a few years ago I ran a simple large test case with OpenFOAM, where a single-socket 6-core AMD 1055T wouldn't scale much beyond a factor of 3 vs a single-core run; the main bottleneck is that this CPU has only 2 memory channels. But the crazy detail was that when I over-scheduled 16 processes on only 6 cores, the wall clock runtime was smaller than when using only 6 processes. The explanation for this, which should apply to similar situations, is that by aligning memory accesses we can get a bit more performance.

Old   February 10, 2015, 18:17
Default
  #9
Sly
New Member
 
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 11
Sly is on a distinguished road
Quote:
Originally Posted by wyldckat View Post
Wow! I didn't notice the data with the 4 node configuration with 3 cores per node!! I gotta crunch some numbers...

OK, the idea is this:
  • These specific i7-4930K are overclocked to 4.2 GHz.
  • Each i7-4930K has 4 memory channels.
  • Each i7-4930K has 4 cores.
I think that Erik did not write down the correct values for the last entry of the 4 nodes configuration. The values are identical and the designation seems corrupted, because it's a 4 node configuration and not 3 nodes. Erik probably copy-pasted the last entry from the 3 node configuration and then forgot to update it. Therefore, we'll have to assume that the last entry for the 4 nodes is corrupted and cannot be used for the calculations.
I don't think that there are 4 nodes. I'm pretty sure the configurations were as advertised.

Core distribution of the 1/2/3 nodes configuration:
Point 1: 4+0+0 4 cores total
Point 2: 4+4+0 8 cores total
Point 3: 4+4+4 12 cores total
3 points on the curve

Core distribution of the 3 node configuration:
Point 1: 1+1+1 3 cores total
Point 2: 2+2+2 6 cores total
Point 3: 3+3+3 9 cores total
Point 4: 4+4+4 12 cores total
4 points on the curve

Quote:
Originally Posted by wyldckat View Post
As for your question and based on all of this, the order of prevalence for CFD should be something like this:
  1. The more memory bandwidth per CPU, the better.
  2. The more cache per CPU, the better.
  3. The faster each core can run, the better.
  4. Core count is secondary... unless there are tons of cores per socket.
Thank you for all the information you provided. I’ll take this into account for the build I want to do.

Old   February 11, 2015, 15:03
Default
  #10
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
wyldckat is a name known to all
Quote:
Originally Posted by Sly View Post
Core distribution of the 3 node configuration:
Point 1: 1+1+1 3 cores total
Point 2: 2+2+2 6 cores total
Point 3: 3+3+3 9 cores total
Point 4: 4+4+4 12 cores total
4 points on the curve
Good point! Yes, it does look to be something like that!

Old   July 8, 2016, 10:28
Default Request for more explanation about memory bandwidth
  #11
New Member
 
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 9
digitalmg is on a distinguished road
Dear all,
I'm a little confused about the definition of memory bandwidth. Please let me know which of these statements is incorrect.
1- For CFD, max memory bandwidth (e.g. 76.8 GB/s) divided by the CPU's number of cores is the main performance figure.

2- For the Intel Xeon E5-2698 v4, which has 20 cores and a maximum of 4 memory channels (when 4 RAM modules are installed in the correct slots), 76.8 GB/s / 20 = 3.84 GB/s per core would be the result.

3- Most DDR4 memory modules on the market have at least about 10 GB/s read/write speed, which means 10 GB/s - 3.84 GB/s = 6.16 GB/s of the RAM modules' bandwidth is wasted in this configuration.

4- The four memory channels do not each have an independent 76.8 GB/s of bandwidth; I mean that 76.8 GB/s is the maximum possible bandwidth the CPU could ever have in the best configuration, so 4 channels x 76.8 GB/s = 307.2 GB/s is incorrect.

5- If all of the above is correct, then the highest available Xeon E5 memory bandwidth is 76.8 GB/s. So, considering an X99-chipset motherboard which allows DDR4-3200 and the memory module with the fastest available write bandwidth, like the Corsair CMK16GX4M4B3200C15 4GB (13,204 MB/s write speed), we come to this conclusion: 76.8 GB/s / 13.2 GB/s = 5.82, which means CPUs with more than 6 cores are not suitable for CFD because memory module bandwidth is wasted. Also, a lower number of CPU cores would become the bottleneck for using the memory modules' bandwidth, because such a RAM module cannot deliver more than 13.2 GB/s and the CPU would be idle for a fraction of the time.

6- A larger CPU cache may compensate a little for the above-mentioned idle time caused by the memory bandwidth bottleneck, but I don't know by how much, or how to calculate it.

7- The E5-2643 v4 would be better than the E5-2687W v4 for CFD, although the latter has a higher clock speed and price.

So what would the E5-2699 v4, with its 22 cores, be good for?

Thanks for taking the time to read my notes.

Old   July 11, 2016, 13:18
Default
  #12
Senior Member
 
Robert
Join Date: Jun 2010
Posts: 117
Rep Power: 16
RobertB is on a distinguished road
@wyldckat, instead of the fancy theoretical calculations, why not just use the published specfp_rate measurements?

From my experience, they provide a pretty good estimate of CCM+ performance on a chip.

Old   July 24, 2016, 11:51
Default
  #13
Sly
New Member
 
Sylvain Boulanger
Join Date: Nov 2014
Posts: 17
Rep Power: 11
Sly is on a distinguished road
M-G,
The main problem with what you wrote is that you compare CPU memory bandwidth against RAM memory bandwidth, when in fact they are one and the same thing. Memory bandwidth is dictated by the number of memory channels, the DIMM frequency and the motherboard. The advertised memory bandwidth for a given CPU is based on the DIMM frequency that is guaranteed to be stable by the manufacturer. Here's how it is calculated:

Memory bandwidth = DIMM frequency x 8 bytes (64 bits) x # of channels

So in the case of the Xeon E5-2698 v4 we have:

Memory bandwidth = 2400 MHz x 8 bytes x 4 channels = 76.8 GB/s, as advertised on Intel ARK

Now, if you were to use that CPU with the X99 setup and were able to run it at 3200 MHz, the memory bandwidth would be 102.4 GB/s. This is all theoretical performance based on the hardware characteristics. As for your optimization problem regarding the number of cores and memory bandwidth, it does not exist as soon as the case complexity reaches a certain level, which happens very fast in CFD. By that, I mean that for CFD analysis the amount of data produced by the CPU will be larger than what the memory bandwidth can handle for "relatively normal" case complexity, hence the memory bottleneck.
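
To spell the formula out (a minimal Python sketch; the DDR4-2400 and DDR4-3200 figures are the ones already mentioned above):

Code:
# Peak memory bandwidth = transfer rate (MT/s) * 8 bytes per transfer * channels.
def mem_bandwidth_gb_s(transfers_per_s_millions, channels):
    return transfers_per_s_millions * 1e6 * 8 * channels / 1e9

print(mem_bandwidth_gb_s(2400, 4))       # 76.8 GB/s, the E5-2698 v4 spec figure
print(mem_bandwidth_gb_s(3200, 4))       # 102.4 GB/s with DDR4-3200 on X99
print(mem_bandwidth_gb_s(2400, 4) / 20)  # ~3.84 GB/s per core across 20 cores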

Old   September 2, 2016, 12:00
Default
  #14
New Member
 
M-G
Join Date: Apr 2016
Posts: 28
Rep Power: 9
digitalmg is on a distinguished road
Dear Sylvain,
So you mean specfp_rate 2006 results are not applicable to CFD cases?
I see that 4 of the 17 tests in CFP2006 are CFD-related.
Would you please explain more?

Old   December 15, 2016, 06:57
Default
  #15
New Member
 
Join Date: Jan 2015
Posts: 29
Rep Power: 11
evan247 is on a distinguished road
Quote:
Originally Posted by evcelica View Post

The i7's are 3930K/4930K @ 4.2GHz, each with 64GB of 2133 MHz RAM. Connected with 20Gbps Infiniband.

Attachment 36568
Hi evcelica, may I ask what sort of Infiniband configuration you used to connect the 3 nodes? I'm asking because I've got 2 identical Xeon workstations, and wonder if I could connect them via infiniband. My impression is that you need the server type of motherboard to do so. Thank you.
