Discussions about memory bandwidth and scaling

#1  alpha754293  January 1, 2019, 21:11
Quote:
Originally Posted by Simbelmynė View Post
Although there are many many threads about this in the forum (you could easily just have read one or two of the threads on the main page of the hardware forum...) I will give you some general suggestions.


If you are going to do CFD simulations then you should opt for the following (in this particular order)


1. Memory bandwidth
2. Memory bandwidth
3. Memory bandwidth


This means that you wish to run a system with A) many memory channels and B) High memory frequency.



In my country a 1900X Threadripper costs about the same as a 9600K. The threadripper has 4 memory channels whereas the Skylake refresh refresh refresh refresh has 2 memory channels. However, the total cost will be higher due to the expensive TR motherboards and the (likely) added amount of memory.


You should also opt for dual rank (2R) memory if possible. ASUS has some motherboards that support DDR4 speeds up to 3600 MHz in dual rank (or 4000+ MHz single rank, which probably gives you the same performance). Check the qualified vendor memory list before you purchase anything.



Finally, if you can accept buying used systems, then you should look at the Xeon 26xx v2 processor family. A dual CPU 2690v2 system gives you 8 memory channels (filled with cheap DDR3 memory instead of really expensive DDR4).
I'm sorry, but I am going to have to disagree with the priorities listed here.

Unlike sparse/direct solvers for finite element analysis, CFD is actually not as memory intensive, in the sense that it doesn't load up the RAM nearly as much as a sparse/direct FEA problem/solution does.

Therefore, given that, memory bandwidth ISN'T the top priority in terms of running CFD.

If you're going to be running CFD, the most significant limiting factor will be how many cores you have and how fast (in floating point operations per second, or FLOPS) they can perform.

That's the true limit of CFD.

The ONLY time that memory comes into the picture is when you are solving for hundreds of millions of cells. Then sure. But if you're just running on a desktop or even a prosumer computer/workstation, you're only going to be able to run as big a problem as your RAM allows, and most problems are computationally limited rather than memory limited (unless you only have something like < 4 GB of RAM; but that's also because the OS itself has gotten significantly bigger, leaving less RAM available for the user).

I can run up to 10 million cells with less than 32 GB of RAM and still have the run take up to 42 days to complete because it's a transient, DES model on a 6-core CPU.

The same run with 32-cores finishes in just short of 10 days.

Memory bandwidth has almost nothing to do with it.

#2  alpha754293  January 1, 2019, 21:13
Quote:
Originally Posted by Nikolas View Post
So I'm starting to enter the realm of workstation apps, and most of all I would like to have a PC that can handle and work with CFD, among other workstation apps. So I have here a list that I would like to build, and your thoughts on it would be much appreciated. I'm trying to have a build that is future proof and versatile as well, in the sense that it can handle other programs.

i5-9600K or i5-8600K
MSI X470
Kingston SSDNow UV400 480GB
G.SKILL FORTIS Series 32GB (2x16GB) DDR4 2400MHz CL15

As for video card and power supply I'm not too sure about...
Get the fastest CPU with the most cores that you can afford and a motherboard to go with it.

Your RAM looks fine and same with your SSD selection.

(Unless you're solving really big problems or solving lots of transient runs and saving all of the transient data, you're also not likely to be disk I/O bound either.)

#3  alpha754293  January 1, 2019, 21:22
Quote:
Originally Posted by LuckyTran View Post
The steady and transient solvers are virtually identical.

Any problem worth considering will be limited by RAM bandwidth. If you are not limited by bandwidth, then either you are not doing CFD the right way or your problem is too simple and there is nothing to consider. The bigger the mesh, the more you will be limited by bandwidth. But if you get a new workstation, or any recent hardware, there are few options for bandwidth. You either have 2/3/4 memory channels. It is usually a very easy choice (whatever is within your budget).

Then you go for optimizing the other (and probably more significant) bottlenecks. As already mentioned, writing files quickly is a huge bottleneck in transient simulations, if that is what you are trying to do.

The hardware requirements don't change from steady to transient per se. But in steady state you don't need to output the solution at every iteration because you are interested in the final converged solution. A new opportunity arises in transient simulations since now you have the option to save the solution every time-step. This would be like writing the solution every iteration in a steady case (but no one really does that because the information is useless). But in transient simulations, that data is usable to some people.
I, again, disagree.

Memory bandwidth has very little to do with it, and this follows from the fact that CFD solutions typically don't utilize as much RAM as a sparse or direct solver FEA solution/problem does.

Memory LATENCY, on the other hand, CAN have an impact; but again, if your problem is memory I/O limited, then the problem is probably embarrassingly small, such that all of the calculations can be performed in less than one full clock cycle of the CPU.

If the total number of floating point operations takes more than one full clock cycle to complete, then your limit will be the CPU's FLOPS, and not the memory bandwidth.

For example, even Intel's first-gen QPI (8.0 GT/s) link is a 256 Gbps (32 GB/s) link, and DDR3-1600 is a 102.4 Gbps (12.8 GB/s) memory channel. I switched from DDR3-800 to DDR3-1600 RAM and my runs only improved by 7-10% max, despite the fact that the bandwidth LITERALLY DOUBLED.

This demonstrates that CFD is NOT memory bandwidth limited. If it were memory bandwidth limited, the scalability would be much closer in proportion to the increase in the memory bandwidth.
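To spell that arithmetic out (a rough sketch; the per-channel numbers are the standard DDR3 peak figures, and the 7-10% is from my own runs above):

Code:
#!/bin/sh
# If the solver were purely memory-bandwidth-bound, wall time would scale
# roughly as 1/bandwidth, so doubling the channel bandwidth should cut the
# run time roughly in half. That is nowhere near what was observed.
awk 'BEGIN {
    bw_old   = 6.4;    # GB/s per channel, DDR3-800 peak
    bw_new   = 12.8;   # GB/s per channel, DDR3-1600 peak
    observed = 0.10;   # ~7-10% faster, upper end of what I measured

    expected = 1.0 - bw_old / bw_new;   # ~50% less wall time if bandwidth-bound

    printf "expected time reduction if bandwidth-bound: %.0f%%\n", expected * 100;
    printf "observed time reduction:                    %.0f%%\n", observed * 100;
}'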

#4  Simbelmynė  January 2, 2019, 07:18
Quote:
Originally Posted by alpha754293 View Post
I'm sorry, but I am going to have to disagree with the priorities listed here.

Unlike sparse/direct solvers for finite element analysis, CFD is actually not as memory intensive, in the sense that it doesn't load up the RAM nearly as much as a sparse/direct FEA problem/solution does.

Therefore, given that, memory bandwidth ISN'T the top priority in terms of running CFD.

If you're going to be running CFD, the most significant limiting factor will be how many cores you have and how fast (in floating point operations per second, or FLOPS) they can perform.

That's the true limit of CFD.

The ONLY time that memory comes into the picture is when you are solving for hundreds of millions of cells. Then sure. But if you're just running on a desktop or even a prosumer computer/workstation, you're only going to be able to run as big a problem as your RAM allows, and most problems are computationally limited rather than memory limited (unless you only have something like < 4 GB of RAM; but that's also because the OS itself has gotten significantly bigger, leaving less RAM available for the user).

I can run up to 10 million cells with less than 32 GB of RAM and still have the run take up to 42 days to complete because it's a transient, DES model on a 6-core CPU.

The same run with 32-cores finishes in just short of 10 days.

Memory bandwidth has almost nothing to do with it.
Quote:
Originally Posted by alpha754293 View Post
Get the fastest CPU with the most cores that you can afford and a motherboard to go with it.

Your RAM looks fine and same with your SSD selection.

(Unless you're solving really big problems or solving lots of transient runs and saving all of the transient data, you're also not likely to be disk I/O bound either.)



I think you have to disprove this thread or come up with some other form of evidence for your claims.

#5  alpha754293  January 2, 2019, 08:42
Quote:
Originally Posted by Simbelmynė View Post
I think you have to disprove this thread or come up with some other form of evidence for your claims.
Once I figure out how to get the RDMA statistics every second, I can literally measure it directly from the Infiniband port by purposely forcing the parallelization to run/span across nodes so that I can measure the RDMA throughput. (Intra-CPU-core memory bandwidth is very difficult to measure, given that it is nearly impossible to profile a commercial solver, although it might be possible to profile OpenFOAM if you compile it from source.)

But for me, it's vastly easier to just use ethtool to measure the RDMA bandwidth/usage by forcing a run to use only two cores, one on each node, and I can measure the amount of memory transferred via the RDMA unicast statistics much more directly.

So as soon as I can figure out a way to loop the ethtool command, I'll be able to generate the data for us all to look at.

*edit*
My QPI (8.0 GT/s) has a bandwidth of 256 Gbps (32 GB/s). My RAM on the other hand (DDR3-1600 2Rx4 ECC Registered) has a bandwidth of 102.4 Gbps (12.8 GB/s) and my 4x EDR IB has a max bandwidth of 100 Gbps (12.5 GB/s), of which, I know that I can hit between the 96-97 Gbps range without any tuning, but it is dependent on the message size.

In other words, if the theory is true and it is memory bandwidth limited, then - since intra-CPU-core bandwidth, as I mentioned, is very difficult to measure unless you're profiling your own compiled code - by forcing a simulation to run on multiple nodes, I can use RDMA/ethtool to "snoop" and "peer" into the "memory" bandwidth transfer, given that it is an RDMA link. And having previously established that my interconnect is 97.66% of the actual RAM bandwidth, of which I can hit 97% of the RDMA bandwidth (or about 94% of the RAM bandwidth), it should give me plenty of insight into how much memory bandwidth CFD actually uses for any given/particular run.

So...if you have a specific benchmark or case that you would like me to run this with, please let me know.

Please also note that, per the OpenFOAM website, their Linux pre-compiled binaries run in a Docker container, and they explicitly write that running OpenFOAM across multiple machines is NOT a trivial task. Therefore, I would prefer something that is either easier or already pre-configured to run across multiple machines (like one of the Fluent benchmark cases, for example). Or you can construct a case of your own for me to test, provided the solver is already pre-compiled to be able to run across multiple nodes for this very specific purpose/reason.

(96 Gbps = 12 GB/s. So when I run ethtool -S ib0 | grep "rdma_unicast_bytes" 2>&1 | tee -a ib0_stats.txt (in a loop, at one second intervals), that should write the data to a text file and then I can bring it into a spreadsheet to compute the deltas and that should tell us just exactly how memory limited it is.)
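Something like this would do the looping (a rough sketch; the interface and counter names are the ones from the command above - adjust them to whatever ethtool -S actually lists on your HCA):

Code:
#!/bin/sh
# Sample the RDMA byte counter once per second and log it with a timestamp.
# Successive differences of the logged values give the achieved RDMA
# throughput per second.
IFACE=ib0
COUNTER=rdma_unicast_bytes
OUT=ib0_stats.txt

while true; do
    bytes=$(ethtool -S "$IFACE" | grep "$COUNTER" | awk '{print $2}')
    echo "$(date +%s) $bytes" >> "$OUT"
    sleep 1
done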

Yes, I realise that this won't measure the host-CPU-core to host-memory transfers, but I am hoping that any RDMA requests that are host-CPU-core to remote-memory transfers will show up in the RDMA statistics.

Let me know how you would like me to set up this experiment and I'll figure out how to run it on my system and report back the results to you and the rest of the group here.

PS. My IB currently does not have a CPU affinity mask set, partly because I am just getting into it, partly because I'm lazy (and also partly because I am not sure if it will make any difference, because my system is a dual socket Xeon E5-2690 (v1) system with 4 DIMM slots per CPU, currently set up in an 8x 16 GB configuration). So if we REALLY want to, we can bind the IB to a specific CPU core and also perform one simulation using two cores, one core per node, and bind/set the CPU affinity mask for the solver processes as well (on each node), in order to get as accurate a test as you would want.

I know that the memory controller on those CPUs is almost always on the first core (I have Hyperthreading disabled on my systems ALWAYS), but I am not sure if there will be problems if I try to bind the IB to the same core as the core that's used for the solution, which is also the core that the memory controller is tied to (i.e. all on cpu0).

Again, you let me know how you would want me to test it and I can figure out how to set it up that way.
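Something along these lines is what I have in mind (a rough sketch only; the IRQ number and host names are placeholders, and it assumes an Open MPI-style launcher):

Code:
#!/bin/sh
# Pin the IB card's interrupt handling to cpu0 on each node. The IRQ number
# here is a placeholder; find the real one with: grep -i mlx /proc/interrupts
echo 1 > /proc/irq/64/smp_affinity    # bitmask 0x1 = cpu0

# Launch one solver rank per node and bind each rank to a fixed core, so the
# solver processes cannot migrate between cores during the run.
mpirun -np 2 -H node1,node2 --map-by ppr:1:node --bind-to core ./solver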

My last Fluent MPI run (across 32 cores) - the average number of MPI messages was only around 54k, and this was measured when my interconnect was only gigabit ethernet (and it didn't even get remotely close to pegging that).

But for the sake of argument, now that I have IB, I can measure the RDMA directly and that will definitely give us a significantly greater insight into this statement, and with the data, be able to demonstrate whether the statement is true or not.

Last edited by alpha754293; January 2, 2019 at 10:32.

#6  Simbelmynė  January 2, 2019, 12:22
If you wish to test it without the (minor) hassle of Docker, then you can always install OpenFOAM v6 on Ubuntu and enjoy patch releases.


If you check the first page you will see that many CPUs hit a soft wall at number_of_cores > (memory_channels x 2).

I don't think infiniband is relevant to the original post, but here are some results using heterogeneous clusters (taken from the benchmark thread):



Quote:
Originally Posted by spaceprop View Post
Updating my earlier table with more results.

X-axis is iter/s. The processor(s) and RAM are called out in the graph. All OF+v1712, OpenMPI 3.1.0, CentOS 7.5. I applied the following BIOS settings to all: HT disabled (if applicable), anything related to maximum performance on. This is slightly cleaned up from the last one and has the E5-2690v2 results added.



This one includes the results from my homelab 5-node cluster:



19.09 s (5.24 ips). Infiniband <3

#7  alpha754293  January 2, 2019, 15:46
Quote:
Originally Posted by Simbelmynė View Post
If you wish to test it without the (minor) hassle of Docker, then you can always install OpenFOAM v6 on Ubuntu and enjoy patch releases.
Quote:
How do I run parallel on multiple computers?
This is not trivial inside the Docker environment. Also you might want to include optimised communication libraries (MPI) so it probably makes more sense to perform a native compilation.
Source: https://www.openfoam.com/download/in...nary-linux.php

Again, as I have said, this isn't ME that's saying that running OpenFOAM across multiple computers is non-trivial. It's OpenFOAM that's saying running OpenFOAM across multiple computers is non-trivial.

So...I'm not entirely sure why you would want to disagree with the people who are responsible for developing OpenFOAM about the triviality of running OpenFOAM across multiple computers, given that they have explicitly stated that it is non-trivial.

I mean...it's your call. If you still want to disagree with the people who are responsible for developing OpenFOAM, there's certainly nothing that will stop you from doing so. Or at minimum, I won't be the reason why you should stop (disagreeing with them).

I'm just relaying what they've already stated.

My production environment is SLES 12 SP1. I am unlikely to switch distros for this test given that my production environment already works.

If anything, I could compile it on my system, and spaceprop appears to have been successful in getting OpenFOAM to run across multiple nodes. (I'm not sure what the n=100 means, but I'll take it at face value. He's also only running QDR, which means that the max data rate he has on the interconnect is 32 Gbps (4 GB/s), so there is a way to get it to work.)

I also don't use OpenFOAM normally anyways, so it is unlikely that I will retain the installation after I am done testing with it.

Quote:
Originally Posted by Simbelmynė
If you check the first page you will see that many CPUs hit a soft wall at number_of_cores > (memory_channels x 2).
Pardon my ignorance, but the first page of what?

By the way, the charts that you linked in this post - all of the data shows that there are more cores than there are memory channels.

There is no data present on those charts that shows where the number of cores = the number of memory channels. Therefore, the data shown in the charts that have been linked here does not contain enough information to demonstrate the statement regarding the "soft wall" when the number of cores is greater than the number of memory channels.

All of the data shown here is for cases where the total number of cores is greater than the number of memory channels. The only way that this statement could be shown to be true would be if there was a bar on that chart where the number of cores equals the number of memory channels.

But that's not the case.

Here is a screenshot of the spreadsheet that I put together looking at the data that you posted.

You will see that in all of the cases, the number of cores is greater than the number of memory channels.



Quote:
Originally Posted by Simbelmynė
I don't think infiniband is relevant to the original post, but here are some results using heterogeneous clusters (taken from the benchmark thread):
Well, the relevance of the Infiniband speaks to two points specifically:

1) The original statement about the performance of computing hardware for CFD applications being dependent on, and/or limited by, memory bandwidth;

and 2) measuring intra-core memory bandwidth is very difficult, if not impossible (without profiling the code at runtime). So, by forcing the traffic to go through a medium where I CAN measure the throughput/bandwidth/latency (i.e. taking advantage of the fact that Infiniband implements RDMA), we get a window that we can use to see how much data is actually being transmitted during the course of the run.

If the volume of data is, say, significantly lower than the interface bit rate/bandwidth limit, then that automatically tells us that CFD is, in fact, not bandwidth limited but latency limited (which is not the same thing).

You can have very little traffic, but having a very low latency means that the traffic can move very fast.

Conversely, if you have a lot of traffic, it still means that all of that volume of data can be pushed at the high speeds.

In fact, a significant portion in terms of why Infiniband is so fast is because of the latency improvements rather than the actual improvements of how fast you can push electrons through the copper (if they're direct attached cables) or the speed of light going through the optical encoder/decoder if you're using active optic cables. (We all know that light travels very fast. What people know less about is how fast the QSFP28 ports can encode an optical signal for transmission through the AOC.)

Again, since looking at the intra-core bandwidth is very difficult to do, forcing it to go through IB and taking advantage of the RDMA design and properties of IB allows us to peer into and gain insight into the very specific nature of the statement.

You will also notice that when spaceprop goes from n=40, gigabit, 2x E5-4627v3QS + 2x E5-2690v2 to n=40, QDR IB, 2x E5-4627v3QS + 2x E5-2690v2, his iterations per second only increase from like ~1.7 iter/s to ~2.1-2.2 iter/s.

That's only a 29.4% increase, despite the interconnect bandwidth having increased by a factor of 32 (3200%).

That means that out of a possible 32 Gbps bandwidth link, performance only increased (by changing from a 1 Gbps link to a 32 Gbps link) by 29.4%, which is about 0.9% of the bandwidth increase (29.4% / 3200%). (It got faster by 29.4%. The total bandwidth increase is 3200%. Therefore, it only "used" 29.4% of the 3200% bandwidth increase, which is about 0.9%.)

This is why I say that CFD isn't memory bandwidth limited.

If it were, you should see significantly better scaling with a significantly higher bandwidth link between the systems/nodes/solver processes that are running.

If it were memory bandwidth limited, the 32x increase in the interconnect bandwidth should yield a significant increase in the number of iterations that it can perform per second.

You change from a 1 Gbps (ethernet) interconnect to a 32 Gbps one (I AM assuming that it's a 4x QDR IB link that spaceprop is talking about); if it were memory bandwidth limited, you would think that you could get closer to the 32x scaling than just a 29.4% improvement for the cluster.

Remember, memory bandwidth isn't the same as memory latency.

You can have very high memory bandwidth by reducing the latency whilst keeping everything else exactly the same.

If you're trying to push 100 Gb (12.5 GB) of data through a 1 Gbps pipe, it's going to take you 100 seconds (yes, this is a theoretical calculation assuming no losses, no latency issues, no packet drops, etc.).

If you try to push 100 Gb (12.5 GB) of data through a 32 Gbps pipe, it's going to take you 3.125 seconds to push all that data through.

That's memory bandwidth limited.

But suppose that your average message size is 64 KiB (524,288 bits), with an average latency of 8.7 microseconds (8.7e-6 seconds); then the single-stream bandwidth would be 60.3 Gbps. A 1x EDR IB link is 25 Gbps. A 100 Gbps 4x EDR IB link is able to carry 1.66 times that rate of data (link limit), or about 190,734 messages of 524,288 bits per second. (Again, link limit.)

In my testing, my 100 Gbps 4x EDR IB link is capable of 96.41 Gbps total bandwidth, or 183,895 messages of 524,288 bits per second.

If you only have, say, 54,000 MPI messages being sent through (I am not sure if OpenFOAM actually has a statistic for the average number of MPI messages passed per iteration), and the average message size is 524,288 bits, then you're only transmitting 2.83e10 bits, while the message size divided by the message latency gives you a bandwidth of 6.03e10 bits/s - which means you're not even able to fully utilize the bandwidth that the message latency allows.

Therefore, this is latency limited, not bandwidth limited.

If you only have 1 Gb of data that needs to be transmitted, then on a 1 Gbps link, it will take you 1 second.

If you have 1 Gb of data that needs to be transmitted, and assuming that the average message size is 524288 bits with an average latency of 8.7e-6 seconds, then you will be able to transfer that 1 Gb of data in about 0.0166 seconds.
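Spelling that arithmetic out (same numbers as above: 64 KiB messages, 8.7e-6 seconds per message, and the ~54,000-message figure from my Fluent run):

Code:
#!/bin/sh
awk 'BEGIN {
    msg_bits = 64 * 1024 * 8;   # one 64 KiB message = 524288 bits
    latency  = 8.7e-6;          # seconds per message (measured on 4x EDR IB)
    n_msgs   = 54000;           # MPI message count quoted above

    rate  = msg_bits / latency; # latency-limited single-stream rate, ~6.03e10 bit/s
    total = n_msgs * msg_bits;  # total data moved, ~2.83e10 bits
    t     = n_msgs * latency;   # time to move it at that rate, ~0.47 s

    printf "latency-limited rate: %.3g bit/s\n", rate;
    printf "total data moved:     %.3g bits\n", total;
    printf "time to move it:      %.3g s\n", t;
}'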

This is what I am learning now that I have my Infiniband interconnect up and running. I knew that my sparse/direct solver FEA cases ARE memory bandwidth limited.

I also knew that I was able to run my CFD cases with just a gigabit ethernet as the interconnect just fine because the total volume of data can be EASILY handled with just the gigabit ethernet.

This is the very reason why I moved to the 4x EDR Infiniband as my interconnect now. Yes, having the 100 Gbps interconnect bandwidth is nice, but it's only really useful for sparse/direct solver FEA applications.

For CFD, it's the 8.7e-6 second message latency (for a 64 KiB message packet) that's the important part. It's not the bandwidth that's important. It's the latency.

And they're not the same thing, as shown by the total volume of data that's being transmitted (which, if you're running across multiple machines, you can measure with ethtool).

The RDMA capability of IB allows me to more directly measure the "memory bandwidth" of CFD runs in a way that would otherwise be very difficult or nearly impossible to measure without profiling the code (also by using ethtool).

#8  Simbelmynė  January 3, 2019, 11:03
Here are some good data, taken from the first page of the benchmark thread.





Quote:
Originally Posted by Simbelmynė View Post
That's strange. A thread like this definitely has the potential to be "sticky".

Oh, well, browsing to the last post only requires one extra mouse click


7940X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 3.10.0
Code:
# cores   Wall time (s):
------------------------
1 764.36
2 419.98
4 233.26
6 188.29
8 169
12 160.28
14 168.73
Threadripper 1950X, 32 (4x8) GB 3200 MHz RAM, CentOS 7.x, kernel 4.14.5 (SMT on)
Code:
# cores   Wall time (s):
------------------------
1 827.21
2 465.01
4 235.17
6 198.81
8 170.73
12 154.26
16 154.9
8700K, 32 (4x8) GB 3200 MHz RAM, Mint 18.3, kernel 4.13.0
Code:
# cores   Wall time (s):
------------------------
1 531.44
2 312.15
4 249.55
6 247.83
It is also interesting to analyze the meshing time.

For the 8700K system we have:
Code:
# cores   real time:
------------------------
1            16m35s
2            10m56s
4            07m01s
6            05m30s
While the 1950X performs as:
Code:
# cores   real time:
------------------------
1            23m32s
2            16m01s
4            08m44s
6            06m50s
8            05m48s
12          04m38s
16          04m12s
It seems that the meshing part is not as memory bound as the CFD solver.


The 7940X has a 1 core turbo boost of 4.6 GHz whereas the 8700k has a 6-core turbo boost of 4.3 GHz. So, if we ignore memory bandwidth then the 8700k should be faster in 1, 2, 4 and 6 core loads.



If you look at the 7940X results you will see that it is slower than the 8700K for 1 and 2 cores, but faster at 4+ cores. This is clearly a memory bandwidth problem.


Then if you look at both the Threadripper and the 7940X (4 memory channels), you see that they show very diminishing returns above 8 cores. The 8700K (2 memory channels) shows diminishing (actually no) returns after 4 cores. This is what I meant by many CPUs hitting a soft wall at number_of_cores > (memory_channels x 2).
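Put differently, here is the rule of thumb applied to the CPUs above (a quick sketch using their channel counts):

Code:
#!/bin/sh
# Soft-wall rule of thumb: scaling flattens once cores > memory_channels x 2.
awk 'BEGIN {
    split("7940X 1950X 8700K", cpu);
    split("4 4 2", ch);
    for (i = 1; i <= 3; i++)
        printf "%-6s: %d memory channels -> diminishing returns above ~%d cores\n",
               cpu[i], ch[i], 2 * ch[i];
}'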

#9  me3840  January 3, 2019, 15:11
Quote:
Again, as I have said, this isn't ME that's saying that running OpenFOAM across multiple computers is non-trivial. It's OpenFOAM that's saying running OpenFOAM across multiple computers is non-trivial.

So...I'm not entirely sure why you would want to disagree with the people who are responsible for developing OpenFOAM about the triviality of running OpenFOAM across multiple computers, given that they have explicitly stated that it is non-trivial.
You're coming off a bit belligerent here. There is more than one group of folks working on OpenFOAM. Notice how Simbelmynė stated OpenFOAM v6 has an Ubuntu package. You're looking at the site for OpenFOAM v1812, which comes from OpenCFD. CFD Direct has v6:

https://cfd.direct/openfoam/download/


Their version doesn't use Docker and is pretty easy to compile, even if you're only going to use it once.

#10  alpha754293  January 3, 2019, 15:53
Quote:
Originally Posted by Simbelmynė View Post
Here are some good data, taken from the first page of the benchmark thread.

The 7940X has a 1 core turbo boost of 4.6 GHz whereas the 8700k has a 6-core turbo boost of 4.3 GHz.
Sorry, but this reply is rife with errors.

To start with some basic data, the Intel Core i9-7940X has a single core max turbo (2.0) of 4.3 GHz rather than your stated 4.6 GHz max single core TurboBoost.

The Core i7-8700K has a max single core turbo of 4.7 GHz.

Quote:
Originally Posted by Simbelmynė
So, if we ignore memory bandwidth then the 8700k should be faster in 1, 2, 4 and 6 core loads.

If you look at the 7940X results you will see that it is slower than the 8700K for 1 and 2 cores, but faster at 4+ cores. This is clearly a memory bandwidth problem.
How can you be sure that this is a memory bandwidth issue?

Did you bind the solver processes to the respective CPU cores by setting the processor affinity mask, in order to prevent process migration and keep the processes from "bouncing" around between the cores?

As you mention, the wall clock time for a single core on the 7940X is 764.36 seconds. The wall clock time for a single core on the 8700K is 531.44 seconds. The difference in the single-core, turbo-boosted clock speeds between the 7940X and the 8700K is only 9.3% (4.7 GHz vs 4.3 GHz). However, the 8700K completes the single-core run in 30.47% less wall clock time than the 7940X.

Since you are saying that CFD is memory bandwidth limited, how would you account for the fact that the clock speed difference only explains 9.3% of the 30.47% difference in the wall clock time it took to solve this problem?

Again, did you bind the single solver process to a specific CPU core in order to prevent process migration?

The difference in the 2-core turbo-boosted speed between the 7940X and the 8700K is only 7.0%. Yet the wall clock time is again 25.68% shorter on the 8700K than on the 7940X. The clock speeds would only account for 7% of that 25.68% difference. How do you explain the rest of it, then?
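For reference, here is the arithmetic behind those percentage figures (shown for the single-core case, using the wall times and turbo clocks quoted above):

Code:
#!/bin/sh
awk 'BEGIN {
    t_7940x = 764.36; t_8700k = 531.44;   # single-core wall times, seconds
    f_7940x = 4.3;    f_8700k = 4.7;      # single-core max turbo, GHz

    printf "clock advantage of the 8700K: %.1f%%\n", (f_8700k / f_7940x - 1) * 100;
    printf "wall-time reduction:          %.2f%%\n", (1 - t_8700k / t_7940x) * 100;
}'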

As it is also noted on the wiki page of the benchmark itself:
Quote:
Please note that there are some differences in OpenFOAM versions used, memory configurations and so on that could influence the results. Use at your own risk.
Do you have data that would address the point that, in your own data, you are comparing CentOS 7.x, kernel 3.10.0 vs. Mint 18.3, kernel 4.13.0?

Don't you think that's a bit of an apples vs. oranges comparison when it has literally been noted on the OpenFOAM benchmark wiki page that there are differences between the kernels and the versions of OpenFOAM?

P.S. By the way, how do you know that, as the core count goes up, the reason you aren't seeing the scalability increase linearly isn't due to process coherency, as opposed to it being a memory bandwidth issue?

Quote:
Originally Posted by Simbelmynė
Then if you look at both the Threadripper and the 7940X (4 memory channels) you see that they show very diminishing returns above 8 cores. The 8700k (2 memory channels) show diminishing (none actually) returns after 4 cores. This is what I meant by many CPUs hit a soft wall at number_of_cores > (memory_channels x 2).
I'll tell you what - your theory itself is something that can be tested.

I was going to say that it can be tested "easily", but it's actually not that easy because I've read spaceprop's blog where he details all of the steps that he had to take to get OpenFOAM running on multiple nodes with an Infiniband interconnect.

So, two things can come of it:

1) If I were to actually pop another SSD into my system to test this, I can probably run the entire procedure that spaceprop went through so that I can tie two of my nodes together to test this. That won't be an issue.

Each node that I have is a dual socket node, and each Xeon E5-2690 (v1) has dual 8 GT/s QPI links (for processor-to-memory and processor-to-processor communication) tied to quad memory channels (DDR3-1600), which means that I will have access to eight memory channels per node, with one DIMM per channel.

2) By tying two nodes together over IB, I will double the total number of available channels from eight to sixteen (split across two nodes), and with a fair bit of research into setting the processor affinity mask for slave processes running on remote nodes, I can test your theory using 16 cores and 16 channels in a (4+4)+(4+4) configuration, binding to cpu0,2,4,6 on the first processor of each node and cpu7,9,11,13 on the second processor of each node, so as to maximize the bandwidth.

And then I will be able to run it with all 32 cores in total, with 16 memory channels.

And then I can also run it with 8 cores and 8 channels (on a single node) and 16 cores and 8 channels (also on a single node), and that will collect the data that we will need to assess your statement/hypothesis.

There isn't enough information in the raw numbers alone to show how the benchmarks were performed and whether care was taken to minimize and/or completely eliminate process migration, because there is no information as to whether the processor affinity mask was set at run time.

My counter hypothesis is that I think that people here are confusing bandwidth and latency.

I also think that people here are assuming that just because you have bandwidth, you will always utilize it to its full capacity.

If you have 100 gigabits of data to transmit/transfer, and suppose that you have a single gigabit per second link/interface/connection (like GbE -- think of it like a lane on a highway), then it will take you 100 seconds to send all of that data through.

If you have two lanes and still 100 gigabits of data to transfer, then your link is now 2 gigabits per second and it will only take you 50 seconds to transfer all of that.

A 10 gigabit per second link/interface/connection/"ten-lane highway" means that transferring 100 gigabits of data will only take you 10 seconds.

That's bandwidth.

But if you only have 0.5 gigabits of data to transfer, you can do that in 0.5 seconds with a gigabit per second link/"one-lane highway".

On a 2 gigabit per second link/"two-lane highway", you can send that through in 0.25 seconds.

On a 10 gigabit per second link/"ten-lane highway", you can send 0.5 gigabits of data in 0.05 seconds.

That's latency.

Yes, having more lanes means more bandwidth, but it doesn't actually necessarily and automatically mean that it is "bandwidth" limited.

If the total amount of "stuff" you're trying to send down the pipe is less than the maximum amount of stuff that any particular number of "lanes of a highway" can handle, then you're talking about latency, not bandwidth (even though, yes, for theoretical peak calculations, they're interrelated).

But again, if I have 100 bits that I want to send down a 100 gigabit per second pipe, then the time that it would take for all 100 bits to get transferred, at the same time, would be 1e-9 seconds.

However, if the "highway" isn't capable of that (i.e. it takes more time to get up on the onramp and get off on the offramp), then you're not going to see those 100 gigabit per second speeds because you're limited by what the highway can do.



Having more lanes open so that the data can be transferred closest to the latency limit of the link doesn't necessarily mean that's bandwidth limited.

It's latency limited.

It would be bandwidth limited if you have more data in total that you're trying to transmit at any moment/point in time than the link will allow you to.

Going back to the example of having a total of 100 gigabits of data that you want to transmit - on a one gigabit per second link, it's going to take you 100 seconds to transfer that.

If you open up more lanes or speed up the lanes and/or both, that'll increase your bandwidth so that you can push that total volume of data through faster.

But if you have 100 gigabits of data that you're trying to push through a 100 exabit per second link, then again, the total time it will take will be 1E-9 seconds. Maybe by then, latencies might have dropped to the nanosecond level so this might be actually, physically possible.

But for now, if you're trying to push 100 bits of data over a 100 gigabits per second link (such that the time it should take 100 bits to transfer in a 100 gigabits per second link is 1E-9 seconds), that's not possible right now with this 100 gigabits per second technology. (100 bits = 12.5 bytes). The bandwidth, on a 100 gbps link, for message sizes of 8 bytes is 0.71 gbps and for message sizes of 16 bytes is 1.41 gbps, which means that the average would be about 1.06 gbps (on a "highway" that's capable of 100 gbps).

You can see that's not bandwidth (because there's plenty of bandwidth capacity left).

You can also see that the limiting factor is latency.

And like I said in my counter hypothesis, I think that people here are confusing the two.

(Also like I said, I used to monitor the amount of network traffic going through my gigabit-per-second ethernet interconnect when running my CFD, and I know for a fact that it isn't the VOLUME of data being pushed through that's the limiting factor, but the rate at which the messages/packets/packages can be transmitted as a function of time. It'd be a very different story if the network interface were pegged for some time while the data was being shuffled back and forth, causing the CPUs to have to wait for the data I/Os to complete before they could resume working, but again, that's not the case. I bought my 4x EDR IB stuff not only because of its 100 Gbps available bandwidth, but because of its ultra-low latency (8.7 microseconds for a message size of 65536 bytes vs. gigabit ethernet's 1541.47 microseconds for the same message size).)

I understand why it is easy for people to get confused between the two because of their inherent, mathematical relationship. But they are two fundamentally different properties and aspects.

You can have lots of lanes, but that doesn't mean that you're going to fully utilize all of the available lanes to its maximum capacity, all the time.

Test it.

Run your OpenFOAM benchmark across multiple nodes with only a gigabit ethernet as your interconnect and then run it with an ultra low latency interconnect. And monitor just how much data/traffic is going through even your gigabit ethernet interconnect "pipe".

(I haven't done it with OpenFOAM, but I won't be surprised if it exhibits similar behaviour to Fluent and CFX on these lines - not much in the total overall volume of data being transmitted. But the gigabit ethernet latency is killing your ability to run it any faster which means you need to address the latency question in order to speed it up.)
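If you want to do that kind of monitoring, something like this is enough to log the interconnect traffic during a run (a rough sketch; I'm assuming the interconnect NIC shows up as eth0 - adjust the name to match your system):

Code:
#!/bin/sh
# Log received/transmitted byte counters on the interconnect once per second.
# Successive differences give the actual traffic during the solver run.
IFACE=eth0
while true; do
    rx=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    tx=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
    echo "$(date +%s) $rx $tx" >> ${IFACE}_traffic.txt
    sleep 1
done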

Like I said, I can test it.

Per spaceprop's blog, it's not easy and it isn't trivial, but it can be done with time.

(Luckily, I'm going to be doing this on only two nodes, and since he's done pretty much all of the necessary research to get it up and running, I can probably literally replicate his setup, albeit with fewer nodes.)

Thanks.

#11  alpha754293  January 3, 2019, 16:28
Quote:
Originally Posted by me3840 View Post
You're coming off a bit belligerent here.
I wonder how many people have actually paid attention and monitored how much data transfer actually occurs during the course of a run.

From the benchmark data/statistics, I would say that like 99% of the results returned so far are all single node systems, and as Simbelmynė mentioned, trying to measure the volume of data traffic going through the memory bus interface is really, REALLY, REALLY difficult.

On the contrary, because I run a multi-node setup (and I've been running multi-node CFD on a micro compute cluster for the past two years), I regularly monitor how much data is going through my then-GbE interconnect, with the express intent of finding out whether I'm limited by the volume of data being transmitted or by the speed (latency) at which the messages/packets can be transmitted.

And like I said, after testing it for two years, I went and bought the IB stuff.

(And also like I said, I ran the same GbE interconnect/network check when I am running FEA using sparse/direct solvers as well. And for FEA, the 100 Gbps bandwidth is what helps the FEA. For CFD, it's the latency that helps it. And IB does both.)

I am always of the opinion that if I am going to make a statement (or counter someone else's statement) that there had better be data and/or evidence to support it.

It's amazing to me how difficult it is to have a data/evidence-based discussion on a technical forum, with technical people.

Either I, myself, or someone else makes a statement.

You'll note that my first question back was to ask for the data that was used to arrive at that conclusion. (Again, I've been testing this with my own stuff for the past two years. And in my two years of monitoring my interconnect traffic, the data that I have collected does not support, and in fact contradicts, the statement; that's what made me really curious as to why someone would make a statement like that. So, again, my first question back was to see a copy of the data that was used to arrive at that conclusion/statement.)

Quote:
Originally Posted by me3840
There is more than one group of folks working on OpenFOAM. Notice how Simbelmynė stated OpenFOAM v6 has an Ubuntu package. You're looking at the site for OpenFOAM v1812, which comes from OpenCFD. CFD Direct has v6:

https://cfd.direct/openfoam/download/


Their version doesn't use docker and is pretty easy to compile, even if you're only going to use it once.
I just googled "openfoam download" and the OpenFOAM v1812 from OpenCFD is the first link that comes up.

To that end though, I don't think it would necessarily be appropriate to assume that people here automatically know the differences between the various OpenFOAM versions from the various providers. This is further complicated by the fact that OpenFOAM is a registered trademark of OpenCFD Ltd. (www.opencfd.com), but the website openfoam.org is for the OpenFOAM Foundation, which licenses OpenFOAM from OpenCFD. I'm not sure why they would do that, since it can lead to so much confusion about and between the two.

I think that's an assumption on his part.

And of course, making undeclared assumptions is always a dangerous proposition to undertake.

Re: OpenFOAM v6
The Ubuntu package doesn't use Docker.

But if I want to run it on another Linux distro (like SLES 12 SP1 - my production environment), it says:

Quote:
...but can be installed on 64 bit distributions of Linux using Docker to provide a self-contained environment that includes code, runtime, system tools and libraries, independent of the underlying operating system.
Source: https://openfoam.org/download/6-linux/

With that being said, if I can get my IB stuff to work with Ubuntu, then I might be able to test this much faster than what it took for spaceprop to get his cluster up and running.

(Luckily, I have two 100 GB SSDs that are sitting around doing nothing, so I might be able to pop those in for testing purposes instead. TBD... (My compute nodes are currently running jobs right now, but once those are complete, I can probably set this up and try it. It might not be for about a month though, at the earliest.))

Thanks.

#12  me3840  January 3, 2019, 16:59
I don't think anyone is having a hard time having an evidence-based discussion here. I was just saying that, with your writing style, you come across a bit aggressive, but alas those are the pains of reading only text. We're all pretty receptive to new data.

Anywho, your experience differs from mine, both with my own cluster and with commercial clusters; I tend to agree with Simbelmynė. However, I have retired my personal cluster and am working on designing a new one, so I can't really back myself up at the moment. Anecdotally, I could get a first-gen i7 to perform the same as a fourth-gen i5 with similar clock speeds. Same OS, with locked process affinity. The first-gen has triple-channel memory, which I suspected made up much of the difference. I was not that scientific about it; all I was interested in was "how fast can what I have go". I think I saved the data someplace and I'll see if I can get anything useful out of it for our discussion.

At work (IIRC) we did testing with 28 and 32 CPU nodes and found we didn't get the expected speedup from the 32-core boards, also suspecting the drop in bandwidth per core to be the cause. I don't have the data for that, since I left the company.

I suspect that the mantra that memory bandwidth is the limiter is not true in the strict sense, but is practically true. There are lots of things that happen within a single iteration; not all of them are Ax=b, and not all of them have the same limitations on hardware. So suggesting that one can get 3200% performance by moving from gigabit ethernet to Infiniband is most certainly false; however, increasing memory bandwidth is likely to give one the greatest performance boost, outside of adding CPUs. Typically the number of CPUs you can get is the gorilla constraining the budget, so knowing what to do to increase performance after that is very valuable, especially when designing single workstations like folks here often do.

Getting my Infiniband switch to run with Ubuntu (I think it was 16.04) was trivial, so I'm not sure what spaceprop had to go through. Mine was QDR as well, a Voltaire 4036 with Mellanox HCAs.

Now I wish I had my compute nodes again so I could test as well; it may take me some time to get a machine ready but I'd like to duplicate your experiment.

I might suggest we try to use a benchmark which has identical loading for each CPU, like a channel flow application on a regular grid instead of the motorbike benchmark, to remove as many variables as possible. Or we can just run both, for fun. Scotch/Metis are practical, but if we could get more consistent decompositions that might reveal a bit more.
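For instance, something like this (a sketch; the subdomain counts are just an example) would give every rank an identically sized block on a regular grid, instead of whatever Scotch/Metis decides:

Code:
#!/bin/sh
# Write a system/decomposeParDict that slices the domain into 2x2x2 = 8 equal
# blocks, then decompose the case (run from inside the case directory).
cat > system/decomposeParDict <<'EOF'
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}

numberOfSubdomains  8;

method              simple;

simpleCoeffs
{
    n               (2 2 2);    // blocks in x, y and z
    delta           0.001;
}
EOF

decomposePar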

It will take me a bit to go through your other posts and see if I agree with things, though.

#13  Simbelmynė  January 3, 2019, 17:13
Quote:
Originally Posted by alpha754293 View Post
Sorry, but this reply is rife with errors.

To start with some basic data, the Intel Core i9-7940X has a single core max turbo (2.0) of 4.3 GHz rather than your stated 4.6 GHz max single core TurboBoost.

The Core i7-8700K has a max single core turbo of 4.7 GHz.

You are correct. The 1-core turbo boost of the 7940X is 4.3, not 4.6. I call that a typo, not "rife with errors". The conclusion is the same - just looking at the same Skylake (refresh refresh refresh..) architecture, the 8700K should be faster for 1, 2, 4 and 6 core loads, IF we have no problems with memory bandwidth.



Do you suggest that the discrepancy comes from memory latency?




Quote:
Originally Posted by alpha754293 View Post
How can you be sure that this is a memory bandwidth issue?

Did you bind the solver processes to the respective CPU cores by setting the processor affinity mask in order to prevent process migration and having it "bounce" around between the cores?


No, I did not; it did not improve the results. In the benchmark thread you can see different affinity tests and how they affect the results.





Quote:
Originally Posted by alpha754293 View Post
As you mention, the wall clock time for a single core on the 7940X is 764.36 seconds. The wall clock time for a single core on the 8700K is 531.44 seconds. The ratio of the single core, turbo boosted clock speeds between the 7940X and the 8700K is only 9.3% (4.7 GHz / 4.3 GHz). However, the speed increase on the 8700K is 30.47% FASTER for the single core run than the 7940X.

Since you are saying that CFD is memory bandwidth limited, how would you account for the fact that the clock speed difference only explains 9.3% of the 30.47% difference in the wall clock time it took to solve this problem?


I cannot explain that. However, you can also compare against a 7820X (this is using a later version of the kernel), which also maxes out at approximately the same solution time as the 7940X. The 7940X is using 2R memory; I am not sure if the 7820X is using 1R or 2R.



Quote:
Originally Posted by The_Sle View Post
7820X@4,6Ghz, 4x8 GB 3400MHz RAM, Ubuntu 17.10, kernel 4.13.0-32

Code:
# cores   Wall time (s):  Speedup:
----------------------------------
1          756.42           1
2          376.09           2,0
4          205.46           3,7
6          168.24           4,5
8          160.05           4,7
Could it be that past 6 cores 4 channel memory is causing a bottleneck?

Here are some results from benchmarks of the same CPU. Linux Mint 18.x, kernel 4.15.x; two different memory kits were used, one for 2133-2400 and another for 3200+.



Code:
core i7 7600k standard turbo frequencies
1C 4.2 GHz
2C 4.1 GHz
3C 4.1 GHz
4C 4.0 GHz


core i7-7600K, DDR4 2133 MHz CL 15
# cores   Wall time (s):
------------------------
2 423.92
4 357.42


core i7-7600K, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 389.9
4 322.65


core i7-7600K, DDR4 2400 MHz CL 13
# cores   Wall time (s):
------------------------
2 388.86
4 320.97

core i7-7600K, DDR4 3200 MHz CL 15
# cores   Wall time (s):
------------------------
2 364.07
4 291.68

core i7-7600K, DDR4 3466 MHz CL 15
# cores   Wall time (s):
------------------------
2 335.94
4 268.92


UNDERCLOCKED (all cores) ******************



 core i7-7600K @ 3000MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 441.56
4 334.5


OVERCLOCKED RESULTS (all cores) *********************



 core i7-7600K @ 4500MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 385.99
4 319.42


core i7-7600K @ 4700MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 379.26
4 321.16


core i7-7600K @ 4400MHz, DDR4 3200 MHz CL 15
# cores   Wall time (s):
------------------------
2 341.53
4 278.77

core i7-7600K @ 4700MHz, DDR4 3200 MHz CL 15 *Probably throttling, more testing needed*
# cores   Wall time (s):
------------------------
2 344.83
4 287.32

core i7-7600K @ 4500 MHz, DDR4 3466 MHz CL 15
# cores   Wall time (s):
------------------------
2 326.45
4 267.25

#14  alpha754293  January 3, 2019, 17:43
So let me start with this:



This is the network utilization plot from my 8-core system (single socket, also Xeon E5-2690 (v1), 8-core, 2.9 GHz base, all core turbo 3.3 GHz, max turbo 3.6 GHz), 128 GB of DDR3-1600 ECC Reg 4Rx4 RAM running at DDR3-800 speeds only because of the 4R.

This is currently running a 16-process job, across two 8-core nodes (both are HP Z420 workstations, and both have the exact same memory configuration), over gigabit ethernet as the system interconnect.

You will see that the network doesn't go much above 4% network utilization (0.04 Gbps on a 1 Gbps network/interconnect).

The problem itself is a DES transient external aerodynamics problem with 13.5 million polyhedra cells, running with a time step of 1e-4 seconds for 20,000 time steps (2 seconds total simulation time).

Ignoring the details and specifics of the run (as this isn't a discussion about the model, but rather the "memory bandwidth" question): the processor has dual 8 GT/s QPI links, and there are 8 DIMM slots on the HP Z420 motherboard for a single socket, which means that I have 2 DIMMs per channel, and the processor has quad DDR3-1600 memory channels (which, again, because of the 4R, are running at DDR3-800 speeds).

Therefore, across the two nodes, I have four 8 GT/s QPI links and 8 memory channels, all running at DDR3-800 speeds.

The QPI links alone are 256 Gbps links each. (And I have a total of four of them, which means that my total QPI bandwidth is 1024 Gbps.)

The DDR3-800 RAM are 51.2 Gbps channels, *4 = 204.8 Gbps * 2 nodes = 409.6 Gbps.

And this is passed through a single gigabit ethernet network/interconnect, which means that I am severely restricting the bandwidth down from the QPI link and down from the memory channel/interface bit rates.

Despite all of that, the internode traffic doesn't even break the 0.05 Gbps mark.

So...either Fluent is really smart about what and how it uses the MPI interface, so that it ONLY transfers what it needs to transfer over the interconnect (i.e. somehow, the MPI messages need to be able to tell whether the message is INTRAcore vs. INTERnode), OR the MPI messages don't know/can't tell the difference between INTRAcore and INTERnode, nor do they care.

I don't know enough about MPI programming to know if you can even "tag" an MPI message like that, such that it knows what its target/destination is, so that it passes only the necessary information INTERnode and possibly passes MORE information within the MPI message when it is INTRAcore.

I surmise that the MPI specification doesn't have that level of granularity and that the MPI messages go wherever they need to go, which would mean that if the traffic doesn't even break the 0.05 Gbps mark for a 16-process, dual-node run (single socket, octo-core processor per node), there's not that much traffic, and therefore it's not that memory bandwidth limited.

I would expect that if it were more memory bandwidth limited (rather than MPI message latency limited), my gigabit ethernet network/interface would be pegged at various points in time while the contents of memory were exchanged (a la a sparse/direct solver). (Again, if you want to see a memory bandwidth heavy program, run a sparse/direct FEA solver with a couple hundred thousand second-order tetra elements using double precision. Then you'll really see what a memory bandwidth intensive application looks like (because that WILL peg my 100 Gbps IB to its limit during the course of the run), whereas even a DES, transient, 13.5 million polyhedra CFD run doesn't even touch 0.05 Gbps.)

#15  Simbelmynė  January 3, 2019, 18:04
No comments on my benchmark of the 7600K?


As far as I understand you are using an 8-core CPU with 4 memory channels?


You can see (benchmark thread) that a dual 2695v2 manages a solution time of 140 seconds when using 16 cores (twice the number of memory channels). At 20 or 24 cores the solution time is 137 seconds.


The implementation of Fluent is likely quite different. Have a look at this thread for a good benchmark series.

#16  me3840  January 3, 2019, 18:11
Quick note, MPI can definitely tell which ranks are across nodes and which share boards:

https://www.open-mpi.org/faq/?catego...run-scheduling

IIRC you can even differentiate between sockets on the same board.
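For example (Open MPI 3.x syntax, with hypothetical host names and a placeholder solver binary), the launcher can map and bind ranks by node, socket, or core, and report exactly where each rank landed:

Code:
# Place 8 ranks on each of two nodes, bind every rank to one core within its
# socket, and print the node/socket/core assignment for each rank at startup.
mpirun -np 16 -H node1:8,node2:8 \
       --map-by socket --bind-to core --report-bindings \
       ./solver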

#17  alpha754293  January 3, 2019, 18:21
Quote:
Originally Posted by Simbelmynė
No I did not, it did not improve the results. In the benchmark thread you can see different affinity tests and how those affect the results.

Quote:
Originally Posted by Mr.Turbulence View Post
Hi everyone,

I tried to run the OF benchmark with the 2x AMD EPYC 7351, 16x 8GB DDR4 2666MHz, with OpenFOAM 5.0 on Ubuntu 16.04.

I ran the calculation by binding the processes to core, to socket and none on 16 and 32 cores.

There are the results below :

HTML Code:
# cores   Wall time (s)   Wall time (s)   Wall time (s)
          core            socket          none
--------------------------------------------------------
1         -               -               922
16        153.34          55.7            65.78
32        70.8            38.68           38.8
I don't understand why the calculations take so much longer when binding the processes to core, compared to the other results.

Do you think it could come from the fact that hyper-threading is on?

Thanks in Advance
Weren't other people also saying that the Epyc processor is actually a dual-die multi-chip module (MCM), and that only one of the memory controllers (out of the two chips) is active per CPU socket package?

In other words, I thought I also read in the benchmark thread that because memory I/O requests have to be routed through the memory controller on the other die to ACTUALLY talk to the RAM, this was part of the root cause of the slowdown; i.e. the "16-core" Epyc is actually two eight-core dies fused onto a single MCM, with the memory controller on the second die disabled and its traffic routed through the MCM?

Wouldn't that VASTLY increase the memory latency between the CPU core that's "closest" to the memory controller vs. the CPU core that's "furthest" away from the memory controller?

(P.S. The old quad-socket AMD Opterons were also notorious for this, because the old HyperTransport topology required a minimum of two hops from the core that was furthest from the memory controller. In those old quad-socket Opteron systems, only one of the processors had direct access to the RAM; the rest had to route through cpu0 to get to it. So the cores adjacent to cpu0 had only one hop to go, while the cores "diagonally across" from cpu0 had to go through one of the adjacent CPUs before reaching the memory controller.)

Quote:
Originally Posted by Simbelmynė
I cannot explain that. However, you can also compare against a 7820X (this is using a later version of the kernel), which also maxes out at approximately the same solution time as the 7940X. The 7940X is using 2R memory; not sure if the 7820X is using 1R or 2R.

Here are some results from benchmarks of the same CPU. Linux Mint 18.x, kernel 4.15.x, two different memory kits: one used for 2133-2400 and another for 3200+.

Code:
core i7 7600k standard turbo frequencies
1C 4.2 GHz
2C 4.1 GHz
3C 4.1 GHz
4C 4.0 GHz


core i7-7600K, DDR4 2133 MHz CL 15
# cores   Wall time (s):
------------------------
2 423.92
4 357.42


core i7-7600K, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 389.9
4 322.65


core i7-7600K, DDR4 2400 MHz CL 13
# cores   Wall time (s):
------------------------
2 388.86
4 320.97

core i7-7600K, DDR4 3200 MHz CL 15
# cores   Wall time (s):
------------------------
2 364.07
4 291.68

core i7-7600K, DDR4 3466 MHz CL 15
# cores   Wall time (s):
------------------------
2 335.94
4 268.92


UNDERCLOCKED (all cores) ******************



 core i7-7600K @ 3000MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 441.56
4 334.5


OVERCLOCKED RESULTS (all cores) *********************



 core i7-7600K @ 4500MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 385.99
4 319.42


core i7-7600K @ 4700MHz, DDR4 2400 MHz CL 15
# cores   Wall time (s):
------------------------
2 379.26
4 321.16


core i7-7600K @ 4400MHz, DDR4 3200 MHz CL 15
# cores   Wall time (s):
------------------------
2 341.53
4 278.77

core i7-7600K @ 4700MHz, DDR4 3200 MHz CL 15 *Probably throttling, more testing needed*
# cores   Wall time (s):
------------------------
2 344.83
4 287.32

core i7-7600K @ 4500 MHz, DDR4 3466 MHz CL 15
# cores   Wall time (s):
------------------------
2 326.45
4 267.25
Quote:
Originally Posted by Simbelmynė View Post
Do you suggest that the discrepancy comes from memory latency?
Yes!!!

(They used to have a tool that ran in the Windows command prompt and measured the cache and RAM latencies. I surmise that there are newer versions of such tools that run on Linux/SLES/Ubuntu and can test the same thing, just as I am able to test the IB latency; that would be the closest way of testing the RAM latency on a remote node, in addition to testing the RAM latency within a single node.)

I was also watching a YouTube video of Brian Lunduke interviewing Paul Biband (https://www.youtube.com/watch?v=SdWWAvspoy4), where Paul said that with the new AMD Epyc platform, even though you have a second set of DIMM slots, the first CPU won't be able to access the memory (and everything else) associated with the second CPU socket if the second CPU isn't installed. To quote/paraphrase: "it just shuts off". Paul also alluded to the Intel platform not having this issue. (I vaguely recall that in the early 2000s Intel had either the same or a similar issue, but now that Intel is pushing people toward their dual-socket products more than their quad- or even octo-socket parts, it wouldn't make sense for them to shut off, for example, half of your available PCI Express lanes when the second CPU isn't installed. I am hoping that I heard and understood what Paul was saying correctly.)
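
The basic idea behind those latency tools can be sketched with a simple pointer-chasing loop. The snippet below is only my own illustrative sketch (the file name, array size, use of rand(), and the lack of NUMA/core pinning are all simplifying assumptions), not any particular vendor's tool:

Code:
/* latency_chase.c (hypothetical): estimate average main-memory latency by
 * chasing pointers through a random single-cycle permutation, so the
 * hardware prefetcher cannot predict the next address.
 * Build with e.g. "gcc -O2 latency_chase.c" on Linux. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024 / sizeof(size_t))   /* ~64 MB, well past L3 */

int main(void)
{
    size_t *next = malloc(N * sizeof(size_t));
    if (!next) return 1;

    /* Sattolo's algorithm: build a single random cycle over all elements. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(12345);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t hop = 0; hop < N; hop++) p = next[p];   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average latency per dependent load: %.1f ns (check: %zu)\n",
           ns / (double)N, p);
    free(next);
    return 0;
}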

Again, my evidence for why I think this is a latency issue rather than a bandwidth issue: when I perform a CFD run across multiple nodes, I can see how much data is actually being transferred between the cores via the network interface/interconnect.

Again, if this were a memory-bandwidth-intensive application (rather than a latency-sensitive one), I would expect the network interface/interconnect traffic to go through the roof, the way a sparse/direct FEA solver behaves (which I have literally seen happen). That is confirmed by either the per-process I/O read/write bytes in Windows Task Manager or by "top" in Linux: a single FEA run with 500,000-700,000 second-order tetrahedral elements, running in double precision, will move on the order of 50 TB of data across all of the processes. CFD generally and typically won't move that much data, even when you have hundreds of millions of cells, unless you're doing something very specific or very funky, or the solver is doing something you neither wanted nor intended ("oops"). That amount of data transfer is not "regular business" in the CFD world (based on my experience and having performed benchmark after benchmark with my own test cases), but it WOULD be "regular business" for just about any sparse/direct FEA solver with that many elements (more or less, depending on what you're simulating; one of my runs takes 250-300 equilibrium iterations to solve).



I'm running almost exactly the same case (as the Fluent run), but this time with 32 processes across two nodes; each node is dual socket, each socket is again a Xeon E5-2690 (v1), with 128 GB of RAM, DDR3-1600 2Rx4 (so it's actually able to run at DDR3-1600 speeds).

(I won't run through the total bandwidth calculations like I did with my dual Z420 workstations; you can pretty much just double everything, except that the total number of DIMMs is the same at 8 per node.)

Again, you can see that it doesn't pass the 10 Mbps mark on a GbE network/interconnect, which means that it's 0.01 Gbps.

This run "only" has 7.2 million tetrahedrals, also DES, dt=1e-4, niter=20000, single precision.

Again, if it were memory bandwidth intensive, I would expect the network interface/interconnect to be pegged during each coefficient loop/iteration. And it's not. Far from it, in fact.

This is why I don't think that CFD is a memory intensive application because I don't think that the MPI messages are smart enough to be able to tell whether the message destination is INTRAnode vs. INTERnode.

I think that the MPI messages just go wherever they need to go.

And as such, if a run with 32 processes across two nodes (16 CPU cores per node) only shows about 0.01 Gbps of network/interconnect utilization, I would not call that memory bandwidth limited.

Latency sensitive/limited, for sure, but I don't think that it's bandwidth because it just isn't pushing that much data between the nodes, and thus, between the cores.

Thanks.
alpha754293 is offline   Reply With Quote

Old   January 3, 2019, 18:38
Default
  #18
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by me3840 View Post
Quick note, MPI can definitely tell which ranks are across nodes and which share boards:

https://www.open-mpi.org/faq/?catego...run-scheduling

IIRC you can even differentiate between sockets on the same board.
The processes are scheduled, between the MPI implementation and the OS/kernel scheduler, in "round robin" order.

So if and when I get around to testing this, and I want to peer into how much data is being transferred between cores (i.e. intercore, internode traffic), I would purposely run a 2-core job like this:

Code:
$ cat my-hosts
node0 slots=1 max_slots=16
node1 slots=1 max_slots=16
$ mpirun --hostfile my-hosts -np 2 ...
so that it forces parallel OpenFOAM to spawn only one process per node across the two nodes, letting me "take a look inside" at what and how much data is being transferred between the two cores.

But if I were to double that (run a 4-core job across the two nodes by changing "slots=1" to "slots=2" and "-np 2" to "-np 4"), then I would also need to know which MPI messages are being sent along each of these paths:

Code:
node0/cpu0 -> node0/cpu1
node0/cpu0 -> node1/cpu0
node0/cpu0 -> node1/cpu1
node0/cpu1 -> node1/cpu0
node0/cpu1 -> node1/cpu1
node1/cpu0 -> node1/cpu1
Like I said, I don't know much about programming (of any sort, in general), so I know even less about MPI programming. But I'll go out on a limb and assume that you can't actually tell MPI which of those 6 paths a specific message takes. I'm not sure how it knows where to go, and I'm even less sure whether you can tell it where to go.
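
One thing that can be done, at least, is to label the endpoints. The sketch below is a rough illustration of my own (not something Fluent or OpenFOAM exposes, and the file name is hypothetical): every rank reports the node it runs on, so any pair of communicating ranks can be classified as intranode or internode after the fact.

Code:
/* rank_map.c (hypothetical): print which node every MPI rank runs on. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &name_len);

    /* Gather every rank's host name on rank 0 and print the rank -> node map. */
    char *all = NULL;
    if (rank == 0) all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Gather(host, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int r = 0; r < size; r++)
            printf("rank %d runs on %s\n", r,
                   &all[(size_t)r * MPI_MAX_PROCESSOR_NAME]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}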

IF MPI can tell whether a message is staying INTRAnode (and prefers to keep as much data as possible within the node), sending only as little data as possible across nodes, then that's pretty smart.

But then that would negate the need for a hybrid SMP/MPP programming paradigm (if MPI were that smart), where INTRAnode work is handled via OpenMP (shared memory) and INTERnode message transfers are handled via MPI. (See hybrid LS-DYNA as an example of such a hybrid SMP/MPP implementation, where the use of OpenMP within a node is meant to reduce the communication overhead that MPI incurs, especially when you have tens of thousands of processes running across tens of thousands of cores.)
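
To illustrate the hybrid pattern, here is a minimal sketch of my own (not how LS-DYNA is actually written; the file name is hypothetical): one MPI rank per node handles the internode communication, while OpenMP threads share the work among the cores inside the node, so only the reduced result ever crosses the interconnect.

Code:
/* hybrid.c (hypothetical): minimal MPI + OpenMP ("SMP/MPP") pattern.
 * Build with e.g. "mpicc -fopenmp hybrid.c" and launch one rank per node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intranode work: shared-memory parallel across the node's cores. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i);

    /* Internode communication: only the reduced result crosses the wire. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (OpenMP threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}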

(And I can also say that, having performed SPH runs across two nodes with a GbE interconnect, running 32 MPI/MPP processes across the two nodes was slower than running the same SPH job within a single node with 16 MPI/MPP processes.)
alpha754293 is offline   Reply With Quote

Old   January 3, 2019, 18:52
Default
  #19
Senior Member
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 1,957
Rep Power: 30
flotus1 will become famous soon enoughflotus1 will become famous soon enough
And I disagree with that

Quote:
Memory bandwidth has very little to do with it and this is given by the fact that CFD solutions typically don't utilize as much RAM as compared to a sparse or direct solver FEA solution/problem
Total memory utilization has no direct impact on how bandwidth intensive a code runs. But I can agree that FEA is one of the few applications that are even more bandwidth-heavy than FV-CFD.

Quote:
if your problem is memory I/O limited, then the problem is probably embarrassingly small such that all of the calculations can be performed in less than one full clock cycle of the CPU.
"Embarrassingly" small problems run in cache and thus benefit from the even higher bandwidth that L2 and L3 caches provide compared to memory. Then again, the problem size does not change the code balance. So I don't see where you are going with this argument.

Quote:
If the total number of floating point operations takes more than one full clock cycle to complete (or more than one second (e.g. one Hertz) to complete), then your limit will be CPU's FLOPS, and not the memory bandwidth.
Without caches and prefetching there might be some truth to this. But we have CPUs with caches and quite sophisticated prefetching methods.

Quote:
For example, even the Intel's first gen QPI (8.0 GT/s) link is a 256 Gbps bandwidth link (32 GB/s). DDR3-1600 is a 102.4 Gbps (12.8 GB/s) memory bandwidth link. I switched from DDR3-800 to DDR3-1600 RAM and my runs only improved by 7-10% max despite the fact that the bandwidth LITERALLY DOUBLED.
I am not going to speculate what caused your anomalous result. Every test I ran with commercial FV solvers confirmed the common knowledge: CFD with commercial finite volume codes on today's multi-core CPUs is bandwidth limited to a much greater extent than your results suggest.
Sure, you can write your own code that e.g. relies on spatial or temporal blocking to alleviate the memory bandwidth bottleneck. But these techniques have their limits.
__________________
Please do not send me CFD-related questions via PM
flotus1 is offline   Reply With Quote

Old   January 3, 2019, 18:57
Default
  #20
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by Simbelmynė View Post
No comments on my benchmark of the 7600K?
The benchmark data doesn't show how the memory latencies change as the variables are varied throughout the suite of benchmark tests.

The benchmark time data also doesn't tell me the total volume of data that was passed between the cores, nor when those data transfers occurred.

Unlike bandwidth, there isn't as straightforward a method of calculating what the theoretical latencies are or should be, compared to being able to calculate, with basic arithmetic, what the theoretical maximum bandwidth rates are.

Therefore; the data is incomplete.

Again, take, for example, the percent reduction in wall clock time.

With each change, you should also be able to compute the corresponding percent increase in memory bandwidth.

Like the comparison between the Core i9-7940X and the Core i7-8700K for the single-core and dual-core cases, where only part of the ~25-30% improvement in wall clock time could be explained by the increase in CPU clock speed and no explanation was available for the rest, you would have to repeat that exercise here.

Looking at just your first two data points:

7600K with DDR4-2133 CL 15 vs. 7600K with DDR4-2400 CL15

2core: 423.92 vs. 389.9 = 8.725% faster
4core: 357.42 vs 322.65 = 10.78% faster

2400/2133 (as a quick way to approximate the improvement in bandwidth) = 12.52% more bandwidth.

So in the case of the 2 cores, you only got 8.725% faster out of the possible 12.52% faster (or 69.7% of what's possible).

And this is at the same clock speed (4.1 GHz vs. 4.1 GHz). Why would you only be able to get 69.7% of the 12.52% gains by changing the memory speed? What happened?

Now looking at the 4core results:
2133 vs 2400 = 357.42 vs. 322.65, or 10.78% faster.

Again, the improvement you should have been expecting was 12.52%, but it only got 10.78% better, which is 86.09% of the 12.52% improvement that you should have been able to get.

Why did it only get 86.09% of the 12.52% improvement when you increased the RAM clock speed from 2133 to 2400?
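
For anyone who wants to check or extend the arithmetic, here is a throwaway snippet of mine (file name hypothetical) that recomputes those percentages from the wall times already posted in this thread:

Code:
/* scaling_check.c (hypothetical): re-derive the 69.7% / 86.09% figures. */
#include <stdio.h>

static void check(const char *label, double t_slow, double t_fast,
                  double mhz_slow, double mhz_fast)
{
    double speedup   = (t_slow - t_fast) / t_fast * 100.0;   /* % faster  */
    double bandwidth = (mhz_fast / mhz_slow - 1.0) * 100.0;  /* % more BW */
    printf("%s: %.2f%% faster vs. %.2f%% more bandwidth -> %.1f%% realized\n",
           label, speedup, bandwidth, speedup / bandwidth * 100.0);
}

int main(void)
{
    check("2 cores, 2133 -> 2400 MHz", 423.92, 389.90, 2133.0, 2400.0);
    check("4 cores, 2133 -> 2400 MHz", 357.42, 322.65, 2133.0, 2400.0);
    return 0;
}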

And lastly, was the Core i7-7600K an unreleased or OEM/engineering-sample processor? I couldn't find out how many memory channels it has, nor the supported memory type or speed.

I trust that you can perform the rest of the analysis in the same manner.

If it were bandwidth limited, I would expect the increase in memory clock speed to yield more than 70% of the expected 12.52% improvement in the wall-clock time to solution.

Again, I think that it would be important to look at how the latencies changed with the RAM clock speed increase as that might help to explain why you weren't able to get the full benefit of the RAM clock speed increase.

Thank you.
alpha754293 is offline   Reply With Quote
