Discussions about memory bandwidth and scaling

January 4, 2019, 21:24  #41
Simbelmynė (Senior Member, joined May 2012, 313 posts)
Quote:
Originally Posted by alpha754293 View Post
The data comes from the benchmarking here.

(It's bandwidth. (TR1950X has higher single thread available memory bandwidth). Well, it's not bandwidth. It's because single cores can't leverage the bandwidth. Or it's how much bandwidth is available to the single core (which, per the Anandtech citation, they were measuring available bandwidth). Then it isn't bandwidth and it isn't because of the single core or the amount of memory that's available to that core. I'm waiting for someone to argue that it is because that they're fundamentally different architectures as the reason why there are differences, which still means that it's not bandwidth. This is my summary of the entire discussion. It is, but it isn't, but it is, but it isn't.)

Whatever ideas you have and whatever difficulties you see in measuring CPU-memory communication, this is simple to test practically.



Quote:
Originally Posted by alpha754293 View Post

If you're going to be running CFD, the thing that is going to be your most significant limiting factor will be how many cores you have and how fast (in floating point operations per second or FLOPS) it can perform.

Memory bandwidth has almost nothing to do with it.

Memory bandwidth is, in practice, very dependent on the number of populated channels on the motherboard. No need to change memory frequency (easy, right?). This does not affect the CPU FLOPS performance.
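For reference, the theoretical peak scales linearly with channel count: peak = transfer rate x bus width x channels. A quick sketch (the DDR4-2133 numbers are illustrative, not taken from this post):

```python
def peak_bandwidth_gbs(mt_per_s, channels, bus_bytes=8):
    """Theoretical peak: transfers/s x bytes per transfer x channel count."""
    return mt_per_s * bus_bytes * channels / 1000.0  # MT/s * bytes -> GB/s

# Illustrative: DDR4-2133 with a 64-bit (8-byte) bus per channel
for ch in (1, 2, 4):
    print(f"{ch} channel(s): {peak_bandwidth_gbs(2133, ch):.1f} GB/s")
```

So dropping from four populated channels to one cuts the theoretical peak by a factor of four before the CPU itself changes at all.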


So...



You can try this on one of your home cluster nodes by only using one memory module per CPU instead of 4. Run the benchmark on the node and use all available cores. Compare it to when you have all the channels correctly populated. Report back when you are done.


What is your explanation to the extremely poor performance of the bandwidth-capped system? Please refrain from talking about IB since you should not have used it during this test.

January 4, 2019, 23:48  #42
alpha754293 (Banned, joined May 2012, 111 posts)
Quote:
Originally Posted by Simbelmynė View Post
Whatever ideas you have and whatever difficulties you see in measuring CPU-memory communication, this is simple to test practically.

Memory bandwidth is, in practice, very dependent on the number of populated channels on the motherboard. No need to change memory frequency (easy, right?). This does not affect the CPU FLOPS performance.

So...

You can try this on one of your home cluster nodes by only using one memory module per CPU instead of 4. Run the benchmark on the node and use all available cores. Compare it to when you have all the channels correctly populated. Report back when you are done.


What is your explanation to the extremely poor performance of the bandwidth-capped system? Please refrain from talking about IB since you should not have used it during this test.
Really?

That is, quite possibly, one of the dumbest ideas I've read here.

First off, by taking out the DIMMs, not only are you reducing the number of independent channels, but you're also forcing the memory I/O operations and instructions to be executed onto a single DIMM.

And that leads me to the second point:



So, looking at this plot (where the x-axis is memory bandwidth in MB/s and the y-axis is latency in ns), can you explain the non-linear portion on the right part of the chart?

(Source: Gilbert and Rowland (August 2012) from Intel)

You will note two things about the above plot:
a) The measured (or available) memory bandwidth isn't just a straight, vertical line

and b) that as the memory bandwidth increases (linearly at the lower bandwidths, then non-linearly towards the right of the chart), the bandwidth stops increasing and the latency starts increasing.

Your proposed idea above will only exacerbate the condition on the right of the chart, which is known as queuing delay and is described by Little's Law.

(Sources: https://blog.flux7.com/blogs/benchmarks/littles-law, https://www.researchgate.net/publica...state_dynamics, http://www.cs.utah.edu/~udipi/mc_wmpi.pdf)
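To put numbers on the Little's Law point: the bandwidth a single core can sustain is bounded by (outstanding misses x line size) / latency, so a core with few miss slots cannot saturate the channels no matter how much bandwidth exists. A sketch with illustrative numbers (not measured from any system in this thread):

```python
def littles_law_bw_gbs(outstanding, latency_ns, line_bytes=64):
    """Sustainable bandwidth = bytes in flight / time each request spends
    in the memory system (L = lambda * W, rearranged)."""
    return outstanding * line_bytes / latency_ns  # bytes/ns == GB/s

# e.g. 10 outstanding 64-byte cache misses at 80 ns average latency:
print(littles_law_bw_gbs(10, 80))  # 8.0 GB/s, well under a multi-channel peak
```

As the latency term grows (the right side of that chart), the sustainable bandwidth falls even though the channels themselves haven't changed.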

I mean, surely you must have considered this in your written proposal above, right? Especially given how obvious a problem such a proposal would induce in the measurement system.

You DID consider this, correct?

Third, again, going back to the benchmark data that you pulled and shared, you argue that this CFD is memory bandwidth bound.

Yet, so far, you haven't been able to explain why the TR1950X, which has higher single-thread memory bandwidth than the 7940X, performed the run SLOWER than the 7940X.

You say that CFD is memory bandwidth bound. Here you have a data point, from your own citation, that shows a processor that has MORE available memory bandwidth, but still ran SLOWER than a processor that has LESS available memory bandwidth. How do you explain that? I thought you said that it's memory bandwidth bound? Therefore; why would the processor with MORE memory bandwidth run SLOWER than the processor with LESS memory bandwidth?

(Again, this is from the data that you cited. I'm not saying that. You cited the data. I'm just looking at the data and asking about it in deference to your original, initial statement. The data that you cite doesn't agree with the statement that you made. The clock speed difference accounts for 7.5% out of the 8.23% wall clock time difference. That still means that there is 0.73% difference in wall clock time that isn't explained nor accounted for. Are you trying to tell me that a processor with MORE memory bandwidth is able to be 0.73% SLOWER than a processor with LESS memory bandwidth? How can that be if you said that CFD was memory bandwidth limited? More available memory bandwidth should make it run faster, not slower, right? (Again, according to your original statement.))

I'll run your experiment after you provide an explanation regarding this data point that you cited.

Quote:
Originally Posted by Simbelmynė
This does not affect the CPU FLOPS performance.
Are you talking theoretically or practically?

(I love how you mix both within the same response.)

Because I can assure you that if you are trying to measure floating point operations per second using LINPACK, then the memory bandwidth will ABSOLUTELY affect the measured FLOPS performance.

If you've ever performed a benchmark akin to LINPACK (I used MATLAB), you will know that measured CPU FLOPS IS a function of the size m of the m x m square matrix in the linear algebra equation Ax=b.

Therefore, if you make it too small (such that it fits entirely into the L2 cache), your CPU FLOPS will be significantly lower than if the matrix utilized 50% of the available physical RAM.

Therefore, if you're using more RAM so that the m x m matrix is bigger, your memory subsystem's ability to allocate and address that memory will have an impact on the CPU's ability to execute its floating point operations.
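A minimal sketch of such a sweep, using NumPy's dense solver in place of MATLAB and the standard ~(2/3)m^3 flop count for an LU-based solve (the matrix sizes below are arbitrary choices):

```python
import time
import numpy as np

def solve_gflops(m):
    """Time a dense Ax=b solve and convert to GFLOPS using ~(2/3)m^3 flops."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((m, m))
    b = rng.standard_normal(m)
    t0 = time.perf_counter()
    np.linalg.solve(a, b)
    dt = time.perf_counter() - t0
    return (2.0 / 3.0) * m ** 3 / dt / 1e9

# Small matrices fit in cache; larger ones exercise the memory subsystem.
for m in (128, 512, 2048):
    print(m, round(solve_gflops(m), 2), "GFLOPS")
```

Running this over a range of m traces out the characteristic curve described below: low FLOPS at small sizes, rising toward the CPU's peak as the matrix grows.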

Again, I would have thought or assumed that you knew this already.

Here's what such a plot would characteristically look like:

(I couldn't find my plot for my Xeon E5-2690s, so I'll have to re-run it when I get the chance.)

But that's what it characteristically looks like. When the matrix sizes are small, it uses less memory, but the FLOPS performance is lower.

As the size of the matrix increases, so too, generally, does the FLOPS performance, and the increase is rarely linear. Admittedly, I usually don't run it with sufficient resolution (i.e. enough repetitions, with finer steps between matrix sizes) to capture the initial linear gain before it becomes non-linear. That would require fairly fine granularity in matrix size, and most of the time I'm not overly concerned with how well the CPU performs at small matrix sizes, because that doesn't show the CPU's peak FLOPS performance (which is the much more interesting number to know).

So, again, to recapitulate:
Regarding your last question: Little's Law can literally be part of the answer.

Even in a paper published by Clapp et al. (2015), presented at the 2015 IEEE Symposium on Workload Characterization, an application is defined as bandwidth bound if "bandwidth demand exceeds supply" (p. 222).

So far, no one here has been able to provide data that shows that the OpenFOAM motorbike benchmark, or OpenFOAM in general, or CFD in general actually has memory bandwidth demand that exceeds supply.

Without that data, I, personally won't call that workload "bandwidth bound".

What's even more interesting is that the proponents of the idea that CFD is bandwidth bound have, so far, not been able to distinguish between queuing delay and actually being bandwidth bound, a point that I've been trying to hammer on for two days now.

Again, it's amazing to me how, even in a technical forum, people are so ready and willing to make statements without presenting the technical data from which they draw their conclusions.

For example, when you see marginal returns in wall clock times once the number of cores exceeds the number of memory channels, how can you be certain that the cause ISN'T memory I/O queuing delay?

The memory channels have a certain amount of bandwidth available to you. How do you know (for a fact), that the reason for it is because the bandwidth demand has outstripped the bandwidth supply?

I would like to see/know how you might draw or infer such a conclusion from looking at the wall clock times WITHOUT looking at the actual peak memory transfer rates to ascertain whether the bandwidth demand has actually outstripped the bandwidth supply -- the very definition of a workload being "bandwidth bound" (per Clapp et al.).

Again, I would love to see the data that shows that -- the bandwidth demand outstripping the bandwidth supply, a.k.a. bandwidth bound.

Or for you to show that the marginal performance increases when the number of cores > number of memory channels - that it ISN'T due to Little's Law.

I would love to see both sets of data actually. And so far, no one has presented that.
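For what it's worth, once you do have both measurements, the check against that demand-vs-supply definition is mechanical. A sketch of the comparison (the 90% "saturation" cutoff is my own arbitrary threshold, not anything from Clapp et al.):

```python
def classify_workload(demand_gbs, supply_gbs, saturation=0.90):
    """Compare measured bandwidth demand against measured deliverable supply
    (e.g. supply from an MLC-style benchmark, demand from a counter monitor)."""
    if demand_gbs >= supply_gbs:
        return "bandwidth bound: demand exceeds supply"
    if demand_gbs >= saturation * supply_gbs:
        return "near saturation: queuing delay likely dominates"
    return "not bandwidth bound"

print(classify_workload(47.0, 46.0))
print(classify_workload(30.0, 46.0))
```

The point is that the label requires both numbers; wall clock times alone cannot produce either one.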

(Whereas I've shown you both -- a case (albeit not a CFD case) of what it looks like when bandwidth demand exceeds bandwidth supply (via the sparse/direct FEA example), and the arithmetic of what it looks like when you're latency bound (or, loosely, queuing-delay bound), via the IB latency and bandwidth benchmark results, which demonstrate, with data, the method for determining how and when a workload is latency/queuing-delay bound.)

January 5, 2019, 06:46  #43
Simbelmynė (Senior Member, joined May 2012, 313 posts)
Quote:
Originally Posted by alpha754293 View Post
Really?

That is, quite possibly, one of the dumbest ideas I've read here.

First off, by taking out the DIMMs, not only are you reducing the number of independent channels, but you're also forcing the memory I/O operations and instructions to be executed onto a single DIMM.


Yes? So you have turned around on your previous statement?



Quote:
Originally Posted by alpha754293 View Post
.

If you're going to be running CFD, the thing that is going to be your most significant limiting factor will be how many cores you have and how fast (in floating point operations per second or FLOPS) it can perform.

That's the true limit of CFD.

Memory bandwidth has almost nothing to do with it.



Quote:
Originally Posted by alpha754293 View Post
And that leads me to the second point:



So, looking at this plot (where the x-axis is memory bandwidth in MB/s and the y-axis is latency in ns), can you explain the non-linear portion on the right part of the chart?

(Source: Gilbert and Rowland (August 2012) from Intel)

You will note two things about the above plot:
a) The measured (or available) memory bandwidth isn't just a straight, vertical line

and b) that as the memory bandwidth increases (linearly at the lower bandwidths, then non-linearly towards the right of the chart), the bandwidth stops increasing and the latency starts increasing.


I guess you consider Intel product advertisement trustworthy, as well as something everyone should have read.


I'm pretty sure you have misread what's on the y-axis. It is more likely that it represents the performance index of the memory, sometimes expressed as (memory frequency) / (CAS latency).


It is known that memory latency affects bandwidth. It is also known that lower frequency memory has a higher demand for tight timings, which is clearly seen in your figure.



You can also see that the bandwidth is not affected by latencies above a certain point, while if you have very loose timings you will see a significant performance drop.


Here is an explanation of memory latency and how it has (not) evolved over the years:


https://eu.crucial.com/eur/en/memory...-speed-latency


Quote:
Originally Posted by alpha754293 View Post
Yet, so far, you haven't been able to explain why, the TR1950X has a higher single thread memory bandwidth than the 7940X, but performed the run SLOWER than the 7940X.

This has been explained. They were not bandwidth limited at 1 core. Different architectures, IPC and frequencies come into play.

January 7, 2019, 19:28  #44
me3840 (Senior Member, USA, joined Nov 2010, 1,126 posts)
Okay, I have completed a series of tests. My machine is as follows:


CPU: i7 5820k @ 3.6GHz, smt disabled, turbo disabled.
Memory: 4x16GB DDR4 2133MHz, dual-rank
OS: ubuntu 16.04
Software: OpenFOAM 5
Benchmark is as stated in the forum except the streamline post-processing was disabled and bind-to-core was enabled.


The OS is not as clean as a cluster node would be, but as I dismantled my cluster, this is all I have for now.



According to Intel this system can get 68 GB/s. I used the Intel MLC tool to find the actual bandwidth:
Code:
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios

ALL Reads        :    45941.24
3:1 Reads-Writes :    45671.16
2:1 Reads-Writes :    46142.14
1:1 Reads-Writes :    49261.67
Stream-triad like:    46777.92
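(As an aside, a summary table in that shape can be pulled into a script for later comparisons; a small sketch that assumes MLC's plain "label : value" text layout:)

```python
def parse_mlc_bandwidths(text):
    """Parse MLC's bandwidth summary lines into {label: MB/s}."""
    results = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")  # labels may contain ':'
        if not sep:
            continue
        try:
            results[label.strip()] = float(value)
        except ValueError:
            pass  # skip prose lines like the MB/sec definition
    return results

sample = """ALL Reads        :    45941.24
3:1 Reads-Writes :    45671.16
Stream-triad like:    46777.92"""
print(parse_mlc_bandwidths(sample))
```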
So it appears quite a bit lower than theory might suggest. I haven't looked into the reason for this yet. MLC can also give the memory latency as a function of bandwidth, seen here:







I then performed the benchmark from the benchmarking thread for 1-6 processes, both with bandwidth-monitoring software and without it. These results are compared to Jeggi's system in the benchmarking thread, which uses the same CPU (however, theirs is lower-clocked, uses Windows, and has half the memory, so there are differences).




To me the scaling is not that good. It's not apparent to me that the BW monitoring software affects the run significantly (as the runtimes drop, the effect of statistical error will rise anyway). My machine is faster than Jeggi's, as one would expect. This chip has only a 4-channel memory controller, so we might expect heavier losses after np=4, which I don't really see. Instead I just see poor scaling. Perhaps other folks might see it differently; I don't look at scaling plots often.

In order to see how OF is doing compared to what Intel's test states the chip can do, we need to determine what is the R/W ratio to memory that OF typically uses. We can see here it's about 3. I only calculated for channel 0, but another image clearly shows they're all about the same.




So considering this, we compare the measured maximum for 3:1 on the Intel test to what OF used during execution.




The different tests for np=1,2,3,... are easily visible in the plot. After 4, as expected, we don't see increases as significant as at lower nprocs. 4 and 5 are nearly the same, and 6 gets a minor bump. No case achieves what the Intel benchmark says, though this is understandable since the Intel benchmark isn't a real-world scenario.


How do we know that these bandwidth measurements are accurate? Well, we can run the PCM bandwidth monitor during the MLC run to see how their values compare. All 5 tests show pretty similar values, certainly to within 1000 MB/s or so:



It would be premature for me to draw any conclusions; I would like to look into the code scaling issues, as well as why the experimental bandwidth is much lower than what Intel states.

However, it's quite evident that bandwidth usage doesn't rise much once the memory channels are saturated. OpenFOAM may not be using the channels as efficiently as MLC can, but it very certainly is being constrained by the bandwidth it can utilize for this problem set. There is some rise in latency, but I don't have any way to say that the rise in latency causes some drop in performance without additional tests. Certainly FLOPS are not constraining the problem; otherwise better scaling would have been realized.
Attached images: channel_ratio.png, latency_vs_bandwidth.png, OF_scaling.png, total_bandwidth.png, mlc_vs_pcm.png

January 8, 2019, 00:46  #45
me3840 (Senior Member, USA, joined Nov 2010, 1,126 posts)
I had one more idea for a plot to make. In this case, the memory bandwidth is scaled in the same manner as we do runtime, so that the runtime and the BW can be plotted together.




Again we can easily see the drop in bandwidth after np=4. The code seems to scale better than the bandwidth might suggest. The loss in performance for np<=4 looks indicative of some other problem to investigate. The difference between mean and max bandwidth is tiny at this scale. Since my system is not exactly what one would call a production system, np=6 may be flaky considering the other processes running (gdm, compiz, firefox, openvpn, etc.), which would not be present in a real environment. If I have time I'll prepare a production installation on a separate disk; right now I'm not getting new cluster hardware yet.
Attached images: OF_BW_scaling.png

January 8, 2019, 03:13  #46
alpha754293 (Banned, joined May 2012, 111 posts)
Quote:
Originally Posted by me3840 View Post
*snip*

Quote:
Originally Posted by me3840 View Post
*snip*
Good work, me3840!

Yeah, I'm not really sure why your all-reads bandwidth results seem a tad low. Did you have other stuff running in the background while MLC was running? (I'm just trying to think of possible root causes.)

At around 46 GB/s out of 68 GB/s, that's only about 67% of the theoretical peak bandwidth.

I know that for my system, running DDR3-1333 memory, the peak bandwidth across all four channels is 41.6 GB/s. At most, I'm able to get 38.2 GB/s out of that, which is about ~92% of peak, so that's pretty good. I know that some people might only expect to hit 70% of the theoretical peak bandwidth when testing with programs like MLC, so I'm not really sure what happened or why your system wasn't able to reach that level.
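The efficiency arithmetic in both cases is just measured over theoretical peak (the numbers below restate the figures from this post):

```python
def bandwidth_efficiency_pct(measured_gbs, theoretical_gbs):
    """Percent of theoretical peak bandwidth actually achieved."""
    return 100.0 * measured_gbs / theoretical_gbs

print(round(bandwidth_efficiency_pct(46.0, 68.0), 1))   # me3840's system
print(round(bandwidth_efficiency_pct(38.2, 41.6), 1))   # my E5-2690 boxes
```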

The other thing that was interesting was your bandwidth vs. latency plot and how spread out it is. I'm also not really sure what happened there.

(You will see my plot later on below.)

Your read/write ratio plot from PCM is interesting. The mean might appear to hover around a ratio of 3:1, but I think your standard deviation is also about +/- 1, and that's quite significant (although there doesn't seem to be any difference in the bandwidth results).

Your total bandwidth utilization plot is also interesting.

I was trying to do a little research into PCM earlier today, and I wasn't able to find out whether it can measure latency during runtime the way it measures bandwidth during runtime.

That would also be interesting to see.

It is also interesting to see the PCM bandwidth plot at run time, because you can clearly see that it doesn't peg the bandwidth constantly.

(In my tests, whose results you'll see below, you can sometimes subjectively hear that in the CPU fan spinning up and down!)

This is great data though!

Thank you!

Here are the results of my tests:

I can't seem to find one of the links that I had put in one of my posts due to some BS.

Anyways, here is the link again:
OpenFOAM on Ubuntu 18-04-1 LTS on HP Z420 Xeon E5-2690 v1 DDR3-1333 test

System configuration stats:
HP Z420 Workstation
Number of sockets: 1
Processor: Intel Xeon E5-2690 (v1) (8-core, 2.9 GHz stock, 3.3 GHz all core turbo, 3.6 GHz max turbo, HTT disabled)
RAM: Micron Technologies MT36JSZF1G72PZ-1G4D1DD
Memory type: 8 GB DDR3-1333 PC3-10600R-9-10-J0 ECC Registered 2Rx4
SSD: Dell 100 GB SSD
OS: Ubuntu 18.04.1 LTS
OpenFOAM version: CFD Direct OpenFoam v5

I have two workstations like this that are identical, so I was able to run my tests in half the total time by leapfrogging them over each other.

Having said that, I had to drop down to DDR3-1333 2Rx4 memory modules because when all 8 DIMM slots are populated with my DDR3-1600 4Rx4, the system will drop the speed down to DDR3-800, so I didn't want that to be a factor.

The HP BIOS also wasn't super clear in terms of the setting that would have disabled TurboBoost, so unfortunately, there is this to consider in the results as well. (The Supermicro BIOS is better than these OEM BIOSes. Much more straightforward.)

So, first:
MLC results:


So the HP Z420 workstations have 8 DIMM slots and 4 independent memory channels that can run at up to DDR3-1600 speeds. But as I mentioned, I'm running at DDR3-1333 speeds.

You can see that when you change the number of channels, the latency goes way up and the bandwidth drops way down. You will also see that the two are almost proportional. You will also see (though it might be a wee bit difficult to see) that whether there is one DIMM per channel or two doesn't seem to affect the bandwidth/latency proportionality.

Below, you'll see the meshing times as a function of the number of cores that the task was performed on/with, the memory configuration, and also whether or not I set the processor affinity for that task.


And here are the solution times, again as a function of the number of cores the task was performed with, the memory configuration, and whether or not I set the processor affinity for that task.


In this case, setting the processor affinity was sometimes a little more difficult because I was doing it manually: I would run 'ps aux | grep simple', manually extract the PIDs, and then set the affinity using taskset. I'm sure that if I found a way to automate this, the results might be a bit better, but it is what it is for now. I'm not going to worry about it too much.
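For what it's worth, the manual ps/taskset step could be scripted. A rough sketch (the 'simple' process-name pattern and the round-robin core assignment are assumptions about this particular setup):

```python
import subprocess

def affinity_commands(pids, cores):
    """One 'taskset -cp CORE PID' command per PID, cores assigned round-robin."""
    return [["taskset", "-cp", str(cores[i % len(cores)]), str(pid)]
            for i, pid in enumerate(pids)]

def pin_matching(pattern="simple", cores=(0, 1, 2, 3)):
    # pgrep -f replaces the manual 'ps aux | grep simple' + copy/paste step
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True).stdout
    pids = [int(p) for p in out.split()]
    for cmd in affinity_commands(pids, list(cores)):
        subprocess.run(cmd, check=False)

print(affinity_commands([4242, 4243], [0, 1]))
```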

With meshing, you can see that the time scales by 53.5% going from 1 core/1 channel to 2 cores/2 channels (again, regardless of DIMMs per channel), and then by 72.8% going from 2 cores/2 channels to 4 cores/4 channels. However, beyond that, going from 4 cores/4 channels to 8 cores/4 channels, it still manages to scale by 39.8%, which isn't bad considering you might otherwise expect it to be starved for memory bandwidth since there are no additional memory channels; at 39.8%, that's only about 13.7% off the gain from 1 core/1 channel to 2 cores/2 channels.

So I don't think that's a significant limiting factor.

The solve times definitely paints a different picture.

Going from 1 core, 1 channel to 2 cores, 2 channels represents a 92.5% improvement in solve time. Going from 2 cores, 2 channels to 4 cores, 4 channels represents an additional 89.3% improvement in solve time. However, going from 4 cores, 4 channels to 8 cores, 4 channels, only added an additional 24.9% solve time improvement.

However, when you look at the relative performance against the number of DIMMs and the number of channels, as shown below, even though the 8-core run with only 4 channels of memory improved the solution time by only ~25%, it also gains the greatest performance increase relative to running the same number of cores with just 1, 2, or 4 channels of memory.



In other words, while the number of channels definitely matters more during the solve than during meshing, a) the solution time is still ultimately fastest with 8 cores and only 4 channels of memory rather than 4 cores and 4 channels, and b) meshing is still significantly and consistently slower to complete than the actual solution.

And in order for me to get the meshing times, I had to keep running the meshing over and over again, despite the fact that it takes so long.

The other thing I wish I knew how to do is profile these two solve cases with MPI statistics (given that the parallelization here is written with MPI, i.e. distributed rather than shared-memory parallel processing); that would be interesting to me.

More specifically, I would want to duplicate the profiling efforts from these slides (if I knew how to do that):






I would be curious to see how much time is spent in MPI_Wait() as the core count goes up.

It's a pity that running this test again with 8 channels of memory available would be vastly more complicated, because I would have to move to one of my dual-socket nodes, which brings the whole second socket and its memory controller and channels into the picture. But I might try it.

Again, all of the raw data is made publicly available on that Google Spreadsheet.

I tried to make the plots there too, but it wouldn't let me select the cell or type in the name for the various series, so I ended up making the plots in Excel instead.

Thanks.

January 8, 2019, 22:58  #47
alpha754293 (Banned, joined May 2012, 111 posts)
So...for shits and giggles, I finished setting up my two HP Z420 workstations and tied them together into a cluster.

(I think that I've got the entire process now down to a science.)

As a result, I was able to perform another run, but clustered, where I am using 8 processors across the two workstations/nodes so that I will have access to 8 channels of memory.

The meshing time was on the order of 883 seconds, so about the same as 8 cores with 1 DIMM and 1 memory channel.

The solve time was 249.87 seconds, which is still an improvement, though the gigabit Ethernet interconnect may be part of the reason the improvement isn't larger.

Running 16 cores across both machines (with 8 channels of memory, but also the gigabit Ethernet interconnect) resulted in a meshing time of around 630-640 seconds and a solve time of 242.36 seconds.

Again, not great, but still a speedup nevertheless.

And that results in a relative performance plot (by memory configuration) like this:


I will also note that I had top running on both nodes during the runs to monitor CPU usage (because I needed to make sure that MPI was working properly between the two nodes), and most of the CPU time consumed was system CPU time rather than user CPU time.

I will also note, as me3840 plotted, that the way the OpenFOAM solver runs, it's very "spiky". Rather than having a steady CPU load (like most of the other major, commercial CFD solvers), OpenFOAM would ramp up and down like crazy during the solve process.

So, if you decompose the domain enough, you will eventually hit a point where it isn't the memory bandwidth (or the lack of available bandwidth) that's the problem; it's the amount of MPI communication that has to take place to keep the solution coherent that starts to introduce significant overhead, especially when you consider how many pair-wise core-to-core communication paths there can be between any two cores.
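(As an aside, that pair-wise path count is easy to quantify: among n ranks there are n(n-1)/2 distinct pairs, so it grows quadratically with the decomposition. A quick illustration, noting that a real domain decomposition only exchanges halos with neighbouring sub-domains, so actual traffic is sparser than this worst case:)

```python
# Worst-case number of distinct core-to-core communicating pairs
# among n MPI ranks: n * (n - 1) / 2, i.e. quadratic growth.
def pairwise_paths(n_ranks: int) -> int:
    """Count of distinct unordered rank pairs."""
    return n_ranks * (n_ranks - 1) // 2

for n in (2, 4, 8, 16, 32):
    print(f"{n:>3} ranks -> {pairwise_paths(n):>4} possible pairs")
```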

As I said, I wish I knew how to profile the MPI program to replicate the work performed and published by the HPC Advisory Council, so that I could put more data behind that statement (increasing MPI overhead as the number of cores increases), but unfortunately I don't know how to do that.

If someone here knows how, I can re-run my tests again whilst profiling it.

This has been fun/interesting.

Thanks.
Attached Images
File Type: png relative performance by memory configuration 2nodes.PNG (33.2 KB, 70 views)
alpha754293 is offline   Reply With Quote

Old   January 10, 2019, 09:23
Default
  #48
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
I am of the opinion that the root cause of the slowdown as the number of solver processes increases actually ISN'T the number of memory channels and/or memory bandwidth, but inefficiencies in the parallel scaling.

The MLC benchmarks, compared with the PCM results, show that there is plenty of memory bandwidth headroom for OpenFOAM to operate and execute in. Moreover, the "pulse"-like shape of the memory bandwidth usage (PCM bandwidth vs. MLC bandwidth) shows that for almost half of the time, OpenFOAM isn't even sustaining the bandwidth it uses during the pulses, let alone all of the bandwidth that's available.

And this doesn't even take into consideration the fact that the maximum memory bandwidth that can be measured is always LOWER than the theoretical maximum the system is supposed to achieve (i.e. it measures the "available" memory bandwidth), and OpenFOAM, as demonstrated, doesn't even reach full utilization of that AVAILABLE bandwidth as the number of solver processes increases.

Again, to me, that suggests that communication or coherency/synchronization inefficiencies between the solver processes are the root cause of the slowdown as the number of solver processes increases.

I think it's easier for people to assume it's a memory bandwidth/channel problem and then "forget" about MPI communication/coherency/synchronization overhead, which grows rapidly (the number of possible pair-wise communication paths grows quadratically) as the number of solver processes increases. That increase in MPI communication/solver overhead results in increased latency, which slows the run down such that, even with plenty of memory bandwidth headroom available, the solver is unable to make full use of it.
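(A toy model makes the shape of that argument concrete: treat total runtime as compute work that shrinks with rank count plus communication/synchronization overhead that grows with it. The constants below are made up purely for illustration and are not fitted to any OpenFOAM run; the point is only that speedup saturates and then declines even with no bandwidth limit in the model at all.)

```python
# Toy scaling model (NOT OpenFOAM's actual cost function):
#   runtime(n) = serial part + parallel part / n + per-rank comm overhead * n
# All constants are invented for illustration.
def runtime(n_ranks, t_serial=5.0, t_parallel=400.0, t_comm_per_rank=0.5):
    return t_serial + t_parallel / n_ranks + t_comm_per_rank * n_ranks

for n in (1, 2, 4, 8, 16, 32, 64):
    speedup = runtime(1) / runtime(n)
    print(f"{n:>3} ranks: {runtime(n):7.2f} s, speedup {speedup:5.2f}x")
# In this model the speedup peaks near sqrt(t_parallel / t_comm_per_rank)
# ranks and then degrades as the overhead term dominates.
```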
alpha754293 is offline   Reply With Quote

Old   January 10, 2019, 12:00
Default
  #49
Senior Member
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 1,957
Rep Power: 30
flotus1 will become famous soon enoughflotus1 will become famous soon enough
We went through this already. There is a simple way to rule out parallelization overhead as the root cause of poor scaling. You have all the information you need, you just keep ignoring it to feed your narrative.
__________________
Please do not send me CFD-related questions via PM
flotus1 is offline   Reply With Quote

Old   January 10, 2019, 12:30
Default
  #50
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by flotus1 View Post
We went through this already. There is a simple way to rule out parallelization overhead as the root cause of poor scaling. You have all the information you need, you just keep ignoring it to feed your narrative.
Quote:
Originally Posted by flotus1
All right, I'm going to say it since nobody else seems to dare: a lot of the stuff you write makes no sense. What are you trying to achieve here, why this fight against windmills with long-winded messages? Honing your argumentation skills?
We are not dealing with data volume, but with data transfers aka amount of data per unit of time.
MPI is aware of inter-node and intra-node communication.
Hybrid MPI/OpenMP schemes are not a workaround for shitty MPI implementations. Their goal is to increase scaling of a code, e.g. by reducing the communication overhead of MPI+domain decomposition with small sub-domains.
That's all from me, I'm out of this discussion.
That is such BULLSHIT.

You haven't gone through this AT ALL.

You have NOT "ruled out parallelization overhead as the root cause of poor scaling".

"You have all the information you need, you just keep ignoring it to feed your narrative."
Yes, and the data and the evidence PROVE that.

Quote:
Originally Posted by me3840

According to Intel this system can get 68 GB/s. I used the Intel MLC tool to find the actual bandwidth:
Code:
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
Code:
ALL Reads        :    45941.24
3:1 Reads-Writes :    45671.16
2:1 Reads-Writes :    46142.14
1:1 Reads-Writes :    49261.67
Stream-triad like:    46777.92
So it appears quite a bit lower than theory might suggest. I haven't looked into reasoning for this yet.
The plots that tell you that are these:


So despite the claims about OpenFOAM being memory bandwidth limited, not only can it NOT hit the theoretical 68 GB/s that his system should be able to hit (and okay, we know there are differences between the theoretical numbers and the "real world" numbers, i.e. the AVAILABLE memory bandwidth), it isn't even able to hit that available figure. His testing shows that on 6 cores, the results have a mean of around 38 GB/s, which is 8-11 GB/s SHORT of the AVAILABLE memory bandwidth. Now why is THAT?

You said that it can be easily explained, and yet, EVERY SINGLE TIME I've asked, NO ONE has been able to provide an explanation as to why OpenFOAM, as you increase the number of MPI processes, isn't able to hit even the AVAILABLE memory bandwidth, let alone the theoretical maximum.

The plot above that I have cited here CLEARLY shows that there is plenty of headroom available that OpenFOAM COULD be using, but isn't.

You guys keep arguing that OpenFOAM is limited by memory bandwidth when, all evidence to the contrary, there is PLENTY of memory bandwidth headroom that OpenFOAM is NOT using, which therefore shows that it is NOT limited by memory bandwidth: it isn't even making full use of the available memory bandwidth that the systems and the various architectures afford it (as evidenced by the PCM plot during runtime vs. the available bandwidth vs. the theoretical bandwidth limit). If the gap is somewhere between 8-11 GB/s, that represents roughly 17-22% of the AVAILABLE (not theoretical) memory bandwidth that OpenFOAM isn't using.

And none of the advocates for the idea that OpenFOAM is a memory-bandwidth-limited application have been able to explain this, given all the times and ways that I've asked.

THERE is your evidence, right there.

That would be like saying the speed limit on the interstate is 70 mph, you're doing 45, and then arguing that the reason you can't go any faster is that the speed limit is too low, DESPITE the fact that you're not even driving AT the speed limit.

This is the data that you proclaim and profess proves your point that OpenFOAM is being constrained by memory bandwidth, and yet the data and the plot LITERALLY show you the EXACT OPPOSITE of that.

The memory bandwidth is like the highway speed limit of 70 mph. The data shows that you're only doing 45. And you're arguing that the speed limit on the highway isn't high enough or there aren't enough lanes on the highway for you to go any faster, DESPITE the fact that the evidence proves to everybody that you aren't even doing the limit.

Again, you argue that it can be easily explained, and yet, ever since me3840 published his data, none of you have provided any sort of commentary or explanation regarding it.

So if it can be so easily explained, why haven't any of you done so already?
Attached Images
File Type: png total_bandwidth edited.png (42.4 KB, 86 views)

Last edited by alpha754293; January 10, 2019 at 21:03.
alpha754293 is offline   Reply With Quote

Old   January 12, 2019, 01:35
Default
  #51
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
An additional plot:


Looking at this, when you have 8 cores, 8 DIMMs, and 8 channels of memory, you are performing at 6.65x the performance of 1 core with 8 DIMMs, and 8 channels of memory.

When you increase that to 16 cores, 16 DIMMs, and 16 channels of memory, using only GbE as the interconnect, you can only achieve 11.74x the performance of 1 core with 8 DIMMs and 8 channels of memory.

When you use 4x EDR IB instead (100 Gbps) as your interconnect, that relative performance only increases to 12.79x.

Relative to the amount of improvement you COULD be getting, you're only at about 0.8 efficiency (i.e. you're only really able to use about 12 of the 16 cores effectively, despite giving it 16 DIMMs in 16 channels of memory, tied together with a high-speed, ultra-low-latency system interconnect).
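(The "about 0.8" figure is just the measured speedup divided by the ideal speedup, i.e. the core count. Re-deriving it from the speedup numbers quoted above:)

```python
# Parallel efficiency = measured speedup / ideal speedup (core count).
# Speedups below are the ones quoted above, relative to the 1-core run.
def efficiency(speedup: float, n_cores: int) -> float:
    return speedup / n_cores

runs = {
    "8 cores, 8 channels":           (6.65, 8),    # 6.65 / 8  = 0.83
    "16 cores, 16 channels, GbE":    (11.74, 16),  # 11.74 / 16 = 0.73
    "16 cores, 16 channels, EDR IB": (12.79, 16),  # 12.79 / 16 = 0.80
}
for label, (speedup, n_cores) in runs.items():
    print(f"{label}: efficiency {efficiency(speedup, n_cores):.2f}")
```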

Ultimately, of course, you get the best and fastest performance with 32 cores, 16 DIMMs, and 16 channels of memory because the speed-up ratio (against 1 core, 8 DIMMs, and 8 channels of memory) is about 16.1x, and yet, despite giving it all of that available bandwidth, OpenFOAM still can't make full use of it.

However, if you monitor how the solver runs using top, and change the default delay from 3 to 1, and then show the realtime stats for the individual processors rather than for the system as a whole, you will see the actual CPU usage ranges from about 70% to about ~90-ish%.

This suggests that this benchmark is relatively ineffective for benchmarking larger systems, because the number of calculations it needs to perform can be completed by the processors of larger systems in less than one second, which means that for the rest of the time, the CPU utilization is clocked as system CPU time rather than user CPU time.

me3840's PCM plots provide a little deeper insight into this via the pulse-like nature of the memory bandwidth usage, especially the relative width of the pulses and the period (or frequency) of a full on-pulse/off-pulse cycle.

With 32 cores, 16 DIMMs, and 16 channels of RAM, connected over 4x EDR IB (with just two nodes), I can solve the benchmark in 61.14 seconds.

The data shows that this OpenFOAM benchmark does NOT make full use of the available memory bandwidth. In fact, the memory bandwidth utilization is quite poor: 38 GB/s out of a 68 GB/s theoretical max (only ~56%), and 38 GB/s out of ~48 +/- 7 GB/s measured, which works out to roughly 69-93% of the available memory bandwidth.
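(Those fractions follow directly from the three numbers above: ~38 GB/s observed mean, 68 GB/s theoretical peak, and ~48 +/- 7 GB/s measured available bandwidth from MLC:)

```python
# Re-deriving the utilisation fractions from the numbers quoted above.
observed = 38.0       # GB/s, mean bandwidth during the solve (PCM)
theoretical = 68.0    # GB/s, platform theoretical peak
available_lo = 48.0 - 7.0   # 41 GB/s, lower end of the MLC measurement
available_hi = 48.0 + 7.0   # 55 GB/s, upper end of the MLC measurement

print(f"vs theoretical: {observed / theoretical:.1%}")   # 55.9%
print(f"vs available:   {observed / available_hi:.1%} "
      f"to {observed / available_lo:.1%}")               # 69.1% to 92.7%
```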

Literally, this translates to a highway with a speed limit of 68 mph. When you test it, it can get up to 48 +/- 7 mph. But in practice, it only averages about 38 mph on that 68 mph highway, and claiming it's memory bandwidth intensive when it can't even hit what's available is like saying the speed limit is too low when you can't even reach the available limit.

And so far, nobody has been able to offer an ACTUAL explanation (not just SAYING that there is an easy explanation) as to why it can't hit the 48 mph available limit (and I'm completely ignoring the 68 mph limit).

The test shows that the highway can do 48 mph that's available to you.

But the test of the actual "car" also shows that, on average, it can only do 38 mph (give or take), and no one here has been able to explain why. (Hint: it's certainly not the highway's fault. The highway has a speed limit of 68 mph. Tests on the highway show that you can do 48 +/- 7 mph. The 'car' can't even do that.)
Attached Images
File Type: png 32-core speedup ratio.PNG (24.4 KB, 51 views)
alpha754293 is offline   Reply With Quote

Old   January 12, 2019, 05:33
Default
  #52
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 313
Rep Power: 9
Simbelmynė is on a distinguished road
1) No one ever said that CFD is memory bandwidth limited for 1 core.


2) A better car analogy would be: mainland Europe allows higher speeds on the highway than the US (and you believe the translation between the two different ways of measuring should be 1:1). Then you try to understand why the Europeans are not going as fast as your American car while you are driving there. When the police show up, you go "what? I am going at the speed limit, I just don't understand why I am 60% faster than the other cars".


3) I understand that you are very interested in this and I think that this could have been a nice discussion, however you are sometimes very disrespectful and this is also why only a handful reply to whatever you write.
Simbelmynė is offline   Reply With Quote

Old   January 12, 2019, 22:28
Default
  #53
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by Simbelmynė View Post
1) No one ever said that CFD is memory bandwidth limited for 1 core.
Neither did I.

But to the proponents of the claim that "CFD is a memory bandwidth limited application" (note: CFD, NOT just OpenFOAM): the data here shows that the systems have plenty of AVAILABLE memory bandwidth, and OpenFOAM doesn't come even REMOTELY close to using what's available, let alone what's theoretically possible.

The normalisation back to single-core performance makes it easier for people to see that increasing the number of DIMMs installed (i.e. slots) and memory channels (e.g. 8 DIMMs in 8 memory channels) while running 8 cores only yields a 6.65x speed-up.

Therefore, if the proponents' argument is that CFD is a memory bandwidth limited application: there is PLENTY of memory bandwidth available when you're running only 8 cores with 8 DIMMs installed in 8 memory channels, and yet, again, the DATA shows that there is PLENTY of bandwidth/headroom available that CFD, and more specifically OpenFOAM, ISN'T using. And as for all the people who have argued that there is a simple explanation for it, or that it has already been explained: I've gone back and re-read all of the "explanations", and NONE of them cover why this is the case, i.e. why CFD, and again more specifically OpenFOAM, doesn't take advantage of the TOTAL available memory bandwidth, leaving a SIGNIFICANT portion as unused headroom.

There is literally NO explanation that has been provided in regards to this so far.

So how can a program/application (area) be memory bandwidth limited if it doesn't use, to its fullest potential, all of the available memory bandwidth that any system/architecture affords it?

Again, NO one has an answer/explanation in regards to that.


Quote:
Originally Posted by Simbelmynė
2) A better car analogy would be; mainland Europe allows higher speeds on the highway compared to the US (because you believe translation between two different ways of measuring should be 1:1). Then you try to understand why Europeans are not going as fast as your American car when you are there driving. When the police shows up you just "what? I am going at the speed limit, I just don't understand why I am 60% faster than the other cars".
That whole paragraph depends GREATLY on where you are driving in continental Europe.

If you're driving between cities on the German Autobahn, the advisory limit outside of cities is 130 km/h, but there are stretches where that's only a recommendation and not a hard limit.

Conversely, if you're driving, for example, on the M4 just west of London, the 70 mph limit is a hard limit, and there are sections where it's stupid because they put up speed cameras and temporarily drop the limit to around 50 mph for seemingly NO APPARENT reason.

So it depends. We never got around to driving the Calais to Pierrefonds stretch in France, so I can't tell you what the speed limits are like there.

Regardless, I'm not sure if you actually read that or interpreted that correctly.

That would be like saying the speed limit on the M4 is 112 km/h (70 mph), the traffic around you is only able to max out at around 76 km/h, you personally are only able to max out at around 62 km/h, and you're blaming the M4's 112 km/h (70 mph) limit for being too low.

So contrary to your statements, you're not going faster than the rest of traffic. The REST of traffic is able to go faster than you, and you can't, but you're blaming the speed limit on the M4 as the reason why you aren't able to go any faster.

This is what the evidence shows.

Re-read the analogy again. Nowhere in that analogy is ANYBODY exceeding the speed limit of the highway, so I have no idea what part of your body that never sees the light of day, you're pulling that statement out from.

Quote:
Originally Posted by Simbelmynė
3) I understand that you are very interested in this and I think that this could have been a nice discussion, however you are sometimes very disrespectful and this is also why only a handful reply to whatever you write.
BWAHAHAHAHAA.......

It's not my problem nor my fault that I have LITERALLY told you guys that CFD is NOT a memory bandwidth limited application and yet you guys INSIST that it is.

Like I said, before me3840 and I used MLC and PCM to collect the actual data/statistics, I was using my nodal interconnect as a surrogate for measuring the throughput (and the total volume of data) exchanged during the course of a run (because that was before I learned about PCM). Now that me3840 has collected the actual throughput data using PCM, we are all able to see (this was the whole point of having a DATA-driven discussion, which those who find me "disrespectful" failed to provide to substantiate their claims) that the theoretical bandwidth is greater than the available bandwidth, which in turn is greater than the ACTUAL bandwidth that CFD, and OpenFOAM specifically, uses.

And NONE of the original proponents of that idea have ANY explanation as to why CFD (OpenFOAM specifically) isn't able to take full advantage of the available memory bandwidth, let alone the theoretical max.

Absent an explanation, I again refer back to my original statement and rebuttal that CFD (OpenFOAM specifically) is NOT memory bandwidth limited. The supporting evidence is literally demonstrated with data (which, again, those who find me "disrespectful" completely failed to provide) and is summarized in just one plot, this:


If you're going to make a statement, next time make sure that you can actually perform and execute the tests necessary to collect the data you will need to support it.

Again, so far, the available memory bandwidth is HIGHER (by quite a margin, too) than the memory bandwidth that CFD/OpenFOAM (specifically) uses during the course of a run.

If it were memory bandwidth limited, I would have EXPECTED the demand for memory bandwidth to exceed the supply, but that is clearly not the case here; ergo, as I have said, it is NOT memory bandwidth limited.

If you can explain why CFD/OpenFOAM (specifically) doesn't use all of the memory bandwidth that's made available to it (relative to the available memory bandwidth), such that you can show that the demand for memory bandwidth outstrips the supply, then I would be far more inclined to agree with you.

However, as it stands, none of the proponents of the original idea have been able to justify their views/opinions/hypotheses in light of the data published by me3840. This is why it's so important to have a data-driven discussion. It's not my fault nor my problem that a certain taxonomy of people get so easily offended when they're unable to present their case/argument with data in a technical forum.

(Again, and oh by the way, that graph/plot isn't even my data. It's me3840's, so a lot of credit goes to him for running PCM to collect it. You say that only a few people comment on this thread because of MY being disrespectful, and yet that isn't even my data. If you make a statement, you should be prepared to provide evidence in support of it. The end runtimes tell you nothing about the memory bandwidth demand/utilization during the course of a run. These ARE the facts, whether you like it or not. You're just pissed because there is finally someone willing to challenge you on your claim and assertion, and again, I have done so with data/evidence that has been publicly shared, so there is full and complete transparency.)
alpha754293 is offline   Reply With Quote

Old   January 13, 2019, 05:26
Default
  #54
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 313
Rep Power: 9
Simbelmynė is on a distinguished road
Quote:
Originally Posted by alpha754293 View Post
Neither did I.
And NONE of the original proponents of that idea have ANY explanation as to why CFD (OpenFOAM specifically) isn't able to take full advantage of the available memory bandwidth, let alone the theoretical max.

And what is your explanation as to why the MLC does not take full advantage of the theoretical max?
Simbelmynė is offline   Reply With Quote

Old   January 13, 2019, 05:36
Default
  #55
Senior Member
 
flotus1's Avatar
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 1,957
Rep Power: 30
flotus1 will become famous soon enoughflotus1 will become famous soon enough
Excuse me? I have a fairly good understanding of why most real codes can not saturate the measured maximum memory bandwidth of a system.
And it is a well-established fact that the measured maximum memory bandwidth has to be lower than the theoretical maximum bandwidth.
Stop it with the accusations. You are not talking to idiots here. Do not confuse my unwillingness to enter into a real discussion with you specifically with a lack of expertise.
__________________
Please do not send me CFD-related questions via PM
flotus1 is offline   Reply With Quote

Old   January 13, 2019, 11:17
Default
  #56
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by Simbelmynė View Post
And what is your explanation as to why the MLC does not take full advantage of the theoretical max?
me3840 actually tells you that in his posts containing his data:

Quote:
Originally Posted by me3840
The OS is not as clean as a cluster node would be, but as I dismantled my cluster, this is all I have for now.

*snip*

For me to draw any conclusions I think would be premature; I would like to look into the code scaling issues as well as why the experimental bandwidth is much lower than what Intel states.

However it's quite evident that the bandwidth usage after the number of memory channels is saturated doesn't rise very much. OpenFOAM may not be using the channels as efficiently as the MLC can, but it very certainly is being constrained by the bandwidth it can utilize with its problem set. There is some rise in latency, but I don't have any way to say that the rise in latency causes some drop in performance without additional tests. Certainly FLOPs are not constraining the problem, otherwise better scaling would have been realized.

*snip*

Again we can easily see the drop in bandwidth after np=4. The code seems to scale better than the bandwidth might suggest. The loss in performance for np<=4 looks indicative of some other problem to investigate. The difference between mean and max bandwidth is tiny at this scale. Since my system is not exactly what one would call production software, np=6 may be flaky considering the other processes running (gdm, compiz, firefox, openvpn etc) which would not be present in a real environment. If I have time I'll prepare a production installation on a separate disk, as right now I'm not getting new cluster hardware yet.
If he is performing the test on a machine that he normally uses (which he states he is), the other processes that are running can interfere with what the MLC reports.

His answer to your question is given in the second-to-last sentence of my citation of his comments above.

Quote:
Originally Posted by flotus1 View Post
Excuse me? I have a fairly good understanding of why most real codes can not saturate the measured maximum memory bandwidth of a system.
And it is a well-established fact that the measured maximum memory bandwidth has to be lower than the theoretical maximum bandwidth.
Stop it with the accusations. You are not talking to idiots here. Do not confuse my unwillingness to enter into a real discussion with you specifically with a lack of expertise.
And despite that, you have not been able to offer an explanation, based on your knowledge, experience, and understanding, for this:


Therefore, you can make whatever claims about yourself you want, but until and unless you present the raw data for the collective to study and analyze, your statements are meaningless; they aren't backed up and supported with data/evidence, which you haven't provided.

This has been my second point about this entire discussion.

Proponents of this idea, such as yourself, have failed to produce, publish, and present the raw data for everybody to stare at and analyze.

me3840 and I, so far, are the only ones here who have actually done so.

You can make whatever claims you want, but until and unless you support your claims with data, it'll just be philosophy or a theory.

Further, I am not arguing that the program should be able to hit the THEORETICAL memory bandwidth limits. The papers I cited previously actually describe in detail (especially Chai's dissertation) some of the root causes that lead to that.

And you can argue that no "real" code will be able to achieve the maximum MEASURED (read: available) memory bandwidth, EXCEPT for the ones that do.

If this benchmark doesn't, and if you look at both the results from PCM and top, as I have already discussed, the benchmark is quite possibly insufficient for measuring the performance of these higher-performing systems, because the total number of floating-point operations it performs is LESS than the number of floating-point operations the processor cores can perform per second. Borrowing a term from the storage industry, it's "short stroking": it finishes what it needs to do before the time is up. (Which in a way is great, but it leads to inefficiencies because it doesn't make full use of the processor's compute capabilities.)

And if you monitor top, you see that. The CPU utilization reported back to the OS kernel is 100%, but the breakdown is that only ~70-90% of that utilization is time running un-niced user processes, while the remainder is time running kernel processes.
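(A sketch of one non-interactive way to capture that user-vs-kernel split per core, assuming the sysstat package is installed; the log filename is arbitrary:)

```shell
# Sample every core ("-P ALL") once per second for 60 samples and log it,
# instead of watching top interactively:
mpstat -P ALL 1 60 > cpu_split.log
# The columns of interest in the output are %usr (un-niced user time) and
# %sys (kernel time). A high %sys share during the solve is consistent with
# time spent in the kernel/network stack rather than in floating-point work.
```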

Compare and contrast that with the output from top below:


There you see that almost all of the cores are spending 100% of their time on user processes, and only two of the cores in this snapshot show kernel-process time of 1% (and even then, only momentarily).

So...you talk about how much you know.

You talk about how CFD (or more specifically OpenFOAM, and even more specifically this specific benchmark) is memory bandwidth intensive.

And yet, none of you have talked about how or why this benchmark, OpenFOAM, or CFD in general has such a high kernel-process (system) time compared to other CFD codes or cases (even within the same CFD code).

Below is a snapshot from top while Fluent is running a 13.5-million-cell polyhedral, transient, DES simulation:


The statement you guys made is that CFD (not OpenFOAM, not the OpenFOAM motorbike benchmark) is memory intensive.

And as this case shows (running on 32 DMP/MPP processes across two nodes, with an entirely different solver), it has a MUCH higher proportion of time spent running user processes compared to time spent running kernel processes, which is the exact opposite of the motorbike benchmark running with OpenFOAM. That may or may not be indicative of OpenFOAM in general, and it is DEFINITELY NOT indicative of CFD in general.

The data and the evidence shows that.

And yet, you, and the others that keep trying to promulgate that idea, do so, without having presented any data or evidence here (at this level of study/research) that actually supports your claims and theories.

So you say that you know how it works, but you don't present any data demonstrating that, and even less about the details of how OpenFOAM (and this motorbike benchmark, far more specifically) runs.

You say you know a lot about it, and yet there has literally been zero information coming from you and/or your "camp" about top, time running un-niced user processes, time running kernel processes, or the fact that the motorbike benchmark now runs so fast that you can't even hit 100% user CPU utilization.

So you want to say that it's a "memory bandwidth" issue or that CFD is a memory bandwidth intensive application and yet, NONE of the data or evidence provided here or above actually supports that claim or theory.

And you haven't presented either an explanation for the data we ARE seeing, or your OWN data that WOULD support your claim/theory.

If you know so much, then it should be very easy for you to show and share the raw data from the research you performed in the past that led you to the conclusion that CFD is a memory-intensive application.

I mean, surely, you must have performed some investigative research that gave you the data which led you to this conclusion, otherwise, why would you say that if you never saw the data in support of it?

If you don't want to show your data, that's fine. That's your choice. But if you can't explain (since you profess and proclaim to know a lot about this subject) why we are seeing what we are seeing, then possessing that knowledge is completely useless and irrelevant here. Why, for example, can't the user CPU utilization hit 100%? Is the motorbike problem "too small" for these modern systems, which have not only higher core counts but also significantly faster cores?

How would you separate a memory bandwidth issue from the possibility that this motorbike benchmark is simply too small for these newer, bigger, faster systems?

Surely, you must be able to use and leverage your experience, expertise, and knowledge to help explain this. (which again, to date, you haven't)

I've told you that I know nothing about how OpenFOAM works or how it runs. All I can tell you, with the methods and tools that I know of and have at my disposal, is what I am seeing as the end user. And this is what I'm seeing. When your average user CPU utilization is only ~80%, I don't think that's a memory bandwidth problem at all. The question in my head is: why is it ONLY ~80% user CPU utilization? (Which, again, is usually indicative of the problem being too small, as I've already commented previously.)

None of you has even broached this possibility. It's just "CFD is a memory bandwidth intensive application. Period."

So what good is it to have all that knowledge about how it works if you can't explain something as apparent as this?
Attached Images
File Type: png 100 percent CPU user utilization.PNG (27.5 KB, 33 views)
File Type: png Fluent CPU user utilization.PNG (178.4 KB, 34 views)
alpha754293 is offline   Reply With Quote

Old   January 13, 2019, 11:50
Default
  #57
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 313
Rep Power: 9
Simbelmynė is on a distinguished road
Quote:
Originally Posted by alpha754293 View Post
me3840 actually tells you that in his posts containing his data:

If he is performing the test on a machine that he normally uses, which he states, the other processes that are running can interfere with what the MLC reports.

His answer to your question is given in the second last sentence in my above citation of his comments/remarks.


So basically, you demand that we provide answers to why OF is not reaching the bandwidth recorded in the synthetic MLC. But when the MLC does not manage to reach the theoretical limit then "bad setup" is to blame?
Simbelmynė is offline   Reply With Quote

Old   January 13, 2019, 22:54
Default
  #58
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by Simbelmynė View Post
So basically, you demand that we provide answers to why OF is not reaching the bandwidth recorded in the synthetic MLC. But when the MLC does not manage to reach the theoretical limit then "bad setup" is to blame?
MLC is separate from PCM.

And if you read the results carefully, you will also see that MLC tests the different fundamental/root/core operations (like measuring FLOPS performance using LINPACK, which is entirely based on Gaussian elimination with partial pivoting) -- what next, are you going to say that's synthetic as well, despite the fact that this is what CPUs DO?
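For reference, here is what "Gaussian elimination with partial pivoting" means in practice. This is a toy Python sketch of the algorithm LINPACK is built around, for illustration only, not the actual LINPACK code:

```python
# Toy sketch of Gaussian elimination with partial pivoting: factor the
# system in place, swapping in the largest remaining pivot at each step,
# then back-substitute. LINPACK's dense-solve kernels do this (in Fortran,
# blocked and optimized); this plain-Python version only shows the idea.

def solve(A, b):
    """Solve A x = b. A is a list of row lists, b a list; both are modified."""
    n = len(A)
    for k in range(n):
        # Partial pivoting: bring the row with the largest |A[i][k]| to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from all rows below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```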

Again, you're just pissed that the tools we have available are the very tools authored, published, and utilized, you know, by the people who MANUFACTURE the processors, the systems, the architectures, etc.; these are the same tools that they use to develop those systems.

Are you going to argue that, because we are using the same tools that they use, Intel doesn't know what they're doing? (And the same goes for AMD, as I'm certain they have their own versions of these tools for testing and collecting data about their designs to make sure they beat Intel. Intel has literally published that they used these tools to help develop their processors, systems, and architectures.)

By that logic, your entire claim that CFD is memory bandwidth intensive is equally BS, because the motorbike benchmark is just as synthetic: no actual CFD case is ever exactly like it, and therefore it is just as useless (and synthetic) as the very tools used to develop the processors/systems/architectures that you're trying to test.

Is that REALLY the claim that you're trying to hang your hat on now that we have presented ACTUAL data that contradicts your claim and you can't prove your claim anymore IN THE PRESENCE OF ACTUAL DATA?

Really? Is that the logic path that you want to go down?

Because I can just as easily take your logic and apply it directly to the motorbike benchmark and argue that your claim is invalid, because the results are valid ONLY for the motorbike benchmark and NOT for any other run/simulation/analysis that ISN'T the motorbike benchmark.

Really? That's the logic path that you want to go down?

This is how people, once they have been presented with data that contradict their world views/mental model, try to defend and hang on to their world view/mental model that has just been shattered/proven to be false, again, IN THE PRESENCE OF ACTUAL DATA.

PCM measures runtime data.

We know what the theoretical limits of the memory bandwidth of the processor/system/architecture are.

And with MLC, we are also able to measure how the bandwidth depends on the way the processor is used; we have data that tells us how that processor/system/architecture responds to a given kind of workload.

Again, that's like testing systems with LINPACK (which, BTW, is a linear algebra solver, just as the GAMG solver called via fvSolution is) and saying that LINPACK is irrelevant, despite it also being a linear algebra solver, because it's not the same one that OpenFOAM uses.

You argued that CFD in general (again, not OpenFOAM specifically, and not even this motorbike benchmark specifically) is a memory intensive application.

You argue that MLC produces synthetic results DESPITE the fact that you can LITERALLY hand-calculate the theoretical maximum bandwidth, a calculation that even a 10-year-old in 4th or 5th grade of primary school should be able to complete.
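To spell that hand calculation out: theoretical peak bandwidth is just channels x transfer rate x 8 bytes per 64-bit transfer. The machine below (quad-channel DDR4-2666) is a hypothetical example, not any specific system from this thread:

```python
# Theoretical peak memory bandwidth = channels * MT/s * 8 bytes per transfer
# (each DDR channel has a 64-bit data bus). Hypothetical example machine.
channels = 4                 # quad-channel
mega_transfers = 2666        # DDR4-2666 runs at 2666 MT/s
bytes_per_transfer = 8       # 64 bits = 8 bytes
peak_gb_s = channels * mega_transfers * bytes_per_transfer / 1000
print(peak_gb_s)             # about 85.3 GB/s
```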

You continue to argue your antiquated mental model/world view even though PCM measures more memory bandwidth in use as the number of cores increases. That alone (even without MLC or the super simple hand calculation) tells you the maximum that the system CAN use, and that it only reaches that maximum when more cores are used, not before. You would think that, if the application were memory bandwidth limited, the 4-core run would already be at the cap; yet the 6-core run uses more memory bandwidth than the 4-core run, and you have provided NO explanation as to why the 4-core run can't consume the bandwidth that the 6-core run demands.
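The logic of that point fits in a few lines of Python. The bandwidth figures below are invented placeholders (NOT me3840's PCM numbers); the point is only the comparison: a run that measures well below the ceiling that other runs reach was not bandwidth-capped.

```python
# Sketch: decide which runs were sitting at the memory bandwidth ceiling by
# comparing each run's measured bandwidth (e.g. from Intel PCM) against the
# highest bandwidth any core count achieved on the same machine.
# All numbers below are invented for illustration.

def bandwidth_capped(bw_by_cores, tol=0.05):
    """Return the core counts whose runs sat within tol of the observed ceiling."""
    ceiling = max(bw_by_cores.values())
    return {c for c, bw in bw_by_cores.items() if bw >= ceiling * (1 - tol)}

measured = {4: 31.0, 6: 42.0, 8: 43.5}   # GB/s, hypothetical readings
print(bandwidth_capped(measured))         # the 4-core run is NOT at the ceiling
```

If the 4-core run were bandwidth limited, it would show up in the capped set alongside the larger runs; in this (invented) data it doesn't, which is exactly the pattern being argued about above.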

So even within your own argument/rebuttal, it's inconsistent because again, IN THE PRESENCE OF ACTUAL DATA, you are no longer able to defend your claim and explain why this is the case.

If this is as memory bandwidth hungry an application as you profess it to be, then it should ALWAYS be trying to consume that level of memory bandwidth, because ultimately the processors HAVE to be able to perform and complete that amount of work/tasks/operations/instructions.

(This is further substantiated by results from top, if you've ever cared to monitor it during the course of a run, and, as my early surrogate, by ethtool statistics. I've literally cross-validated the results using AT LEAST four different tools, with four entirely different methods, to verify my claims, and I've published all of that data here for everybody to see, including yourself, so that you can conduct your own analysis of the raw data. To date, I haven't seen any data from you that's even remotely close to being this detailed/in-depth in support of your own claims.)

Quote:
Originally Posted by Simbelmynė View Post
So basically, you demand that we provide answers to why OF is not reaching the bandwidth recorded in the synthetic MLC. But when the MLC does not manage to reach the theoretical limit then "bad setup" is to blame?
Secondly, I'm not sure what the root cause is of the lack of comprehension here.

I don't/didn't make the claim that the reason MLC can't hit the theoretical bandwidth is the system setup.

me3840 did.

Did you even READ that part or did you just completely skip over/ignore that fact as well? (And you wonder why you were lumped in with a certain demographic...)

So, how about you show a little respect to me by NOT putting words in my mouth that I didn't say. I cited.

There is a difference.

Third: I don't demand anything.

I've asked you, time and time again, to present data for a discussion centered around data, and the level of data that you have presented is literally insufficient to support the conclusion that you are promulgating.

Again, this is supported by the data that me3840, NOT ME, has provided. (I want to make sure that's perfectly clear, since there appears to be some confusion about that.)

You don't HAVE to provide data. And you haven't.

And therefore, in the absence of your data and in the presence of me3840's and mine, we are able to collect and present data and ask questions that disprove your theory/hypothesis, which again, was my VERY first comment to you:

I disagree with you.

Except now, we literally have the data to prove it and you haven't been able to explain your hypothesis IN THE PRESENCE OF ACTUAL DATA.

And that's fine. You don't have to.

All that means is that the data we have provided and presented here, which you can't explain your way out of, set against the absence of any data from you, lets us prove, with data, my very first response to you: I disagree with your statement and assessment that CFD is a memory bandwidth intensive application.

You don't have to provide any data that you don't want to. I'm in no position to make demands from anybody, for anything.

If you don't want to do it, there is literally nothing I can do to make it a "demand": you can always choose to provide no data. That's entirely up to you, as you've proven, acted upon, and demonstrated.

But if you can't, or won't, provide a logical, scientific, rational explanation for the data that we are seeing and presenting, then again, to make it perfectly clear since there might be some confusion here: you don't have to.

All that means is that, with the data that has been published and presented here for the community to stare at, we are able to disprove your theory/hypothesis/statement/sentiment. And we aren't making that statement by pulling it out from the regions of the body that never see the light of day; we are arriving at this conclusion through rigorous, thorough testing of the hypothesis by way of the scientific method.

You come up with a hypothesis, devise ways and methods and processes/procedures to test the hypothesis, and then analyze the data to see if it supports or doesn't support your hypothesis.

And in this case, the data supports my position, stated in my very first reply: I disagree with your statement and sentiment that CFD is a memory bandwidth intensive application.

Like I said, we literally have the data to prove it, data for which you presently have no counterargument that explains why we are seeing what we are seeing.

(And end scene.)

Sources:
Behrens, T. (2009). "OpenFOAM's basic solvers for linear systems of equations" (retrieved from http://www.tfd.chalmers.se/~hani/kur...report-fin.pdf on January 13, 2019).

Culpo, M. (2011). "Current Bottlenecks in the Scalability of OpenFOAM on Massively Parallel Clusters" (retrieved from http://www.prace-ri.eu/IMG/pdf/Curre...Clusters-2.pdf on January 13, 2019).

*edit*
BTW, I would also think that your "CFD is a memory bandwidth intensive application" theory should be very easy to test with this motorbike benchmark.

Again, since I am completely new and a complete n00b to OpenFOAM, if I understand it correctly, you can make the problem 8 times more difficult simply by doubling the number of cells in each direction by changing line 34
of bench_template/basecase/system/blockMeshDict from:
Code:
hex (0 1 2 3 4 5 6 7) (40 16 16) simpleGrading (1 1 1)
to:

Code:
hex (0 1 2 3 4 5 6 7) (80 32 32) simpleGrading (1 1 1)
(Again, that's if I understood how it works/how it is setup.)
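If I've read blockMeshDict correctly, the arithmetic behind "8 times more difficult" is just the cell counts from those two lines:

```python
# Cell counts implied by the two blockMeshDict lines above:
# (nx ny nz) cells in the background hex block.
coarse = 40 * 16 * 16   # original  (40 16 16)
fine = 80 * 32 * 32     # doubled   (80 32 32)
print(coarse, fine, fine // coarse)   # doubling each direction gives 2**3 = 8x the cells
```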

And then you can re-run the whole thing and using PCM, measure the memory bandwidth that it needs during the course of this run.

I have no evidence from you to tell whether you've even thought about performing a simple test like this, or whether you've considered the nature of the problem given the data we are presently seeing. For all I know, you could be performing this test already and I wouldn't even know it, in which case I look forward to seeing your results.
alpha754293 is offline   Reply With Quote

Old   January 14, 2019, 02:01
Default
  #59
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 313
Rep Power: 9
Simbelmynė is on a distinguished road
Quote:
Originally Posted by alpha754293 View Post

By that logic, your entire claim that CFD is memory bandwidth intensive is equally BS



I have never claimed CFD is memory bandwidth intensive. CFD is memory bandwidth limited.


You on the other hand claimed that the number of cores on a CPU is the most important metric to opt for when designing a CFD workstation.
Simbelmynė is offline   Reply With Quote

Old   January 14, 2019, 15:32
Default
  #60
Banned
 
Join Date: May 2012
Posts: 111
Rep Power: 0
alpha754293 is on a distinguished road
Quote:
Originally Posted by Simbelmynė View Post
I have never claimed CFD is memory bandwidth intensive. CFD is memory bandwidth limited.
Is there any REAL difference between the two?

Afterall, your first post/response was:

Quote:
Originally Posted by Simbelmynė
If you are going to do CFD simulations then you should opt for the following (in this particular order)


1. Memory bandwidth
2. Memory bandwidth
3. Memory bandwidth


This means that you wish to run a system with A) many memory channels and B) High memory frequency.
What's the difference between being memory bandwidth intensive vs. being memory bandwidth limited?

You're still talking about a situation where the demand for memory bandwidth exceeds the available supply of memory bandwidth. Let's just assume that I'm an idiot. Therefore, perhaps you can school my ass on what the difference is.
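For what it's worth, there is a standard way to make "memory bandwidth limited" precise: the roofline model. A kernel is bandwidth limited when its arithmetic intensity (FLOPs per byte moved) times the peak bandwidth falls below the machine's peak FLOPS, and compute limited otherwise. The machine numbers and intensity values below are invented for illustration:

```python
# Roofline-style check with invented numbers: a kernel is memory bandwidth
# limited when its attainable FLOP rate is set by bandwidth * intensity
# rather than by the CPU's peak FLOP rate.

def bound(intensity, peak_flops, peak_bw):
    """intensity in FLOPs/byte, peak_flops in GFLOP/s, peak_bw in GB/s."""
    attainable = min(peak_flops, peak_bw * intensity)
    limit = "bandwidth" if peak_bw * intensity < peak_flops else "compute"
    return attainable, limit

# Invented machine: 500 GFLOP/s peak compute, 85 GB/s peak bandwidth.
# Sparse linear algebra kernels of the kind CFD solvers lean on are often
# quoted at well under 1 FLOP/byte; dense kernels can be far above it.
print(bound(0.25, 500.0, 85.0))   # low intensity -> bandwidth limited
print(bound(10.0, 500.0, 85.0))   # high intensity -> compute limited
```

Under this definition a code can be bandwidth *intensive* (it moves a lot of data) without being bandwidth *limited* on a given machine; the two coincide only when the intensity falls below the machine's FLOPS-to-bandwidth ratio.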

Now, I'm going to state here my assumption that you've read Culpo's paper, where he defines it in the context of workload characterization (i.e. the theme of his paper, which was presented at the Symposium on Workload Characterization).

Further, it's absolutely adorable how you just keep clinging to your world view/mental model/etc. despite the fact that we now have evidence to the contrary.

The fact, for example, that the 6-core run that me3840 performed uses more memory bandwidth than the 4-core run literally TELLS you something. Even if you reject the available-bandwidth results from MLC (a tool that I learned from, and that is used and deployed by, Intel), and even if you reject the "back-of-the-envelope" theoretical maximum bandwidth calculation that, like I said, any 10- or 11-year-old in 4th or 5th grade should be able to do, the 6-core run (as measured with PCM) still TELLS you that it is able to use more memory bandwidth than the 4-core run.

Therefore, the mere fact that the 4-core run ISN'T capped at the same level of memory bandwidth usage TELLS you that it isn't limited by memory bandwidth, contrary to your claims.

So like I said, I think that it is absolutely ADORABLE that DESPITE the facts and DESPITE the evidence/data, you still want to hang on to your old world view/mental model.

You know....people from the "old world" used to think that the world was flat until somebody decided to take a boat trip.

But...nevermind that, right, Simbelmynė?

Quote:
Originally Posted by Simbelmynė
You on the other hand claimed that the number of cores on a CPU is the most important metric to opt for when designing a CFD workstation.
Yes.

See Figure 5 from Culpo's paper.

Also cf. Culpo's conclusions regarding the scalability of OpenFOAM on massively parallel clusters. I'd like to draw your attention to the conclusions he arrives at, based on the documentation of his work and study in the remainder of the paper.

So, if you want to disagree with that assessment and those conclusions, you are now taking it up with Intel's hardware engineers, system engineers, and architects; AMD's hardware engineers, system engineers, and architects; the HPC Advisory Council; and all of the other people whom I've cited throughout this exchange.

Feel free to send an email to any and/or all of those people whom I've cited (most of their contact information is made publicly available in the respective works); you are more than welcome to tell them that they're all wrong, according to you.

I don't mind that you disagree with all these people. Again, as I've said, you're free to take it up with them.

I think that you should tell all these people whom I've cited that they're all wrong.

I've got no skin in the game.

In fact, I strongly urge and encourage you to do that.

Let me know how that goes/works out for you.

You bring the theater, I'll bring dinner.

P.S.
You may think that I'm an idiot. I don't really care. But that also means you think that Intel, AMD, HPCAC, etc. are all idiots, because these are the people whom I've cited throughout the course of this exchange/discussion. So...good luck with that.

So...if you think THEY'RE idiots too (because that's where I picked up/learned/obtained my information from), then at the very least I know that I'm just as much of an idiot as those guys. And I'm totally, 100% okay and fine with that.

I don't have a problem with that.

So by all means, please disagree with Intel, AMD, HPC Advisory Council, etc. some more - a.k.a. my sources of information/the people where and whom I learned this stuff from.

If you think that I'm wrong, then that means that you also think that they're wrong as well.

Go nuts. Knock yourself out disagreeing with Intel/AMD/HPC Advisory Council, etc.

Have fun with that. I'm sure that you'll have a blast!
alpha754293 is offline   Reply With Quote
