Benchmark fpmem

March 2, 2022, 05:59   #1
ErikAdr (Erik Andresen), Member, Denmark

The STREAM benchmark tests memory bandwidth, even though it performs floating point operations. In STREAM, the number of floating point operations never exceeds the number of loads. This is likely also the case for many CFD programs, but not for higher-order solvers based on Cartesian grids: for such solvers the ratio of floating point operations to loads can be much larger.
Optimizing HPC code is often about minimizing reads from memory. The work can be split into smaller chunks, with as much work as possible done on each chunk before the next chunk of memory is processed. The question is what the right chunk size is.

I have made a benchmark, fpmem, that reports floating point performance for various combinations of floating point operations per load and the size of the arrays processed. The benchmark doesn't do any real work, but it can be compiled, linked and run in about 5 minutes.
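
To give an idea of the kind of loop being timed, here is a rough sketch (my illustration only, not the actual fpmem kernel): it streams through two arrays and does a configurable number of fused multiply-adds per pair of elements loaded, with one store for every two loads.
Code:
// Illustrative sketch only -- not the actual fpmem kernel.
// Two loads per iteration, r fused multiply-adds (2*r FLOPs), one store,
// so the FLOPs/load ratio is roughly r.
#include <cstddef>
#include <vector>

double kernel(const std::vector<double>& a, std::vector<double>& b, int r)
{
    const double s = 1.000001;              // arbitrary multiplier
    for (std::size_t i = 0; i < a.size(); ++i) {
        double x = a[i];                    // load 1
        double y = b[i];                    // load 2
        for (int k = 0; k < r; ++k)
            y = y * s + x;                  // one fmadd = 2 FLOPs
        b[i] = y;                           // one store per two loads
    }
    return b.empty() ? 0.0 : b[0];          // keep the result live
}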

The instructions for compiling, linking and running the benchmark are given in the first few lines of the source file. It requires a recent C++ compiler (-std=c++17) and MPI. It uses AVX2 when compiled with -D_USE_INTRINSIC; see the instructions.
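
For example (the authoritative instructions are in the header of fpmem.c; the flags below just mirror what is used in the runs posted further down in this thread, and the two numeric arguments are the same ones used there):
Code:
# compile and link with the AVX2 path enabled
mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -o fpmem fpmem.c
# run with one MPI rank per physical core, e.g. 6 ranks on an i5-12600
mpirun -np 6 ./fpmem 30 24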

I hope some of you will take the time to run the benchmark and post the results. It is meant to run on a single node; on a large cluster the performance would just scale linearly with the number of CPUs. I don't have access to EPYC Milan or the newer Xeons on socket LGA4189, so results from those would be very interesting to me.

I have attached the benchmark (fpmem.c) and the results for my newly built system with an Intel i5-12600.


Edit: I have uploaded a new version that corrects an error which affected the reported performance values by up to about 10%.
Attached Files
File Type: txt i5_12600.txt (1.5 KB, 14 views)
File Type: c fpmem.C (13.2 KB, 3 views)

Last edited by ErikAdr; March 3, 2022 at 05:55.

March 2, 2022, 07:17   #2
flotus1 (Alex), Super Moderator, Germany

Here are some results from my system (2x Epyc 7551). First one with all 64 threads, second with only 8 threads pinned to the first CCD. The latter makes it behave like a single 1st-gen Ryzen CPU with very low clock speeds and extremely crappy memory.
7551_a.txt
Code:
compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 64 ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

                            Performance (Gflops) using 64 processes with AVX2
FLOPs/load:        0.50       1         2         4         8        16        32        64  
Array size
       8kB:      316.30    319.22    515.52    917.54   1050.67   1116.92    892.12    659.82
      16kB:      320.53    319.74    518.03    918.43   1050.43   1114.73    892.49    660.28
      32kB:      273.02    295.91    512.90    900.98   1043.95   1116.21    892.29    660.26
      64kB:      277.78    295.18    512.84    903.64   1043.42   1116.13    892.17    659.73
     128kB:      272.64    296.74    509.96    897.51   1040.41   1116.53    893.17    659.66
     256kB:      228.56    291.10    486.24    862.42   1021.68   1110.92    893.14    659.84
     512kB:      138.33    271.48    460.28    831.43    999.17   1096.32    892.43    659.86
       1MB:      110.29    170.01    312.09    673.73    898.32   1051.14    888.12    658.28
       2MB:       12.53     25.13     50.02    125.73    227.88    425.46    831.72    656.62
       4MB:       11.63     23.28     46.65    116.91    211.14    398.37    765.78    656.24
       8MB:       11.63     23.29     46.59    116.60    210.99    398.59    767.66    655.67
      16MB:       11.64     23.33     46.78    116.84    211.22    398.66    768.35    655.37
      32MB:       11.69     23.47     47.00    117.33    211.95    400.49    765.00    655.42
      64MB:       11.72     23.61     47.29    118.19    213.55    406.43    752.34    652.60
7551_b.txt
Code:
compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 8 --bind-to core --rank-by core --map-by core ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

                            Performance (Gflops) using 8 processes with AVX2
FLOPs/load:        0.50       1         2         4         8        16        32        64  
Array size
       8kB:       37.58     38.02     61.04    110.03    125.33    134.04    108.10     80.87
      16kB:       38.58     39.24     63.00    110.46    126.50    134.51    108.40     80.88
      32kB:       38.08     38.47     62.44    110.13    124.67    134.28    107.54     80.63
      64kB:       38.22     38.61     62.78    110.24    124.63    134.33    107.81     80.78
     128kB:       35.09     38.83     62.52    110.25    125.04    134.35    108.12     80.78
     256kB:       32.88     39.08     62.60    110.24    125.85    135.03    107.83     80.72
     512kB:       17.93     38.64     62.25    108.93    125.67    135.34    108.25     80.75
       1MB:       13.75     22.19     38.58     86.55    113.44    131.61    107.92     80.62
       2MB:        1.57      3.14      6.25     15.77     28.78     53.64    103.39     80.47
       4MB:        1.46      2.90      5.81     14.63     26.32     49.65     96.09     80.49
       8MB:        1.45      2.90      5.82     14.62     26.34     49.57     96.13     80.47
      16MB:        1.46      2.91      5.83     14.64     26.45     49.66     96.57     80.35
      32MB:        1.46      2.92      5.83     14.65     26.37     49.83     96.01     80.51
      64MB:        1.47      2.97      5.99     15.05     27.31     52.29     92.34     79.90
I will leave the interpretation up to you.

Last edited by flotus1; March 2, 2022 at 16:34.

March 2, 2022, 09:26   #3
ErikAdr (Erik Andresen), Member, Denmark

Quote:
Originally Posted by flotus1
Here are some results from my system (2x Epyc 7551). First one with all 64 threads, second with only 8 threads pinned to the first CCD. The latter makes it behave like a single 1st-gen Ryzen CPU with very low clock speeds and extremely crappy memory.
Attachment 88587
Attachment 88588
I will leave the interpretation up to you.

Compared to the i5-12600, it looks like the two systems have about the same ratio between computational performance (with AVX2) and memory bandwidth. The Zen core has one fmadd 'engine' (8 FLOPs/cycle), whereas Zen 2, Zen 3 and newer Intel cores have two such 'engines' and do 16 FLOPs/cycle with AVX2. This shows up especially for the small arrays that fit within the first-level cache. The Zen cores deliver about half the performance of the Intel cores at the same clock speed, but the 7551 has a lot of cores. Looking at the line for 64MB arrays, both systems are memory bound up to about 32 FLOPs/load: the performance numbers roughly double each time the FLOPs/load ratio doubles, so the data supplied by the memory is the limiting factor. At 64 FLOPs/load both systems are CPU bound. Thanks for running the benchmark!
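
A compact way to state that reasoning (my own sketch, not part of fpmem): in the memory-bound regime the Gflops should scale linearly with the FLOPs/load ratio, capped by the compute ceiling. Counting only the 8-byte load traffic, a rough estimate looks like this; the two parameters are placeholders to be replaced with measured values.
Code:
// Rough roofline sketch (illustration only, not part of fpmem).
// Assumes 8-byte double loads; write traffic is ignored for simplicity.
#include <algorithm>
#include <cstdio>

int main()
{
    const double peak_gflops = 500.0;   // placeholder: your CPU's compute ceiling
    const double mem_bw_gbs  = 100.0;   // placeholder: your sustained RAM bandwidth
    const double bytes_per_load = 8.0;

    for (double r = 0.5; r <= 64.0; r *= 2.0) {
        double memory_bound = mem_bw_gbs / bytes_per_load * r;
        double predicted    = std::min(peak_gflops, memory_bound);
        std::printf("FLOPs/load %5.1f -> ~%6.1f Gflops\n", r, predicted);
    }
    return 0;
}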

Last edited by ErikAdr; March 2, 2022 at 16:25.

March 2, 2022, 14:04   #4
Simbelmynë, Senior Member

Another benchmark? Count me in!

Not sure what it means but here you go:

Code:
System: Ryzen 3700X, 16GB DDR4 SingleRank @ 3200 MT/s, GCC 9.3, Ubuntu 20.04
 
                            Performance (Gflops) using 8 processes with AVX2

FLOPs/load:        0.50       1         2         4         8        16        32        64  

Array size
       8kB:       71.31    118.02    164.55    318.22    368.69    373.22    286.99    203.55

      16kB:       59.24    121.15    163.89    315.79    367.57    376.81    287.23    202.66

      32kB:       62.78    105.93    160.60    278.80    351.75    358.89    286.54    203.30

      64kB:       62.71    105.87    163.70    280.15    354.65    372.24    287.88    203.20

     128kB:       62.05    103.28    159.84    275.72    350.16    369.65    286.98    203.37

     256kB:       46.90     95.48    149.06    273.69    343.32    364.09    284.49    203.07

     512kB:       35.50     74.42    142.31    237.92    334.94    356.40    283.07    202.85

       1MB:       35.69     70.13    128.29    261.14    331.96    352.53    282.27    202.63

       2MB:       15.54     14.64     43.74    131.91    181.58    271.45    277.96    201.78

       4MB:        1.80      3.60      7.16     18.21     31.90     61.04    117.54    199.16

       8MB:        1.75      3.50      7.10     18.51     32.45     60.25    115.02    199.27

      16MB:        1.75      3.49      7.00     17.69     31.56     60.00    115.75    199.15

      32MB:        1.76      3.55      6.98     17.62     31.79     59.44    115.03    199.75

      64MB:        1.78      3.51      7.09     17.72     32.07     61.32    119.82    191.47
Code:
System: 2 x Xeon E5-2673v3, 128 GB DDR4 Dual rank @2133 MT/s, GCC 8.3, Debian 10 

                            Performance (Gflops) using 24 processes with AVX2

FLOPs/load:        0.50       1         2         4         8        16        32        64  

Array size
       8kB:      181.82    227.95    370.67    544.75    592.74    594.93    479.71    373.98

      16kB:      220.25    237.19    400.54    552.96    609.18    592.86    482.14    374.02

      32kB:       63.79    125.82    224.20    484.24    540.32    585.78    337.38    374.26

      64kB:       63.68    124.11    238.68    475.64    594.16    584.14    479.93    374.18

     128kB:       55.34     93.03    189.14    392.51    542.51    579.75    481.11    374.07

     256kB:       34.44     66.66    134.06    301.36    463.48    575.17    481.45    374.03

     512kB:       33.57     64.77    129.26    287.09    448.37    571.43    482.26    373.94

       1MB:       31.43     57.68    112.56    260.22    413.58    552.70    478.22    373.60

       2MB:        4.98     10.13     20.27     50.92     92.27    173.84    338.16    370.65

       4MB:        4.73      9.46     18.85     47.05     84.63    159.23    303.62    370.16

       8MB:        4.69      9.39     18.77     46.72     83.82    158.12    300.67    369.71

      16MB:        4.65      9.34     18.64     46.53     83.17    157.03    299.76    369.34

      32MB:        4.64      9.23     18.42     45.97     82.80    155.93    298.16    369.09

      64MB:        4.63      9.20     18.39     45.90     82.48    155.30    297.76    369.05

March 2, 2022, 16:19   #5
ErikAdr (Erik Andresen), Member, Denmark

I can understand that the results need some explanation. I approached this in steps. First I ran STREAM with different array sizes to test the cache and memory bandwidths. Then I looked at floating point performance in cases with several floating point operations for each load. In HPC, some problems have a low FLOPs/load ratio and others a very high one. For CFD the ratio is usually low, but for a multiplication of two large matrices it is very high. For small ratios the performance is limited by memory bandwidth, and for high ratios it is limited by the CPU's ability to crunch numbers. My interest is typically in the intermediate range, say ratios from 4 to 64, where it is not obvious what limits the performance.

I don't know how to include a text file inline, but I have attached the results for the i5-12600 again; please have a look. In the first column, for a ratio of 0.5, the performance is highest for very small arrays. Arrays of 8kB and 16kB fit within the 1st-level cache, which is the fastest cache. At 32kB the performance is lower, since the 1st-level cache is a little too small and the bandwidth starts to be limited by the slower 2nd-level cache. From 64kB to 256kB the performance is nearly constant and determined by the bandwidth of the 2nd-level cache. For larger arrays the bandwidth of the 3rd-level cache starts to play a role, and from array sizes of 8MB and up the performance is limited by the bandwidth of the RAM. Every figure in the first column is thus determined by the bandwidth of the memory level that can contain the arrays. The calculation involves two equal-sized arrays; the size given in the table is for each array.


The rightmost column, for a ratio of 64, is much easier to interpret. Here all performance figures are about the same, independent of the array size: the performance is determined solely by the CPU's ability to crunch numbers.


For most of the intermediate columns the performance is roughly constant for the smaller array sizes, where the CPU is the limiting factor, but at some point the memory level that holds the larger arrays gets too slow, and the performance shifts to being limited by memory bandwidth. The benchmark shows where this happens!
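
As a small illustration of that last point (my own sketch, not part of the benchmark), one can take a single column of the table and locate where the performance falls off. Using the FLOPs/load = 8 column from the 64-process Epyc 7551 results above:
Code:
// Hypothetical helper (not part of fpmem): scan one Gflops column, ordered by
// increasing array size, and report where performance drops below 80% of the
// smallest-array value. Data: FLOPs/load = 8 column of the 64-process Epyc 7551 run.
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> sizes  = {"8kB","16kB","32kB","64kB","128kB","256kB","512kB",
                                       "1MB","2MB","4MB","8MB","16MB","32MB","64MB"};
    std::vector<double>      gflops = {1050.67,1050.43,1043.95,1043.42,1040.41,1021.68,999.17,
                                       898.32,227.88,211.14,210.99,211.22,211.95,213.55};

    const double threshold = 0.8 * gflops.front();
    for (std::size_t i = 1; i < gflops.size(); ++i) {
        if (gflops[i] < threshold) {
            std::printf("performance drops off between %s and %s\n",
                        sizes[i - 1].c_str(), sizes[i].c_str());   // here: between 1MB and 2MB
            break;
        }
    }
    return 0;
}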
Attached Files
File Type: txt i5_12600.txt (1.5 KB, 2 views)

March 2, 2022, 16:59   #6
flotus1 (Alex), Super Moderator, Germany

You can wrap code tags around any text you want to appear as preformatted plain text.
[CODE] text goes here[/CODE ] <- remove the space
Code:
 text goes here
If I understood correctly, this program is partly a memory/cache benchmark; pretty much the triad part of STREAM, without the possibility of streaming stores. It might be neat to have another output for bandwidth.

Last edited by flotus1; March 3, 2022 at 06:02.

March 3, 2022, 05:48   #7
ErikAdr (Erik Andresen), Member, Denmark

I have made a version that also shows the corresponding memory bandwidth. There were a lot of figures before, and there are even more now...


Code:
System: 12th Gen Intel(R) Core(TM) i5-12600; 2 channels SR DDR5 @ 6000


                            Bandwidth (GB/s) using 6 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64  

Array size
       8kB:     2353.36   1291.34   1281.65    799.87    514.75    282.19    129.43     55.39

      16kB:     2675.92   1403.25   1297.78    815.08    518.52    284.52    127.70     55.40

      32kB:     1833.83   1123.37   1158.65    775.19    497.57    271.84    126.96     55.17

      64kB:     1145.61   1031.83    989.96    768.46    471.57    269.89    125.24     55.16

     128kB:     1083.94   1034.94    990.89    771.72    472.04    267.54    124.16     55.18

     256kB:     1144.07   1035.63    990.41    769.20    471.71    269.90    122.44     54.95

     512kB:     1081.49    982.72    931.36    742.99    461.37    262.46    121.02     54.19

       1MB:      444.34    450.78    448.27    450.23    399.18    253.15    124.23     52.94

       2MB:      306.17    266.65    261.62    241.13    209.98    198.56    122.54     53.65

       4MB:      185.51    154.16    117.49    111.73    111.77    110.09    101.85     53.46

       8MB:       90.26     91.51     90.55     90.74     90.13     89.45     88.06     53.36

      16MB:       81.98     83.36     82.49     82.81     82.35     81.84     78.58     53.40

      32MB:       77.44     79.98     79.24     79.25     79.10     78.72     78.07     53.53

      64MB:       76.76     78.39     77.67     77.81     77.65     77.27     76.70     53.61




                            Performance (Gflops) using 6 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64  

Array size
       8kB:       98.06    107.61    213.61    299.95    364.61    388.00    350.53    297.73

      16kB:      111.50    116.94    216.30    305.65    367.29    391.22    345.84    297.76

      32kB:       76.41     93.61    193.11    290.70    352.45    373.77    343.86    296.55

      64kB:       47.73     85.99    164.99    288.17    334.03    371.10    339.19    296.50

     128kB:       45.16     86.24    165.15    289.40    334.36    367.86    336.27    296.59

     256kB:       47.67     86.30    165.07    288.45    334.13    371.11    331.62    295.38

     512kB:       45.06     81.89    155.23    278.62    326.81    360.89    327.76    291.28

       1MB:       18.51     37.57     74.71    168.84    282.75    348.08    336.45    284.56

       2MB:       12.76     22.22     43.60     90.42    148.74    273.02    331.88    288.34

       4MB:        7.73     12.85     19.58     41.90     79.17    151.37    275.85    287.36

       8MB:        3.76      7.63     15.09     34.03     63.84    122.99    238.51    286.81

      16MB:        3.42      6.95     13.75     31.06     58.33    112.53    212.83    287.02

      32MB:        3.23      6.66     13.21     29.72     56.03    108.24    211.45    287.72

      64MB:        3.20      6.53     12.94     29.18     55.00    106.25    207.73    288.17
The figures illustrate my point with the benchmark. Looking at the bandwidth line for an array size of 8MB, the bandwidth is almost constant for the first seven results and then drops for the last one. This indicates that the floating point performance is memory bound up to the result for 64 FLOPs/load, where the number crunching becomes the limiting factor. For the 1MB array, the bandwidth is constant for the first four results and then decays; in other words, the computational performance is limited by memory bandwidth for the first four results and becomes compute bound at higher FLOPs/load ratios.

The column for 0.5 FLOPs/load shows the performance of the memory system. From 64kB to 256kB the results are almost constant, showing the bandwidth of the second-level cache. From 8MB and up, it is the bandwidth of the RAM that limits the performance. In the bandwidth results I have included the one write for every two reads, as in the STREAM benchmark. The figures for large arrays are very similar to the bandwidth numbers from STREAM.
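
For anyone cross-checking the two tables: with 8-byte doubles and one write counted per two reads, each load carries about 12 bytes of traffic, so the tables should be related by roughly GB/s = Gflops * 12 / (FLOPs per load). A quick check (my own sketch, not part of fpmem):
Code:
// Cross-check of the two tables above (illustration only, not part of fpmem).
// Assumes 8-byte doubles, plus one 8-byte store per two 8-byte loads,
// i.e. about 12 bytes of memory traffic per load.
#include <cstdio>

int main()
{
    // Example cell: i5-12600, 64MB arrays, 0.5 FLOPs/load.
    const double gflops         = 3.20;
    const double flops_per_load = 0.5;
    const double bytes_per_load = 8.0 + 8.0 / 2.0;   // one read + half a write

    std::printf("~%.1f GB/s\n", gflops * bytes_per_load / flops_per_load);
    // prints ~76.8 GB/s; the bandwidth table above shows 76.76 GB/s
    return 0;
}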

The computational performance has dropped a bit compared to the results posted previously. I made a mistake that affects the results from 4 FLOPs/load and up: the performance is reduced by about 10% at 4 FLOPs/load, 5% at 8 FLOPs/load and 2.5% at 16 FLOPs/load. I have uploaded a corrected version in the first post. The version attached here, which also reports memory bandwidth, is corrected as well.
Attached Files
File Type: c fpmem_bw.C (15.0 KB, 3 views)

Last edited by ErikAdr; March 3, 2022 at 07:44.

March 9, 2022, 18:55   #8
wkernkamp (Will Kernkamp), Senior Member

Code:
System: 4xOpteron 6376  32x 8GB DDR3-1600 

                            Bandwidth (GB/s) using 64 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64

Array size
       8kB:     1429.63    774.41    385.02    326.77    331.47    195.80    143.08     77.17

      16kB:     1237.93    719.69    445.85    341.76    310.33    195.59    142.24     77.11

      32kB:     1238.08    718.77    491.09    362.15    312.95    191.37    141.70     76.94

      64kB:     1228.22    717.10    471.26    379.37    312.74    190.14    140.95     76.57

     128kB:     1124.11    649.13    446.28    372.39    307.63    189.93    140.25     76.67

     256kB:     1087.40    628.51    444.04    370.25    306.16    186.85    140.14     76.70

     512kB:      693.82    516.11    424.20    331.35    272.41    174.58    136.23     75.53

       1MB:      211.77    180.41    181.49    191.25    183.45    151.17    124.84     73.27

       2MB:      124.33    119.58    119.68    120.33    120.66    118.90    114.47     72.71

       4MB:      120.83    119.24    119.48    119.93    119.58    118.78    114.82     72.70

       8MB:      121.15    119.52    119.48    119.96    119.84    118.92    115.04     72.64

      16MB:      121.22    119.55    119.73    120.29    120.00    119.02    115.03     72.60

      32MB:      121.22    121.16    120.98    121.69    121.66    120.93    117.75     73.43

      64MB:      121.26    121.13    121.02    121.57    121.60    120.98    117.40     73.53




                            Performance (Gflops) using 64 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64

Array size
       8kB:       59.57     64.53     64.17    122.54    234.79    269.22    387.51    414.76

      16kB:       51.58     59.97     74.31    128.16    219.81    268.94    385.23    414.49

      32kB:       51.59     59.90     81.85    135.81    221.68    263.13    383.76    413.54

      64kB:       51.18     59.76     78.54    142.26    221.53    261.44    381.75    411.58

     128kB:       46.84     54.09     74.38    139.65    217.90    261.15    379.84    412.08

     256kB:       45.31     52.38     74.01    138.84    216.87    256.92    379.55    412.24

     512kB:       28.91     43.01     70.70    124.26    192.96    240.05    368.97    405.98

       1MB:        8.82     15.03     30.25     71.72    129.94    207.86    338.10    393.82

       2MB:        5.18      9.96     19.95     45.12     85.47    163.49    310.03    390.84

       4MB:        5.03      9.94     19.91     44.97     84.70    163.32    310.98    390.74

       8MB:        5.05      9.96     19.91     44.98     84.89    163.51    311.57    390.42

      16MB:        5.05      9.96     19.95     45.11     85.00    163.66    311.54    390.21

      32MB:        5.05     10.10     20.16     45.63     86.17    166.28    318.90    394.67

       64MB:        5.05     10.09     20.17     45.59     86.13    166.34    317.95    395.21

March 9, 2022, 22:57   #9
wkernkamp (Will Kernkamp), Senior Member

Recompiled with OpenMP and ran two threads per process (32 processes instead of 64). The GFlops are much improved because this CPU shares the cache and FPU between each pair of integer cores.
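
For reference, a hybrid invocation along these lines might look like the following; this is only a sketch (binding options omitted, exact commands not stated in the thread), assuming the source only needs -fopenmp to enable its OpenMP path:
Code:
# compile and link with both MPI and OpenMP enabled (sketch, not the exact commands used)
mpiCC -D_USE_INTRINSIC -fopenmp -std=c++17 -O3 -march=native -o fpmem_bw fpmem_bw.c
# 32 ranks with 2 OpenMP threads each
OMP_NUM_THREADS=2 mpirun -np 32 ./fpmem_bw 30 24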


Code:
System: 4xOpteron 6376  32x DDR3-1600

                            Bandwidth (GB/s) using 32 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64

Array size
       8kB:      358.73    328.31    336.01    287.65    294.05    251.21    153.21     88.35

      16kB:      565.12    457.82    431.06    375.55    366.33    291.31    167.76     88.58

      32kB:      783.87    574.87    539.15    462.21    426.45    330.46    172.52     90.89

      64kB:      953.99    658.27    619.53    523.85    471.55    334.78    171.77     96.58

     128kB:     1090.23    716.07    671.24    528.28    506.23    329.29    178.33     97.10

     256kB:      973.01    602.55    564.47    492.98    458.04    324.14    170.09     95.76

     512kB:      986.28    613.41    578.11    509.80    469.13    327.15    170.53     95.80

       1MB:      613.50    522.80    493.97    455.18    413.47    307.43    168.32     94.67

       2MB:      200.24    183.15    185.47    185.90    186.86    182.11    153.70     92.87

       4MB:      120.47    118.79    118.98    119.18    118.86    118.71    117.33     96.64

       8MB:      120.78    118.75    118.92    119.13    118.84    118.72    118.29     97.05

      16MB:      120.87    118.92    119.01    119.67    119.00    119.05    118.82     82.24

      32MB:      120.92    120.62    120.93    121.25    121.08    121.19    118.45     84.99

      64MB:      120.95    120.49    121.03    121.43    121.28    121.20    117.56     86.36




                            Performance (Gflops) using 32 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50       1         2         4         8        16        32        64

Array size
       8kB:       14.95     27.36     56.00    107.87    208.28    345.42    414.96    474.90

      16kB:       23.55     38.15     71.84    140.83    259.48    400.55    454.34    476.12

      32kB:       32.66     47.91     89.86    173.33    302.07    454.39    467.23    488.56

      64kB:       39.75     54.86    103.25    196.44    334.01    460.32    465.20    519.11

     128kB:       45.43     59.67    111.87    198.11    358.58    452.78    482.99    521.93

     256kB:       40.54     50.21     94.08    184.87    324.45    445.70    460.66    514.69

     512kB:       41.10     51.12     96.35    191.18    332.30    449.83    461.85    514.93

       1MB:       25.56     43.57     82.33    170.69    292.88    422.72    455.86    508.84

       2MB:        8.34     15.26     30.91     69.71    132.36    250.40    416.28    499.20

       4MB:        5.02      9.90     19.83     44.69     84.20    163.23    317.78    519.45

       8MB:        5.03      9.90     19.82     44.67     84.18    163.25    320.36    521.63

      16MB:        5.04      9.91     19.84     44.88     84.29    163.70    321.81    442.06

      32MB:        5.04     10.05     20.15     45.47     85.76    166.63    320.82    456.83

      64MB:        5.04     10.04     20.17     45.54     85.90    166.65    318.40    464.21

March 10, 2022, 20:13   #10
wkernkamp (Will Kernkamp), Senior Member

I don't understand why the GFlops don't keep increasing as the flop_load_ratio goes up. Does anyone have an answer?

March 11, 2022, 09:30   #11
ErikAdr (Erik Andresen), Member, Denmark

Quote:
Originally Posted by wkernkamp
I don't understand why the GFlops don't keep increasing as the flop_load_ratio goes up. Does anyone have an answer?

The Gflops do increase for your Opteron system; see the computational performance in the table at the bottom. The table at the top shows the amount of data read from memory, corresponding to the computational performance in the lower table. The GB/s decays when the computational performance becomes the limiting factor, which is the case when the flop_load_ratio is high.

The performance is limited either by memory bandwidth or by computational performance. The test gives a picture of which factor is limiting for the various array sizes and flop_load_ratios. Hope this helps.

March 11, 2022, 09:41   #12
wkernkamp (Will Kernkamp), Senior Member

It does in my results, but not in the others. If I run the 128 case, mine also drops. Why would a higher number of repeats lead to reduced flops?
