CFD Online Discussion Forums

Benchmark fpmem (https://www.cfd-online.com/Forums/hardware/241494-benchmark-fpmem.html)

ErikAdr March 2, 2022 05:59

Benchmark fpmem
 
2 Attachment(s)
The STREAM benchmark tests memory bandwidth, even though it performs floating point operations. In that benchmark the number of floating point operations never exceeds the number of loads. This is likely also the case for many CFD programs, but not for higher order solvers based on Cartesian grids, where the ratio between floating point operations and loads can be much larger.
Optimizing in HPC is often about minimizing reads from memory. The work can be split into smaller chunks, with as much work as possible done on each chunk before the next chunk of memory is processed. The relevant size of such chunks has to be determined.
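
A minimal sketch of the idea (illustrative only, not the actual fpmem kernel; the names are made up): a streaming loop where the amount of arithmetic done per loaded element can be dialed up, and the arrays are walked in cache-sized chunks.

Code:

#include <algorithm>
#include <cstddef>
#include <vector>

// Process b into a in chunks of chunk_elems elements, doing fma_per_load
// multiply-adds on every loaded value before storing the result.
void kernel(std::vector<double>& a, const std::vector<double>& b,
            std::size_t chunk_elems, int fma_per_load)
{
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t begin = 0; begin < n; begin += chunk_elems) {
        const std::size_t end = std::min(begin + chunk_elems, n);
        for (std::size_t i = begin; i < end; ++i) {
            double x = b[i];                      // one load
            for (int k = 0; k < fma_per_load; ++k)
                x = 1.0000001 * x + 1.0e-9;       // one multiply-add (two FLOPs)
            a[i] = x;                             // one store
        }
    }
}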

I have made a benchmark, fpmem, that gives the floating point performance for various combinations of floating point operations per load and size of the processed array. The benchmark doesn't do any real work, but it can be compiled, linked and run in about 5 minutes.

The instructions for compiling, linking and running the benchmark are given in the first few lines of the source file. It requires a recent C++ compiler (-std=c++17) and MPI. It uses AVX2 when compiled with -D_USE_INTRINSIC. See the instructions.

I hope that some of you care to use the benchmark and post the results. The benchmark is made to run on one CPU, and if used on a large cluster the performance will just increase linearly with the number of CPUs. I don't have access to EPYC Milan or newer Xeons on socket LGA4189, so results from these would be very interesting to me.

I have attached the benchmark (fpmem.c) and the results for my newly built system with an Intel i5-12600.


Edit: I have uploaded a new version that corrects an error which affected the reported performance values by up to about 10%.

flotus1 March 2, 2022 07:17

2 Attachment(s)
Here are some results from my system (2x Epyc 7551). First one with all 64 threads, second with only 8 threads pinned to the first CCD. The latter makes it behave like a single 1st-gen Ryzen CPU with very low clock speeds and extremely crappy memory.
Attachment 88587
Code:

compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 64 ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

                            Performance (Gflops) using 64 processes with AVX2
FLOPs/load:        0.50      1        2        4        8        16        32        64 
Array size
      8kB:      316.30    319.22    515.52    917.54  1050.67  1116.92    892.12    659.82
      16kB:      320.53    319.74    518.03    918.43  1050.43  1114.73    892.49    660.28
      32kB:      273.02    295.91    512.90    900.98  1043.95  1116.21    892.29    660.26
      64kB:      277.78    295.18    512.84    903.64  1043.42  1116.13    892.17    659.73
    128kB:      272.64    296.74    509.96    897.51  1040.41  1116.53    893.17    659.66
    256kB:      228.56    291.10    486.24    862.42  1021.68  1110.92    893.14    659.84
    512kB:      138.33    271.48    460.28    831.43    999.17  1096.32    892.43    659.86
      1MB:      110.29    170.01    312.09    673.73    898.32  1051.14    888.12    658.28
      2MB:      12.53    25.13    50.02    125.73    227.88    425.46    831.72    656.62
      4MB:      11.63    23.28    46.65    116.91    211.14    398.37    765.78    656.24
      8MB:      11.63    23.29    46.59    116.60    210.99    398.59    767.66    655.67
      16MB:      11.64    23.33    46.78    116.84    211.22    398.66    768.35    655.37
      32MB:      11.69    23.47    47.00    117.33    211.95    400.49    765.00    655.42
      64MB:      11.72    23.61    47.29    118.19    213.55    406.43    752.34    652.60

Attachment 88588
Code:

compiled: mpiCC -D_USE_INTRINSIC -std=c++17 -O3 -march=native -c fpmem.c (gcc version 9.2.0)
run: mpirun -np 8 --bind-to core --rank-by core --map-by core ./fpmem 30 24
System: 2x AMD Epyc 7551, 16x32GB DDR4-2666 2Rx4, OpenSUSE Leap 15.3, 5.3.18-150300.59.46-default

                            Performance (Gflops) using 8 processes with AVX2
FLOPs/load:        0.50      1        2        4        8        16        32        64 
Array size
      8kB:      37.58    38.02    61.04    110.03    125.33    134.04    108.10    80.87
      16kB:      38.58    39.24    63.00    110.46    126.50    134.51    108.40    80.88
      32kB:      38.08    38.47    62.44    110.13    124.67    134.28    107.54    80.63
      64kB:      38.22    38.61    62.78    110.24    124.63    134.33    107.81    80.78
    128kB:      35.09    38.83    62.52    110.25    125.04    134.35    108.12    80.78
    256kB:      32.88    39.08    62.60    110.24    125.85    135.03    107.83    80.72
    512kB:      17.93    38.64    62.25    108.93    125.67    135.34    108.25    80.75
      1MB:      13.75    22.19    38.58    86.55    113.44    131.61    107.92    80.62
      2MB:        1.57      3.14      6.25    15.77    28.78    53.64    103.39    80.47
      4MB:        1.46      2.90      5.81    14.63    26.32    49.65    96.09    80.49
      8MB:        1.45      2.90      5.82    14.62    26.34    49.57    96.13    80.47
      16MB:        1.46      2.91      5.83    14.64    26.45    49.66    96.57    80.35
      32MB:        1.46      2.92      5.83    14.65    26.37    49.83    96.01    80.51
      64MB:        1.47      2.97      5.99    15.05    27.31    52.29    92.34    79.90

I will leave interpretation up to you :confused:

ErikAdr March 2, 2022 09:26

Quote:

Originally Posted by flotus1 (Post 823352)
Here are some results from my system (2x Epyc 7551). First one with all 64 threads, second with only 8 threads pinned to the first CCD. The latter makes it behave like a single 1st-gen Ryzen CPU with very low clock speeds and extremely crappy memory.
Attachment 88587
Attachment 88588
I will leave interpretation up to you :confused:


Compared to the i5-12600, it looks like the two systems have about the same ratio between computational performance (with AVX2) and memory bandwidth. The Zen core has one fmadd 'engine' (8 FLOPs/cycle), whereas Zen 2, Zen 3 and newer Intel cores have two 'engines' and do 16 FLOPs/cycle with AVX2. This is seen especially for the small arrays that can be contained within the first level cache: the Zen cores deliver about half the performance of the Intel cores at the same clock speed, but the 7551 has a lot of cores. Looking at the line for 64MB arrays, both systems are memory bound up to about the point where there are 32 FLOPs/load. This can be seen because the performance numbers double each time the FLOPs/load ratio doubles: the data supplied by the memory is the limiting factor. At 64 FLOPs/load both systems are cpu-bound. Thanks for running the benchmark!
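
As a rough sanity check of the compute ceiling (the all-core clock here is my assumption, not a measured value):

Code:

#include <cstdio>

int main() {
    const int    cores         = 64;   // 2 sockets x 32 Zen cores
    const double clock_ghz     = 2.0;  // assumed all-core clock; the 7551 boosts somewhat higher
    const double flops_per_clk = 8.0;  // 256-bit fmadd throughput: 4 doubles x 2 FLOPs
    std::printf("peak ~ %.0f Gflops\n", cores * clock_ghz * flops_per_clk);
    // ~1024 Gflops, in the same ballpark as the ~1100 Gflops plateau at 16 FLOPs/load above
    return 0;
}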

Simbelmynë March 2, 2022 14:04

Another benchmark? Count me in! ;)

Not sure what it means but here you go:

Code:

System: Ryzen 3700X, 16GB DDR4 SingleRank @ 3200 MT/s, GCC 9.3, Ubuntu 20.04
 
                            Performance (Gflops) using 8 processes with AVX2

FLOPs/load:        0.50      1        2        4        8        16        32        64 

Array size
      8kB:      71.31    118.02    164.55    318.22    368.69    373.22    286.99    203.55

      16kB:      59.24    121.15    163.89    315.79    367.57    376.81    287.23    202.66

      32kB:      62.78    105.93    160.60    278.80    351.75    358.89    286.54    203.30

      64kB:      62.71    105.87    163.70    280.15    354.65    372.24    287.88    203.20

    128kB:      62.05    103.28    159.84    275.72    350.16    369.65    286.98    203.37

    256kB:      46.90    95.48    149.06    273.69    343.32    364.09    284.49    203.07

    512kB:      35.50    74.42    142.31    237.92    334.94    356.40    283.07    202.85

      1MB:      35.69    70.13    128.29    261.14    331.96    352.53    282.27    202.63

      2MB:      15.54    14.64    43.74    131.91    181.58    271.45    277.96    201.78

      4MB:        1.80      3.60      7.16    18.21    31.90    61.04    117.54    199.16

      8MB:        1.75      3.50      7.10    18.51    32.45    60.25    115.02    199.27

      16MB:        1.75      3.49      7.00    17.69    31.56    60.00    115.75    199.15

      32MB:        1.76      3.55      6.98    17.62    31.79    59.44    115.03    199.75

      64MB:        1.78      3.51      7.09    17.72    32.07    61.32    119.82    191.47

Code:

System: 2 x Xeon E5-2673v3, 128 GB DDR4 Dual rank @2133 MT/s, GCC 8.3, Debian 10

                            Performance (Gflops) using 24 processes with AVX2

FLOPs/load:        0.50      1        2        4        8        16        32        64 

Array size
      8kB:      181.82    227.95    370.67    544.75    592.74    594.93    479.71    373.98

      16kB:      220.25    237.19    400.54    552.96    609.18    592.86    482.14    374.02

      32kB:      63.79    125.82    224.20    484.24    540.32    585.78    337.38    374.26

      64kB:      63.68    124.11    238.68    475.64    594.16    584.14    479.93    374.18

    128kB:      55.34    93.03    189.14    392.51    542.51    579.75    481.11    374.07

    256kB:      34.44    66.66    134.06    301.36    463.48    575.17    481.45    374.03

    512kB:      33.57    64.77    129.26    287.09    448.37    571.43    482.26    373.94

      1MB:      31.43    57.68    112.56    260.22    413.58    552.70    478.22    373.60

      2MB:        4.98    10.13    20.27    50.92    92.27    173.84    338.16    370.65

      4MB:        4.73      9.46    18.85    47.05    84.63    159.23    303.62    370.16

      8MB:        4.69      9.39    18.77    46.72    83.82    158.12    300.67    369.71

      16MB:        4.65      9.34    18.64    46.53    83.17    157.03    299.76    369.34

      32MB:        4.64      9.23    18.42    45.97    82.80    155.93    298.16    369.09

      64MB:        4.63      9.20    18.39    45.90    82.48    155.30    297.76    369.05


ErikAdr March 2, 2022 16:19

1 Attachment(s)
I can understand there is a need for an explanation of how to interpret the results. I took it in steps. First I ran STREAM with different array sizes to test the cache and memory bandwidths. Then I looked at floating point performance in cases with several floating point operations for each load. In HPC, some problems have a low value of the ratio FLOPs/load, and others a very high value. For CFD the ratio is usually low, but for multiplication of two large matrices the ratio is very high. For small values the performance is limited by the memory bandwidth, and for high values the performance is limited by the cpu's ability to crunch numbers. My interest is typically in the intermediate range, say with ratios from 4 to 64, where it is not evident what limits the performance.

I don't know how to include a text file, but I have attached the results for the i5-12600 again. Please look at it. In the first column, for a ratio of 0.5, it is seen that the performance is highest for very small arrays. Arrays of 8kB and 16kB can be contained within the 1st level cache, which is the fastest cache. At 32kB the performance is lower, since the 1st level cache is a little too small and the bandwidth starts to be limited by the slower 2nd level cache. From 64kB to 256kB the performance is nearly constant and determined by the bandwidth of the 2nd level cache. For larger arrays the bandwidth of the 3rd level cache starts to play a role, and from array sizes of 8MB and larger the performance is limited by the bandwidth of the RAM. All performance figures in the first column are in this way determined by the bandwidth of the memory system in which the arrays can be contained. The calculation uses two equal-sized arrays, but the size specified in the table is for each array.
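
As a concrete check (the cache sizes are my assumptions for an i5-12600, taken from public spec sheets, not something the benchmark reports), the per-process working set of two arrays can be compared against the cache levels:

Code:

#include <cstdio>
#include <initializer_list>

int main() {
    // Assumed i5-12600 P-core caches: 48 kB L1d and 1.25 MB L2 per core, 18 MB L3 shared.
    const double l1_kb = 48.0, l2_kb = 1280.0, l3_kb = 18432.0;
    const int processes = 6;                                  // as in the runs above
    for (double array_kb : {8.0, 16.0, 32.0, 256.0, 1024.0, 8192.0}) {
        const double per_proc = 2.0 * array_kb;               // two equal-sized arrays per process
        const char* level = per_proc <= l1_kb ? "L1d"
                          : per_proc <= l2_kb ? "L2"
                          : per_proc * processes <= l3_kb ? "L3 (shared)" : "RAM";
        std::printf("%7.0f kB arrays -> working set served by %s\n", array_kb, level);
    }
    return 0;
}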


The rightmost column, for a ratio of 64, is much easier to interpret. Here all performance figures are about the same, independent of the array size. The performance is determined solely by the cpu's ability to crunch numbers.


For most intermediate columns the performance is about constant for the smaller array sizes, where the cpu is the limiting factor, but at some point the memory system holding the larger arrays gets too slow, and the performance becomes limited by the memory bandwidth instead. The benchmark shows where this happens!

flotus1 March 2, 2022 16:59

You can wrap code tags around any text you want to appear as formatted in a plain text file.
[CODE] text goes here[/CODE ] <- remove the space
Code:

text goes here
If I understood correctly, this program is partially a memory/cache benchmark. Pretty much the triad part of STREAM, without the possibility of streaming stores. Might be neat to have another output for bandwidth.
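
For reference, the STREAM triad kernel this resembles is essentially just:

Code:

// STREAM "triad": two loads, one store and two FLOPs per element.
void triad(double* a, const double* b, const double* c, long n, double scalar) {
    for (long j = 0; j < n; ++j)
        a[j] = b[j] + scalar * c[j];
}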

ErikAdr March 3, 2022 05:48

1 Attachment(s)
I have made a version that also shows the corresponding memory bandwidth. There were a lot of figures before and even more now.....


Code:

System: 12th Gen Intel(R) Core(TM) i5-12600; 2 channels SR DDR5 @ 6000


                            Bandwidth (GB/s) using 6 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64 

Array size
      8kB:    2353.36  1291.34  1281.65    799.87    514.75    282.19    129.43    55.39

      16kB:    2675.92  1403.25  1297.78    815.08    518.52    284.52    127.70    55.40

      32kB:    1833.83  1123.37  1158.65    775.19    497.57    271.84    126.96    55.17

      64kB:    1145.61  1031.83    989.96    768.46    471.57    269.89    125.24    55.16

    128kB:    1083.94  1034.94    990.89    771.72    472.04    267.54    124.16    55.18

    256kB:    1144.07  1035.63    990.41    769.20    471.71    269.90    122.44    54.95

    512kB:    1081.49    982.72    931.36    742.99    461.37    262.46    121.02    54.19

      1MB:      444.34    450.78    448.27    450.23    399.18    253.15    124.23    52.94

      2MB:      306.17    266.65    261.62    241.13    209.98    198.56    122.54    53.65

      4MB:      185.51    154.16    117.49    111.73    111.77    110.09    101.85    53.46

      8MB:      90.26    91.51    90.55    90.74    90.13    89.45    88.06    53.36

      16MB:      81.98    83.36    82.49    82.81    82.35    81.84    78.58    53.40

      32MB:      77.44    79.98    79.24    79.25    79.10    78.72    78.07    53.53

      64MB:      76.76    78.39    77.67    77.81    77.65    77.27    76.70    53.61




                            Performance (Gflops) using 6 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64 

Array size
      8kB:      98.06    107.61    213.61    299.95    364.61    388.00    350.53    297.73

      16kB:      111.50    116.94    216.30    305.65    367.29    391.22    345.84    297.76

      32kB:      76.41    93.61    193.11    290.70    352.45    373.77    343.86    296.55

      64kB:      47.73    85.99    164.99    288.17    334.03    371.10    339.19    296.50

    128kB:      45.16    86.24    165.15    289.40    334.36    367.86    336.27    296.59

    256kB:      47.67    86.30    165.07    288.45    334.13    371.11    331.62    295.38

    512kB:      45.06    81.89    155.23    278.62    326.81    360.89    327.76    291.28

      1MB:      18.51    37.57    74.71    168.84    282.75    348.08    336.45    284.56

      2MB:      12.76    22.22    43.60    90.42    148.74    273.02    331.88    288.34

      4MB:        7.73    12.85    19.58    41.90    79.17    151.37    275.85    287.36

      8MB:        3.76      7.63    15.09    34.03    63.84    122.99    238.51    286.81

      16MB:        3.42      6.95    13.75    31.06    58.33    112.53    212.83    287.02

      32MB:        3.23      6.66    13.21    29.72    56.03    108.24    211.45    287.72

      64MB:        3.20      6.53    12.94    29.18    55.00    106.25    207.73    288.17

The figures show my point with the benchmark. Looking at the bandwidth results for an array size of 8MB, the bandwidth is almost constant for the first 7 results and then drops for the last one. This indicates that the fp performance is memory bound up until the result for 64 FLOPs/load, where the number crunching becomes the limiting factor. For the 1MB array, the bandwidth is at a constant level for the first 4 results and then decays. In other words, the computational performance is limited by memory bandwidth for the first 4 results, and for higher ratios of FLOPs/load it becomes compute bound.

Looking at the column for 0.5 FLOPs/load, the performance of the memory system can be seen. From 64kB to 256kB the results are almost constant, showing the performance of the second level cache. From 8MB and up, it is the bandwidth of the RAM that limits the performance. In the bandwidth results I have included the one write for each two reads, as in the STREAM benchmark. The bandwidth for large arrays is very similar to the figures from STREAM.
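
The way I read that accounting (an approximation on my part, not the program's exact bookkeeping): per FLOP there are 1/ratio 8-byte loads plus half an 8-byte store, so GB/s is roughly Gflops * 12 / (FLOPs/load). For example:

Code:

#include <cstdio>

int main() {
    const double gflops = 3.76;  // 8MB array at 0.5 FLOPs/load in the Gflops table above
    const double ratio  = 0.5;   // FLOPs per load
    // (2 reads + 1 write) x 8 bytes per 2 loads, and 2*ratio FLOPs per 2 loads
    const double gbs = gflops * (2.0 + 1.0) * 8.0 / (2.0 * ratio);
    std::printf("estimated bandwidth: %.1f GB/s\n", gbs);  // ~90 GB/s, matching the bandwidth table entry
    return 0;
}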

Looking at the computational performance, it has dropped a bit compared to the results previously posted. I made a mistake that affects the results from 4 FLOPs/load and up. The performance is reduced by about 10% for 4 FLOPs/load, 5% for 8 FLOPs/load and 2.5% for 16 FLOPs/load. I have uploaded a corrected version in the first post. It is also corrected in the version attached here, which also reports memory bandwidth.

wkernkamp March 9, 2022 18:55

Code:

System: 4xOpteron 6376  32x 8GB DDR3-1600

                            Bandwidth (GB/s) using 64 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64

Array size
      8kB:    1429.63    774.41    385.02    326.77    331.47    195.80    143.08    77.17

      16kB:    1237.93    719.69    445.85    341.76    310.33    195.59    142.24    77.11

      32kB:    1238.08    718.77    491.09    362.15    312.95    191.37    141.70    76.94

      64kB:    1228.22    717.10    471.26    379.37    312.74    190.14    140.95    76.57

    128kB:    1124.11    649.13    446.28    372.39    307.63    189.93    140.25    76.67

    256kB:    1087.40    628.51    444.04    370.25    306.16    186.85    140.14    76.70

    512kB:      693.82    516.11    424.20    331.35    272.41    174.58    136.23    75.53

      1MB:      211.77    180.41    181.49    191.25    183.45    151.17    124.84    73.27

      2MB:      124.33    119.58    119.68    120.33    120.66    118.90    114.47    72.71

      4MB:      120.83    119.24    119.48    119.93    119.58    118.78    114.82    72.70

      8MB:      121.15    119.52    119.48    119.96    119.84    118.92    115.04    72.64

      16MB:      121.22    119.55    119.73    120.29    120.00    119.02    115.03    72.60

      32MB:      121.22    121.16    120.98    121.69    121.66    120.93    117.75    73.43

      64MB:      121.26    121.13    121.02    121.57    121.60    120.98    117.40    73.53




                            Performance (Gflops) using 64 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64

Array size
      8kB:      59.57    64.53    64.17    122.54    234.79    269.22    387.51    414.76

      16kB:      51.58    59.97    74.31    128.16    219.81    268.94    385.23    414.49

      32kB:      51.59    59.90    81.85    135.81    221.68    263.13    383.76    413.54

      64kB:      51.18    59.76    78.54    142.26    221.53    261.44    381.75    411.58

    128kB:      46.84    54.09    74.38    139.65    217.90    261.15    379.84    412.08

    256kB:      45.31    52.38    74.01    138.84    216.87    256.92    379.55    412.24

    512kB:      28.91    43.01    70.70    124.26    192.96    240.05    368.97    405.98

      1MB:        8.82    15.03    30.25    71.72    129.94    207.86    338.10    393.82

      2MB:        5.18      9.96    19.95    45.12    85.47    163.49    310.03    390.84

      4MB:        5.03      9.94    19.91    44.97    84.70    163.32    310.98    390.74

      8MB:        5.05      9.96    19.91    44.98    84.89    163.51    311.57    390.42

      16MB:        5.05      9.96    19.95    45.11    85.00    163.66    311.54    390.21

      32MB:        5.05    10.10    20.16    45.63    86.17    166.28    318.90    394.67

      64MB:        5.05    10.09    20.17    45.59    86.13    166.34    317.95    395.21


wkernkamp March 9, 2022 22:57

Recompiled with OpenMP and two threads per process (running 32 processes instead of 64). The GFlops are much improved because this cpu shares cache and FPU between two integer cores.


Code:

System: 4xOpteron 6376  32x DDR3-1600

                            Bandwidth (GB/s) using 32 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64

Array size
      8kB:      358.73    328.31    336.01    287.65    294.05    251.21    153.21    88.35

      16kB:      565.12    457.82    431.06    375.55    366.33    291.31    167.76    88.58

      32kB:      783.87    574.87    539.15    462.21    426.45    330.46    172.52    90.89

      64kB:      953.99    658.27    619.53    523.85    471.55    334.78    171.77    96.58

    128kB:    1090.23    716.07    671.24    528.28    506.23    329.29    178.33    97.10

    256kB:      973.01    602.55    564.47    492.98    458.04    324.14    170.09    95.76

    512kB:      986.28    613.41    578.11    509.80    469.13    327.15    170.53    95.80

      1MB:      613.50    522.80    493.97    455.18    413.47    307.43    168.32    94.67

      2MB:      200.24    183.15    185.47    185.90    186.86    182.11    153.70    92.87

      4MB:      120.47    118.79    118.98    119.18    118.86    118.71    117.33    96.64

      8MB:      120.78    118.75    118.92    119.13    118.84    118.72    118.29    97.05

      16MB:      120.87    118.92    119.01    119.67    119.00    119.05    118.82    82.24

      32MB:      120.92    120.62    120.93    121.25    121.08    121.19    118.45    84.99

      64MB:      120.95    120.49    121.03    121.43    121.28    121.20    117.56    86.36




                            Performance (Gflops) using 32 processes with AVX2 (ver 1.2)

FLOPs/load:        0.50      1        2        4        8        16        32        64

Array size
      8kB:      14.95    27.36    56.00    107.87    208.28    345.42    414.96    474.90

      16kB:      23.55    38.15    71.84    140.83    259.48    400.55    454.34    476.12

      32kB:      32.66    47.91    89.86    173.33    302.07    454.39    467.23    488.56

      64kB:      39.75    54.86    103.25    196.44    334.01    460.32    465.20    519.11

    128kB:      45.43    59.67    111.87    198.11    358.58    452.78    482.99    521.93

    256kB:      40.54    50.21    94.08    184.87    324.45    445.70    460.66    514.69

    512kB:      41.10    51.12    96.35    191.18    332.30    449.83    461.85    514.93

      1MB:      25.56    43.57    82.33    170.69    292.88    422.72    455.86    508.84

      2MB:        8.34    15.26    30.91    69.71    132.36    250.40    416.28    499.20

      4MB:        5.02      9.90    19.83    44.69    84.20    163.23    317.78    519.45

      8MB:        5.03      9.90    19.82    44.67    84.18    163.25    320.36    521.63

      16MB:        5.04      9.91    19.84    44.88    84.29    163.70    321.81    442.06

      32MB:        5.04    10.05    20.15    45.47    85.76    166.63    320.82    456.83

      64MB:        5.04    10.04    20.17    45.54    85.90    166.65    318.40    464.21


wkernkamp March 10, 2022 20:13

I don't understand why the GFlops don't keep increasing as the flop_load_ratio goes up. Anyone have an answer? :confused:

ErikAdr March 11, 2022 09:30

Quote:

Originally Posted by wkernkamp (Post 823931)
I don't understand why the GFlops don't keep increasing as the flop_load_ratio goes up. Anyone have an answer? :confused:


The Gflops do increase for your Opteron system. See the computational performance in the table at the bottom. The table at the top shows the amount of data read from memory, corresponding to the computational performance in the lower table. The GB/s decays when the computational performance becomes the limiting factor. That is the case when the flop_load_ratio is high.

The performance is either limited by memory bandwidth or by computational performance. The test gives a picture of what the limiting factor is for various array sizes and flop_load_ratios. Hope this helps.
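
As a rough illustration (my own back-of-the-envelope model, using the large-array plateaus from your 64-process run and the same traffic accounting as the bandwidth output, i.e. about 12/ratio bytes per FLOP):

Code:

#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    const double peak_gflops = 395.0;  // 64MB plateau at 64 FLOPs/load in the 64-process Gflops table
    const double bw_gbs      = 121.0;  // 64MB plateau in the 64-process bandwidth table
    for (double ratio : {0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0}) {
        const double mem_bound = bw_gbs * ratio / 12.0;       // what the RAM can feed at this ratio
        std::printf("ratio %5.1f -> ~%6.1f Gflops\n", ratio, std::min(peak_gflops, mem_bound));
    }
    return 0;
}

This roughly reproduces the 64MB row of the Gflops table: the memory term grows with the ratio until it hits the compute ceiling, after which the Gflops stop increasing.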

wkernkamp March 11, 2022 09:41

It does in my results, but not in the others. If I run the 128 case, it also drops. Why would a higher number of repeats lead to reduced flops?

