Initial public release

FZJ-JSC · Jul 20, 2024 · e98ae2b
commit e98ae2b
Showing 11 changed files with 918 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,13 @@
+.sync.toml
+*.tar.gz
+.ipynb_checkpoints/*
+.*
+!/.gitignore
+*.png
+*.svg
+*.pdf
+*.jpg
+*.html
+*.sha256
+stream_run/
+stream_src/
diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,49 @@
+cff-version: 1.2.0
+title: "JUPITER Benchmark Suite: STREAM"
+message: >-
+  In addition to citing this benchmark repository, please also cite either the JUPITER Benchmark Suite or the accompanying SC24 paper
+authors:
+  - given-names: Sebastian
+    family-names: Achilles
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0002-1943-6803'
+  - given-names: Thomas
+    family-names: Breuer
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0003-3979-4795'
+  - given-names: Kay
+    family-names: Thust
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0002-1181-1832'
+  - given-names: Yannik
+    family-names: Müller
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0009-0001-5696-6512'
+  - given-names: Andreas
+    family-names: Herten
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0002-7150-2505'
+  - given-names: Alexandre
+    family-names: Strube
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0002-9177-6474'
+  - given-names: Dorian
+    family-names: Krause
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0001-9799-562X'
+  - given-names: Salem
+    family-names: El Sayed
+    affiliation: Forschungszentrum Jülich, Jülich Supercomputing Centre
+    orcid: 'https://orcid.org/0000-0002-7217-6027'
+abstract: "The STREAM benchmark of the JUPITER Benchmark Suite"
+identifiers:
+  - type: doi
+    value: 10.5281/zenodo.12787776
+    description: Version-agnostic Zenodo Identifier
+repository-code: 'https://github.com/FZJ-JSC/jubench-stream/'
+license: MIT
+date-released: '2024-07-20'
+references:
+  - title: "JUPITER Benchmark Suite"
+    type: software
+    doi: 10.5281/zenodo.12737073
diff --git a/DESCRIPTION.md b/DESCRIPTION.md
@@ -0,0 +1,168 @@
+# STREAM
+
+## Purpose
+
+The STREAM benchmark is a synthetic benchmark that measures the sustainable memory bandwidth of a compute node. It performs little to no computation per byte transferred to or from memory. Different versions of this benchmark are available. The one used here is written in C and uses only OpenMP for threading (no MPI).
+
+## Source
+
+Archive name: `stream-bench.tar.gz`.
+
+The source code is available in the file `src/stream_src.tar.gz`.
+
+The provided sources are equivalent to the STREAM version 5.10 from the official website: https://www.cs.virginia.edu/stream/FTP/Code/
+
+Download and unpack the tar file first; this creates the directory `stream_src`.
+
+```
+cd src
+tar -xvzf stream_src.tar.gz
+```
+
+## Building
+
+OpenMP must be enabled for both the compiler and the linker. The preprocessor macros `STREAM_ARRAY_SIZE` and `NTIMES` must be defined (see _Configurations_ below).
+
+When using the JUBE script, JUBE will build the STREAM executable in the compile step (see _Execution with JUBE_ below). To build STREAM without JUBE, use a compiler invocation similar to the following line.
+
+```
+cd stream_src
+make CC=gcc CFLAGS="-fopenmp -Wall -DSTREAM_ARRAY_SIZE=$((2 ** 26)) -DNTIMES=200" stream_c.exe
+```
+
+Since the array size is a compile-time parameter, it is recommended to build a dedicated executable for each `STREAM_ARRAY_SIZE`.
+
+Depending on the tested configuration, the compiler flags may be modified; see the _Modification_ section below.
+
+## Execution
+
+The executable is called `stream_c.exe`.
+
+### Command Line
+
+STREAM can be executed manually, for example with the following command:
+
+```
+OMP_PLACES="cores" OMP_PROC_BIND="spread" OMP_NUM_THREADS=128 ./stream_c.exe
+```
+
+### Execution with JUBE
+
+With JUBE, the benchmark can be executed as follows:
+
+```
+jube run stream.jube.xml [--tag tags...]
+jube continue stream_run # wait/repeat until all steps are done
+jube result -a stream_run
+```
+
+|       Tags           |               Effect               | Default |
+|----------------------|------------------------------------|---------|
+| `varySize`           | `STREAM_ARRAY_SIZE=2^15..2^28`     |         |
+| `varySizeextended`   | `STREAM_ARRAY_SIZE=2^10..2^35`     |         |
+| `fixedSize`          | `STREAM_ARRAY_SIZE=2^28`           | yes     |
+| `varyThreads`        | `OMP_NUM_THREADS=1..16`            | yes     |
+| `threads1`           | `OMP_NUM_THREADS=1`                |         |
+| `threads4`           | `OMP_NUM_THREADS=4`                |         |
+| `threadsCores`       | `OMP_NUM_THREADS=#Cores`           |         |
+| `threadsNuma`        | `OMP_NUM_THREADS=#NUMA Domains`    |         |
+| `threadsHyper`       | `OMP_NUM_THREADS=#Hardware Threads`|         |
+| `s22`                | `module load Stages/2022`          | yes     |
+| `s21`                | `module load Stages/2021`          |         |
+| `s20`                | `module load Stages/2020`          |         |
+| `varyCompiler`       | `CC=gcc,icc,nvc,clang`             |         |
+| `gcc`                | `CC=gcc`                           | yes     |
+| `intel`              | `CC=icc`                           |         |
+| `nvhpc`              | `CC=nvc`                           |         |
+| `aocc`               | `CC=clang` (AOCC)                  |         |
+
+The results can be found in the columns `Copy`, `Scale`, `Add`, and `Triad`.
+
+### Configurations
+
+Candidates are requested to run the following configurations; the corresponding JUBE tags are given. All STREAM benchmarks should be run in FP64 precision. See also the modification overview below.
+
+| Name                       | Description                                                                                                              | JUBE Tags                 | 
+| ---------------------------| ------------------------------------------------------------------------------------------------------------------------ | --------------------------| 
+| Threads1                   | Single threaded, array sizes 2^15 to 2^28 should be reported; array size 2^28 will be used for evaluation                | `threads1` `varySize`     |
+| Threads4                   | Four threads located on the same socket, fixed size (2^28)                                                               | `threads4`                |
+| ThreadsCores               | One thread per physical core on the node, fixed size (2^28)                                                              | `threadsCores`            |
+| ThreadsHyper               | One thread per hardware thread on the node, fixed size (2^28)                                                            | `threadsHyper`            |
+| Optimal                    | A custom configuration that achieves maximum bandwidth (2^28)                                                            | `optimal`^                |
+
+The following STREAM parameters are to be used:
+
+|         Name        |                     Value                     |
+|---------------------|-----------------------------------------------|
+| `STREAM_ARRAY_SIZE` | 2^28; 2^15 to 2^28 for _Threads1_ in addition |
+| `NTIMES`            | 200                                           |
+
+In any case, the timing accuracy output by the program (`CLK/Ins`) needs to be more than 20 clock ticks.
+
+If multiple memory domains are exposed to user space (for example HBM and DDR), each memory domain must be measured and reported separately. This can be achieved, for example, by using `numactl` and the `-m` (`--membind`) option. Mixing memory domains (for example _cache mode_) is not allowed.
+
+^ The tag already exists, but its values have to be defined in `parametersets.jube.xml`.
+
+### Modification
+
+The following modifications are allowed, depending on the configuration:
+
+| Name         | Compiler | Compiler Flags | Size | Threads | Affinity | Source Code |
+| ------------ | -------- | -------------- | ---- | ------- | -------- | ----------- |
+| Threads1     | Yes      | Yes            | No^  | No      | No       | No          |
+| Threads4     | Yes      | Yes            | No   | No      | No       | No          |
+| ThreadsCores | Yes      | Yes            | No   | No      | No       | No          |
+| ThreadsHyper | Yes      | Yes            | No   | No      | No       | No          |
+| Optimal      | Yes      | Yes            | No   | Yes     | Yes      | No          |
+
+
+Here, _Size_ refers to the length of the arrays in STREAM (i.e., `STREAM_ARRAY_SIZE` / JUBE variable `size`), _Threads_ to the number of OpenMP threads (`OMP_NUM_THREADS` / JUBE's `threadspertask`), and _Affinity_ to the locations of spawned threads (`OMP_PLACES`, `OMP_PROC_BIND`, ...). For the _Threads1_ configuration, a report of additional array sizes is expected for the qualitative evaluation.
+
+^: `STREAM_ARRAY_SIZE` should be 2^28 for the evaluation, but a scan through multiple array sizes (2^15 to 2^28) should be reported as part of the feedback.
+
+## Verification
+
+All executions should run without errors. In case of JUBE: status _done_, no errors.
+Only if out-of-memory crashes occur may the range of `STREAM_ARRAY_SIZE` values be narrowed to prevent them.
+
+## Results
+
+In most cases, the average triad bandwidth is to be reported. For the case of the freely chosen _Optimal_ configuration, please report the best bandwidth achieved. Should multiple memory domains be available (for example HBM and DDR), the benchmark is to be run separately for each domain.
+
+Abbreviated example output follows.
+
+| stage | modules | mem | CLK/Ins (>20) |   N |  Copy | Scale |   Add | Triad | runtime[sec] |
+|-------|---------|-----|---------------|-----|-------|-------|-------|-------|--------------|
+|  2022 |     GCC | 3.0 |          55.0 | 200 | 16786 | 19590 | 26872 | 26709 |         0.47 |
+|  2022 |   Intel | 3.0 |          57.0 | 200 | 22554 | 26494 | 26441 | 26655 |         0.48 |
+|  2022 |   NVHPC | 3.0 |         200.0 | 200 | 10701 |  9763 | 15062 | 14096 |         0.62 |
+|  2022 |    AOCC |     |               | 200 |       |       |       |       |         0.49 |
+
+
+| stage | modules | Exp | Array [MiB] | Thread [MiB] | Threads | CLK/Ins (>20) |   N |    Triad   | runtime[sec]  |
+|-------|---------|-----|-------------|--------------|---------|---------------|-----|------------|---------------|
+|  2022 |     GCC |  10 |         0.0 |          0.0 |       8 |           5.0 | 200 |   8589.9   |         0.37  |
+|  2022 |     GCC |  12 |         0.0 |          0.1 |       8 |           4.0 | 200 |  34359.7   |         0.37  |
+|  2022 |     GCC |  14 |         0.1 |          0.4 |       8 |           5.0 | 200 | 103079.2   |         0.36  |
+|  2022 |     GCC |  16 |         0.5 |          1.5 |       8 |           6.0 | 200 | 329853.5   |         0.37  |
+|  2022 |     GCC |  18 |         2.0 |          6.0 |       8 |          18.0 | 200 | 488671.8   |         0.38  |
+|  2022 |     GCC |  20 |         8.0 |         24.0 |       8 |          41.0 | 200 | 514893.3   |         0.41  |
+|  2022 |     GCC |  22 |        32.0 |         96.0 |       8 |         882.0 | 200 | 167478.2   |         0.92  |
+|  2022 |     GCC |  24 |       128.0 |        384.0 |       8 |        3672.0 | 200 | 166520.4   |         2.95  |
+|  2022 |     GCC |  26 |       512.0 |       1536.0 |       8 |       14652.0 | 200 | 159671.9   |         9.84  |
+|  2022 |     GCC |  28 |      2048.0 |       6144.0 |       8 |       53737.0 | 200 | 127464.6   |        35.08  |
+|  2022 |     GCC |  32 |     32768.0 |      98304.0 |       8 |      380153.0 | 200 | 153141.9   |       498.92  |
+
+## Commitment
+
+The following values should be reported. Any run with CLK/Ins <= 20 is not counted. For _Threads1_, the average triad bandwidth for array sizes from 2^15 to 2^28 is to be reported as well. The configuration chosen for _Optimal_ needs to be given.  
+If multiple memory types are exposed to user space (for example HBM in flat mode), the benchmark should be run on each.
+
+|     Name     |                    Values                   |
+|--------------|---------------------------------------------|
+| Threads1     | Average triad bandwidth for array size 2^28 |
+| Threads4     | Average triad bandwidth for array size 2^28 |
+| ThreadsCores | Average triad bandwidth for array size 2^28 |
+| ThreadsHyper | Average triad bandwidth for array size 2^28 |
+| Optimal      | Average triad bandwidth for array size 2^28 |
+
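As a minimal sketch outside the committed files, the _Commitment_ rule that runs with CLK/Ins <= 20 are not counted can be illustrated in a few lines of Python. The rows are copied from the abbreviated example output in `DESCRIPTION.md`. The names `ROWS`, `array_mib`, and `counted_rows` are hypothetical helpers introduced only for this illustration, and the 8 bytes per element assume the FP64 precision that the description requests.

```python
# Rows copied from the abbreviated example output in DESCRIPTION.md:
# (array size exponent, CLK/Ins, triad bandwidth as printed in the table)
ROWS = [
    (10, 5.0, 8589.9),
    (12, 4.0, 34359.7),
    (14, 5.0, 103079.2),
    (16, 6.0, 329853.5),
    (18, 18.0, 488671.8),
    (20, 41.0, 514893.3),
    (22, 882.0, 167478.2),
    (24, 3672.0, 166520.4),
    (26, 14652.0, 159671.9),
    (28, 53737.0, 127464.6),
    (32, 380153.0, 153141.9),
]

MIN_CLK_PER_INS = 20.0  # runs at or below this value are not counted
BYTES_PER_ELEMENT = 8   # assumption: FP64 array elements


def array_mib(exponent):
    """Size in MiB of one array with 2**exponent FP64 elements."""
    return (2 ** exponent) * BYTES_PER_ELEMENT / 2 ** 20


def counted_rows(rows, threshold=MIN_CLK_PER_INS):
    """Keep only rows whose CLK/Ins value is strictly above the threshold."""
    return [row for row in rows if row[1] > threshold]


if __name__ == "__main__":
    for exponent, clk, triad in counted_rows(ROWS):
        print(
            f"2^{exponent}: {array_mib(exponent):9.1f} MiB per array, "
            f"CLK/Ins {clk:9.1f}, Triad {triad:9.1f}"
        )
```

With the example data, six rows remain. The 2^18 row is dropped because its CLK/Ins value is 18.0, even though its triad value is among the highest in the table.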
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Forschungszentrum Jülich GmbH
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,7 @@
+# JUPITER Benchmark Suite: STREAM
+
+[![DOI](https://zenodo.org/badge/831481274.svg)](https://zenodo.org/badge/latestdoi/831481274) [![Static Badge](https://img.shields.io/badge/DOI%20(Suite)-10.5281%2Fzenodo.12737073-blue)](https://zenodo.org/badge/latestdoi/764615316)
+
+This benchmark is part of the [JUPITER Benchmark Suite](https://github.com/FZJ-JSC/jubench). See the repository of the suite for some general remarks.
+
+This repository contains the STREAM benchmark. [`DESCRIPTION.md`](DESCRIPTION.md) contains details for compilation, execution, and evaluation. Sources are available in `./src/`, archived as a tarball.