Skip to content

Latest commit

 

History

History
168 lines (118 loc) · 10.1 KB

DESCRIPTION.md

File metadata and controls

168 lines (118 loc) · 10.1 KB

STREAM

Purpose

The STREAM benchmark is a synthetic benchmark which measures the sustainable memory bandwidth of a compute node. It only uses little to no computation per byte transferred to or from memory. Different versions of this benchmark are available. The one used here is written in C and only uses OpenMP for threading (no MPI).

Source

Archive name: stream-bench.tar.gz.

The source code is available in the file src/stream_src.tar.gz.

The provided sources are equivalent to the STREAM version 5.10 from the official website: https://www.cs.virginia.edu/stream/FTP/Code/

Download and unpack the tar-file first; the directory stream_src will be created.

cd src
tar -xvzf stream_src.tar.gz

Building

OpenMP must be enabled for compiler and linker. The preprocessor macros STREAM_ARRAY_SIZE and NTIMES must be defined (see Configurations below).

When using the JUBE script, JUBE will build the STREAM executable in the compile step (see Execution with JUBE below). To build STREAM without JUBE, use a compiler invocation similar to the following line.

cd stream_src
make CC=gcc CFLAGS="-fopenmp -Wall -DSTREAM_ARRAY_SIZE=$((2 ** 26)) -DNTIMES=200" stream_c.exe

Since the array size is a compile-time parameter, it is recommended to build a dedicated executable for each STREAM_ARRAY_SIZE.

Depending on the tested configuration, the compiler flags might be modified, see Modification section below.

Execution

The executable is called stream_c.exe.

Command Line

Stream can be executed manually for example with the following command:

OMP_PLACES="cores" OMP_PROC_BIND="spread" OMP_NUM_THREADS=128 ./stream_c.exe

Execution with JUBE

Using JUBE the benchmark can be executed as follows:

jube run stream.jube.xml [--tag tags...]
jube continue stream_run # wait/repeat until all steps are done
jube result -a stream_run
Tags Effect Default
varySize STREAM_ARRAY_SIZE=2^15..2^28
varySizeextended STREAM_ARRAY_SIZE=2^10..2^35
fixedSize STREAM_ARRAY_SIZE=2^28 yes
varyThreads OMP_NUM_THREADS=1..16 yes
threads1 OMP_NUM_THREADS=1
threads4 OMP_NUM_THREADS=4
threadsCores OMP_NUM_THREADS=#Cores
threadsNuma OMP_NUM_THREADS=#NUMA Domains
threadsHyper OMP_NUM_THREADS=#Hardware Threads
s22 module load Stages/2022 yes
s21 module load Stages/2021
s20 module load Stages/2020
varyCompiler CC=gcc,icc,nvc,clang
gcc CC=gcc yes
intel CC=icc
nvhpc CC=nvc
aocc CC=clang (AOCC)

The results can be found in the columns Copy, Scale, Add, Triad.

Configurations

Candidates are requested to run the following configurations; according JUBE tags are given. All STREAM benchmarks should be run in FP64 precision. See also the modification overview below.

Name Description JUBE Tags
Threads1 Single threaded, array sizes 2^15 to 2^28 should be reported; array size 2^28 will be used for evaluation threads1 varySize
Threads4 Four threads located on the same socket, fixed size (2^28) threads4
ThreadsCores One thread per physical core on the node, fixed size (2^28) threadsCores
ThreadsHyper One thread per hardware thread on the node, fixed size (2^28) threadsHyper
Optimal A custom configuration that achieves maximum bandwidth (2^28) optimal^

The following STREAM parameters are to be used:

Name Value
STREAM_ARRAY_SIZE 2^28; 2^15 to 2^28 for Threads1 in addition
NTIMES 200

In any case, the timing accuracy output by the program needs to be at least 20 clocks-ticks (CLK/Ins).

If multiple memory domains are exposed to user space (for example HBM and DDR), each memory domain must be measured and reported separately. This can be achieved, for example, by using numactl and the -m (--membind) option. Mixing memory domains (for example cache mode) is not allowed.

^ Tag already exists but values have to be defined in parametersets.jube.xml.

Modification

The following Modifications are allowed, depending on the configuration:

Name Compiler Compiler Flags Size Threads Affinity Source Code
Threads1 Yes Yes No^ No No No
Threads4 Yes Yes No No No No
ThreadsCores Yes Yes No No No No
ThreadsHyper Yes Yes No No No No
Optimal Yes Yes No Yes Yes No

Here, Size refers to the length of the arrays in STREAM (i.e STREAM_ARRAY_SIZE / JUBE variable size), Threads to the number of OpenMP threads (OMP_NUM_THREADS/ JUBE's threadspertask), Affinity to the locations of spawned threads (OMP_PLACES, OMP_PROC_BIND, ...). For the Threads1 configuration, a report of additional array sizes is expected for the qualitative evaluation.

^: STREAM_ARRAY_SIZE should be 2^28 for the evaluation, but a scan through multiple array sizes (2^15 to 2^28) should be reported as part of the feedback.

Verification

All execution should run without errors. In case of JUBE: status done, no errors. Only in case of out-of-memory crashes, the range of STREAM_ARRAY_SIZE values can be narrowed to prevent them.

Results

In most cases, the average triad bandwidth is to be reported. For the case of the freely chosen Optimal configuration, please report the best bandwidth achieved. Should multiple memory domains be available (for example HBM and DDR), the benchmark is to be run separately for each domain.

Abbreviated example output follow.

stage modules mem CLK/Ins (>20) N Copy Scale Add Triad runtime[sec]
2022 GCC 3.0 55.0 200 16786 19590 26872 26709 0.47
2022 Intel 3.0 57.0 200 22554 26494 26441 26655 0.48
2022 NVHPC 3.0 200.0 200 10701 9763 15062 14096 0.62
2022 AOCC 200 0.49
stage modules Exp Array [MiB] Thread [MiB] Threads CLK/Ins (>20) N Triad runtime[sec]
2022 GCC 10 0.0 0.0 8 5.0 200 8589.9 0.37
2022 GCC 12 0.0 0.1 8 4.0 200 34359.7 0.37
2022 GCC 14 0.1 0.4 8 5.0 200 103079.2 0.36
2022 GCC 16 0.5 1.5 8 6.0 200 329853.5 0.37
2022 GCC 18 2.0 6.0 8 18.0 200 488671.8 0.38
2022 GCC 20 8.0 24.0 8 41.0 200 514893.3 0.41
2022 GCC 22 32.0 96.0 8 882.0 200 167478.2 0.92
2022 GCC 24 128.0 384.0 8 3672.0 200 166520.4 2.95
2022 GCC 26 512.0 1536.0 8 14652.0 200 159671.9 9.84
2022 GCC 28 2048.0 6144.0 8 53737.0 200 127464.6 35.08
2022 GCC 32 32768.0 98304.0 8 380153.0 200 153141.9 498.92

Commitment

The following values should be reported. Any run with CLK/Ins <= 20 is not counted. For Threads1, the average triad bandwidth for array sizes from 2^15 to 2^28 is to be reported as well. The configuration chosen or Optimal needs to be given.
If multiple memory types are exposed to user space (for example HBM in flat mode), the benchmark should be run on each.

Name Values
Threads1 Average triad bandwidth for array size 2^18
Threads4 Average triad bandwidth for array size 2^18
ThreadsCores Average triad bandwidth for array size 2^18
ThreadsHyper Average triad bandwidth for array size 2^18
Optimal Average triad bandwidth for array size 2^18