Architecture of triSYCL runtime and compiler

triSYCL is a pure C++20 header-only runtime library providing SYCL support on CPU and accelerators.

To target devices, a device compiler based on Clang/LLVM is required to extract the kernels from the SYCL program, compile them for the target device and generate on the host side some glue code to call the kernels.

The main library is defined in ../include/CL/sycl.hpp which introduces the API inside the ::cl::sycl namespace as defined by the SYCL 1.2.1 standard.

As a convenience extension, there are two other include files that can be used instead:

  • ../include/sycl/sycl.hpp to have the SYCL API defined inside ::sycl as a shortcut to save 4 letters. It does not define anything inside the ::cl::sycl namespace, allowing another SYCL implementation to coexist there;
  • ../include/triSYCL/sycl.hpp which is where the triSYCL implementation resides. It defines all the API under ::trisycl, leaving the ::cl and ::sycl namespaces free to be used by some other implementations, allowing triSYCL to coexist with them and even to use them.
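
For example, with the triSYCL-specific header everything lives under ::trisycl (a minimal sketch, assuming the usual SYCL queue API):

#include <triSYCL/sycl.hpp>

int main() {
  // ::cl::sycl and ::sycl are left untouched, so another SYCL
  // implementation can be used in the same program
  trisycl::queue q;   // instead of cl::sycl::queue or sycl::queue
  q.wait();           // assuming the standard SYCL 1.2.1 queue API
}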

All the headers are mainly included in alphabetical order in ../include/triSYCL/sycl.hpp.

More generally, all the header files are included in alphabetical order, but with the standard headers first, then the Boost libraries and finally the triSYCL headers.

The coding style is similar to the STL or Boost libraries, with lines of fewer than 80 characters, to fit on a standard punch card. :-)

Tab characters are not used. Be sure to configure your editor to help you and to prevent parasitic spacing, such as trailing white space.

To avoid ODR conflicts with a pure header library, C++17 inline variables are used extensively.
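
As a reminder of the idiom (a minimal sketch, not actual triSYCL code), an inline variable defined in a header has a single program-wide instance even when the header is included from many translation units:

// some_header.hpp, included from many .cpp files
#pragma once
#include <atomic>

namespace trisycl::detail {

// "inline" (C++17) makes this one single object across the whole
// program instead of an ODR violation
inline std::atomic<int> object_counter { 0 };

}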

Each SYCL concept (for example a trisycl::queue or, according to the header included, a sycl::queue or a cl::sycl::queue) is defined in its own header file (for example ../include/triSYCL/queue.hpp) so it is easy to find the definition of a concept.

If a concept requires more implementation details, these are defined inside a related detail directory: for example for the queue, besides the file ../include/triSYCL/queue.hpp, there might also be some implementation files inside ../include/triSYCL/queue/detail.

In ../include/triSYCL/detail there is also some code used by the implementation but not directly related to specific SYCL classes.

The documentation of the source code itself is based on Doxygen.

Doxygen modules are used to group the code elements according to various themes.
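
For example (a sketch with a hypothetical group name, using the standard Doxygen syntax):

/** \addtogroup execution Execution-model classes
    @{
*/

/// The Doxygen comment documenting the class, attached to the
/// module opened above
class queue {
  // ...
};

/// @}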

To build the Doxygen documentation, in the top directory, run

make

which will produce tmp/Doxygen/SYCL with the API documentation and tmp/Doxygen/triSYCL with the documented triSYCL implementation source code.

To publish the documentation on GitHub:

make publish

and finish as explained by the make output.

The documentation workflow is basically implemented with ../Makefile and ../dev/publish_Doxygen.

This script builds the content of the https://github.com/triSYCL/triSYCL/tree/gh-pages branch, which is used by GitHub to be published as https://triSYCL.github.io/triSYCL

The connection between this gh-pages branch and https://triSYCL.github.io/triSYCL is configured with https://github.com/triSYCL/triSYCL/settings in the "GitHub Pages" section.

../dev/publish_Doxygen is a little bit convoluted because it connects the history of the documentation to the history of the project itself, for traceability. So if you look at the main branches of the project there are no compiled files for the documentation, while at the same time the gh-pages branch contains only the compiled documentation and not the source directories. But the history is still connected to the main branch, to track exactly where the documentation comes from.

The implementation for CPU is a pure C++17 templated header library and does not require a specific C++ or SYCL compiler.

The dataflow SYCL infrastructure between kernels related by buffer/accessor dependencies is implemented in ../include/triSYCL/command_group/detail/task.hpp with plain C++ std::thread and std::condition_variable. It should be updated to a more efficient tasking library in the future, such as Boost.Fiber or TBB.
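
A minimal sketch of the idea (not the actual task.hpp implementation): a kernel task runs in its own thread and blocks on a condition variable until all the buffers it depends on have been produced:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

struct task {
  std::mutex m;
  std::condition_variable cv;
  int pending = 0; // number of producer tasks still to complete

  // Called by a producer task when it has finished writing a buffer
  void notify_dependency_done() {
    { std::scoped_lock lock { m }; --pending; }
    cv.notify_one();
  }

  // Run the kernel in its own thread once all the dependencies are met
  void schedule(std::function<void()> kernel) {
    std::thread { [this, kernel = std::move(kernel)] {
      std::unique_lock lock { m };
      cv.wait(lock, [this] { return pending == 0; });
      kernel();
    } }.detach();
  }
};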

All the kernel code itself is accelerated with OpenMP or with TBB according to some macro parameters, allowing various behaviors. See ../include/triSYCL/parallelism/detail/parallelism.hpp or ../include/triSYCL/parallelism/detail/parallelism_tbb.hpp for the implementation details.

Since barriers are available in SYCL and the CPU triSYCL implementation does not use a compiler to restructure the kernel code, barriers are implemented with CPU threads provided by OpenMP. This is massively inefficient. If you know that there will be no barrier, you should define the TRISYCL_NO_BARRIER macro first.
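
For example (a minimal sketch; since the headers are configured at compile time, the macro has to be defined before the include):

// State that no kernel ever uses a barrier, so the expensive
// thread-based barrier implementation can be skipped
#define TRISYCL_NO_BARRIER 1
#include <CL/sycl.hpp>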

Anyway, low-level OpenCL-style barriers should not be used in modern SYCL code. Hierarchical parallelism, which is performance portable between device and CPU, is preferable.

Otherwise, an OpenCL target on CPU can be used to rely on the CPU OpenCL stack to do a CPU-friendly SIMD-ization of the barrier-spaghetti code. But this relies on the triSYCL device compiler...

All this is obsolete and has been superseded by https://github.com/triSYCL/sycl

When targeting an accelerator, even if SYCL is a pure C++ DSEL, a specific compiler is required to extract the kernel code, compile it for some target device and, at the same time, compile on the host side some glue code around the extraction boundary to transfer data to and from the device and call the kernel itself.

The device compiler is very experimental and exists in several versions, mainly dependent on different Clang/LLVM versions.

The specific versions are in branches with names starting with sycl/, such as the sycl/master branch used below.

First download or clone the device compiler repositories, for example with:

git clone --branch sycl/master git@github.com:triSYCL/llvm.git
cd llvm/tools
git clone --branch sycl/master git@github.com:triSYCL/clang.git
cd ../..

Then compile for example with:

mkdir build
cd build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
  -DLLVM_ENABLE_CXX1Y=On \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD="X86" \
  -DLLVM_BUILD_LLVM_DYLIB:BOOL=ON \
  -DLLVM_LINK_LLVM_DYLIB:BOOL=ON \
  ../llvm
# Use -j with 1 thread per core to speed up compilation
make -j`nproc`

You might replace Release by Debug above if you want to debug the compiler itself. Look at https://llvm.org/docs/CMake.html for more information.

Compilation and installation of the triSYCL runtime:

git clone --branch device git@github.com:triSYCL/triSYCL.git
# Compile the triSYCL_tool command
cd triSYCL/src
make

Unfortunately there is no Clang driver support yet to directly generate both the host and device parts, and it is up to the end user for now, since it is still experimental and in development. So, using the compiler is... painful. :-(

It is expected to be used for example with the examples from ../tests/device_compiler. Everything is driven from ../tests/Makefile when making a target ending with the .kernel_caller extension, such as tests/device_compiler/single_task_vector_add_drt.kernel_caller.

triSYCL assumes some recent Clang/LLVM is installed, independently of the one used by the device compiler, which might not be recent enough.

A recent version of Boost is required. It is available with the libboost-all-dev package on Debian/Ubuntu or with some more modern specific versions such as libboost1.67-all-dev.

The following assumes that you have an OpenCL ICD installed on the machine, to allow several OpenCL platforms to be usable at the same time, for example the ocl-icd-libopencl1 package on Debian/Ubuntu.

The device compiler generates the kernels as SPIR-df (de facto), which is SPIR 2.0 encoded with LLVM IR of a more recent version than the LLVM 3.4 expected by the SPIR specification. So a very modern SPIR consumer is required, such as a recent PoCL. It is not the version available in Ubuntu 17.10 for example, so you might have to compile and install PoCL on your own... The rule is that PoCL has to use a Clang/LLVM at least as recent as the one used by the device compiler to be able to consume the IR.

Note that you can also use PoCL to target CUDA, as a way to use SYCL on NVIDIA GPUs.

Set up the environment:

# Used by the tests Makefile to find the device compiler
export LLVM_BUILD_DIR=<directory_where_LLVM_is_built>

# Use PoCL OpenCL stack
export BOOST_COMPUTE_DEFAULT_PLATFORM='Portable Computing Language'
# Do not use another OpenCL stack if the one requested is not available
export BOOST_COMPUTE_DEFAULT_ENFORCE=1
# OPTIONAL: used by the tests Makefile; tells triSYCL to launch kernels
# with an OpenCL nd_range_kernel when parallel_for is used, rather than
# queuing a single work-item task and looping over the range inside the
# work-item. This variable is used by the Makefile and requires
# recompilation if you change its value
export TRISYCL_USE_OPENCL_ND_RANGE=1

Compile and execute a small example:

cd tests
make -j2 device_compiler/single_task_vector_add_drt.kernel_caller
device_compiler/single_task_vector_add_drt.kernel_caller
  [...]
  Queue waiting for kernel completion

  **** no errors detected

Let's assume you have installed Xilinx SDx somewhere and have a /etc/OpenCL/vendors/xilinx.icd file containing the string libxilinxopencl.so so that the OpenCL ICD indirection works.

Initialize the environment with something like:

export XILINX_SDX=/opt/Xilinx/SDx/2018.3
PATH=$PATH:$XILINX_SDX/bin
export LD_LIBRARY_PATH=$XILINX_SDX/runtime/lib/x86_64:$XILINX_SDX/lib/lnx64.o

# Used by the tests Makefile to find the device compiler
export LLVM_BUILD_DIR=<directory_where_LLVM_is_built>

# Use the Xilinx OpenCL stack
export BOOST_COMPUTE_DEFAULT_PLATFORM=Xilinx
# Do not use another OpenCL stack if the one requested is not available
export BOOST_COMPUTE_DEFAULT_ENFORCE=1

Compile and execute a small example:

cd tests
make -j2 device_compiler/single_task_vector_add_drt.kernel_caller
device_compiler/single_task_vector_add_drt.kernel_caller
  [...]
  Queue waiting for kernel completion

  **** no errors detected

Note that since the final code contains the FPGA bit-stream configuration file instead of the SPIR representation, it takes quite a lot of time to be generated through SDx...

High-level compilation workflow in triSYCL

Figure 1: High-level view of the compilation workflow in triSYCL.

When compiling on CPU, since triSYCL relies on the fact that SYCL is a pure C++ executable DSEL, the C++ SYCL code is just compiled with any host compiler (top of Figure 1) together with the SYCL runtime (bottom left of Figure 1), which is a plain C++ header library. A CPU executable is generated, using OpenMP for multithreading.
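
For example, this minimal single-source SYCL program (a sketch using only the standard SYCL 1.2.1 API) builds with a plain C++ compiler:

#include <CL/sycl.hpp>

int main() {
  cl::sycl::queue q;
  cl::sycl::buffer<int> b { cl::sycl::range<1> { 4 } };
  q.submit([&](cl::sycl::handler &cgh) {
    auto a = b.get_access<cl::sycl::access::mode::write>(cgh);
    // On CPU this runs on threads provided by the runtime (OpenMP)
    cgh.parallel_for<class fill>(cl::sycl::range<1> { 4 },
                                 [=](cl::sycl::id<1> i) { a[i] = 42; });
  });
}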

If some OpenCL features are used through the interoperability mode (non-single-source SYCL), then an OpenCL library is required to interact with some OpenCL devices.

When using SYCL in single-source mode on device, the compilation flow is quite a bit more complex, because it requires a device compiler to split and compile the code for the final target.

The Clang/LLVM-based device compiler (bottom of Figure 1) compiles the C++ SYCL code as for CPU, but keeps only the kernel part of the code and produces a Standard Portable Intermediate Representation (SPIR) of the kernels. For now, triSYCL uses SPIR-df (de facto), a non-conforming SPIR 2.0 encoded in something newer than LLVM 3.4 IR. But you could graft an official SPIR down-caster if you have one, or a SPIR-V generator using this SPIR-df.

Then this SPIR-df output is optionally compiled by some vendor compiler to speed up the launch time by doing some compilation ahead of time. With PoCL this is not done (dashed arrow) but for FPGA it is done ahead of time since compilation is very slow.

In single-source mode on the host side, the source code also has to go through the device compiler, but to do the dual operation: remove the kernel code and keep only the host code. This is also where the glue code to call the kernels and to serialize the arguments is generated.

The kernel binary generated by the other compiler flow is also included in the host code, so that the main host executable is self-contained and can start the kernels on the device without having to load the binary from an external file. It is a manual way to get a fat binary; the official off-loading Clang/LLVM way could probably be used in the future.

Low-level compilation workflow in triSYCL

Figure 2: Low-level view of the compilation workflow in triSYCL.

The real workflow is currently implemented in ../tests/Makefile and this is the current source of truth. The path to go for example from an ex.cpp file to a final ex.kernel_caller is summarized in Figure 2.

Each intermediate file is characterized by a specific extension:

.cpp
for the single-source SYCL C++ input file;
.bc
some LLVM IR bitcode;
.ll
some LLVM IR in textual assembly syntax;
.kernel_caller
for the final host executable, with the kernel binary internalized so the host can load and launch the kernels on the devices without external files.

Note that the file without any extension is actually the normal CPU-only executable; it does not appear in this figure because the figure is about compiling for device instead.

All the SYCL-specific LLVM passes are in the lib/SYCL directory of LLVM.

The file extensions used on the host side are:

.pre_kernel_caller.ll
the SYCL C++ code compiled by Clang for the host side, including the call of the kernels;
.kernel_caller.ll
the LLVM IR of the host code after the triSYCL-specific LLVM transformation passes.

To generate the .pre_kernel_caller.ll file, the source code is compiled with:

clang -O3 -sycl

which is basically clang unchanged, but with the loop-idiom detection pass skipped, because otherwise it generates some memory-copy intrinsic functions that prevent the argument flattening from working later.

The -O3 is important to generate optimized minimal code that can be massaged later, with a lot of inlining to make the C++ constructs disappear. Otherwise, less optimized code breaks a lot of assumptions in the later triSYCL-specific LLVM passes.

The compilation flow to generate the final .kernel_caller.ll file is based on LLVM opt to apply a sequence of LLVM passes:

-globalopt -deadargelim
to clean up the code before the SYCL massaging;
-SYCL-args-flattening
is a fundamental SYCL-specific pass that takes the lambda capture (basically a C++ structure passed by address) of a SYCL kernel lambda expression and flattens it into its contents. So if the capture has several scalar and accessor members, the structure address used in the function call is replaced by a function call with all the members explicitly passed as arguments. This makes the classical OpenCL-style kernel parameters show up, as sketched below;
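
Conceptually (shown as C++ signatures rather than LLVM IR, with hypothetical names):

// Before the pass: the kernel receives the whole lambda capture
// by address
struct capture { int *a; int *b; float scale; };
void kernel_before(capture *c);

// After the pass: each member of the capture becomes an explicit
// argument, i.e. the classical OpenCL-style kernel signature
void kernel_after(int *a, int *b, float scale);
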
-loop-idiom
then the loop-idiom detection pass, which was not applied before to avoid choking the SYCL-args-flattening pass, can now be applied to optimize some loops and generate the LLVM intrinsics representing for example memory copies and initializations;
-deadargelim
removes some dead code that might be left by previous passes;
-SYCL-serialize-arguments

is another fundamental SYCL-specific pass, on the host side, which replaces a kernel function call by some calls to the runtime to select the kernel and serialize all the kernel arguments.

The input code from the triSYCL headers of the form

cl::sycl::detail::set_kernel_task_marker(t);
cl::sycl::detail::instantiate_kernel<KernelName>(/* flatten args */);

is replaced by

cl::sycl::drt::set_kernel(detail::task &task, const char *kernel_name,
                          const char *kernel_short_name);
// For each parameter call:
// either for a scalar argument
cl::sycl::drt::serialize_arg(detail::task &task, std::size_t index,
                             void *arg, std::size_t arg_size);
// or for an accessor argument
cl::sycl::drt::serialize_accessor_arg(detail::task &task, std::size_t index,
                                      void *arg, std::size_t arg_size);

The marking functions generated by triSYCL headers are in ../include/triSYCL/detail/instantiate_kernel.hpp while the functions used by the transformed code are in ../include/triSYCL/device_runtime.hpp. The functions from cl::sycl::drt:: are the link to the underlying runtime, such as OpenCL.

-deadargelim
again, to remove some dead code that might be left by the previous pass.

The file extensions used on the kernel side are:

.pre_kernel.ll
the SYCL C++ code compiled by Clang for the device side, including the kernels;
.kernel.bc
the LLVM IR of the kernel code after the triSYCL LLVM transformation passes;
.kernel.bin
is for the kernel binary to be shipped into the final host executable. This is typically a SPIR LLVM IR bitcode or an FPGA bitstream configuration;
.kernel.internalized.cxx

is the kernel binary represented as C++ code so it can just be compiled by a C++ compiler to have it internalized into the final host binary and used by the runtime.

It is constructed from the .kernel.bin file through the helper triSYCL_tool --source-in.
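
The shape of such a generated file might look like this (a purely hypothetical sketch; the actual layout is whatever triSYCL_tool emits):

#include <cstddef>

// Hypothetical names: the kernel binary is embedded as a byte array
// that the runtime can hand to the OpenCL stack without touching the
// file system (0x42 0x43 0xC0 0xDE is the LLVM bitcode magic number)
const unsigned char kernel_binary[] = {
  0x42, 0x43, 0xc0, 0xde, /* ... the rest of the kernel binary ... */
};
const std::size_t kernel_binary_size = sizeof kernel_binary;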

To generate the .pre_kernel.ll file, the source code is compiled with:

clang -O3 -DTRISYCL_DEVICE -sycl -sycl-is-device

This is similar to the compilation for the host side and the -O3 is important for the same reasons. -DTRISYCL_DEVICE is used so the triSYCL headers behave slightly differently for the device code, mainly enabling some address-space-related code used to represent for example OpenCL global or local memory.

As for the host-side path, the compilation flow to generate the final .kernel.bc file is based on LLVM opt to apply a sequence of LLVM passes:

-globalopt -deadargelim -SYCL-args-flattening -deadargelim
are applied as for the host side. It is important to have globally the same code compiled with the same passes on both the host and device sides, to keep the code synchronized before serialization. Otherwise it would lead to some mismatch and some wrong global code at the end;
-SYCL-kernel-filter
this is one of the most important SYCL-specific passes on the device side, extracting the kernels from the single-source code. It actually works in 2 stages, in a mark-and-sweep approach. This is the first stage, which marks all the kernels with external linkage (tricking the compiler into believing they might be used from outside) and all the non-kernel parts with internal linkage;
-globaldce
this is the second stage of the kernel selection. It removes all the dead code of the program. Since only the kernels have been marked as potentially used from the outside, after this pass only what is transitively useful to the kernels is left: only the device code remains;
-RELGCD
compiling C++ comes with an ABI storing the lists of global static constructors and destructors. Unfortunately, even if at the end these lists are empty because of the SYCL specification, they are not removed by -globaldce and they are not yet supported by SPIR. So this SYCL-specific pass Removes the Empty Lists of Global Constructors and Destructors (RELGCD);
-reqd-workgroup-size-1
in the case where the kernels are compiled with only 1 SPIR work-group of 1 work-item (a common use case on FPGA), this SYCL-specific pass adds a SPIR metadata entry on the kernels to specify that they will be called with only 1 work-item. This way the target compiler can spare some resources on the device;
-inSPIRation
is the SYCL-specific pass generating the SPIR 2.0-style LLVM IR output. Since it generates LLVM IR with the version of the recent LLVM used, it is quite more modern than the official SPIR 2.0 based on LLVM 3.4 IR. So it is a SPIR-df (de facto), which is nevertheless accepted by some tools. By using a bitcode down-caster, it could probably produce some decent official SPIR 2.0 encoded in LLVM 3.4 IR. Otherwise a SPIR-V back-end could generate some SPIR-V code from it;
-globaldce
is the last clean-up, removing unused functions, for example __gxx_personality_v0 that was used to specify the exception-handling flavour of the kernel functions before the SPIR transformation.

Look at testing.rst and ../tests/README.rst.

Travis CI is used to validate triSYCL with its test suite from tests/, on CPU and on OpenCL with the interoperability mode, using CMake ctest.

The device compiler is not tested yet through Travis CI. :-(

Look at ../.travis.yml and ../Dockerfile for the configuration.