Merge branch 'development' into 'master'
CNugteren committed Aug 3, 2016
2 parents 68a3882 + 9f6a5a6 commit 182e068
Showing 6 changed files with 122 additions and 77 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG
@@ -1,4 +1,12 @@

Version 7.0 (2016-08-03):
- Rewrote the OpenCL event implementation with proper memory management
- Updated some return types of device-query information to fix issues on 32-bit systems
- Updated the API documentation
- Refactored some functions to reduce the amount of code
- Added new methods to the API:
* Kernel::GetFunctionName

Version 6.0 (2016-06-29):
- Added the possibility to use Event pointers, adjusted the Kernel::Launch function to do so
- Added a new constructor for Program based on a binary (OpenCL only)
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -31,7 +31,7 @@
# CMake project details
cmake_minimum_required(VERSION 2.8.10)
project("CLCudaAPI" CXX)
set(CLCudaAPI_VERSION_MAJOR 6)
set(CLCudaAPI_VERSION_MAJOR 7)
set(CLCudaAPI_VERSION_MINOR 0)

# ==================================================================================================
33 changes: 25 additions & 8 deletions doc/api.md
@@ -58,7 +58,7 @@ Retrieves the maximum total number of threads in an OpenCL work-group or CUDA thread-block.
* `size_t MaxWorkItemDimensions() const`:
Retrieves the maximum number of dimensions (e.g. 2D or 3D) in an OpenCL work-group or CUDA thread-block.

* `size_t LocalMemSize() const`:
* `unsigned long LocalMemSize() const`:
Retrieves the maximum amount of on-chip scratchpad memory ('local memory') available to a single OpenCL work-group or CUDA thread-block.

* `std::string Capabilities() const`:
@@ -70,10 +70,10 @@ Retrieves the device's core clock frequency in MHz.
* `size_t ComputeUnits() const`:
Retrieves the number of compute units (OpenCL terminology) or multi-processors (CUDA terminology) in the device.

* `size_t MemorySize() const`:
* `unsigned long MemorySize() const`:
Retrieves the total global memory size.

* `size_t MaxAllocSize() const`:
* `unsigned long MaxAllocSize() const`:
Retrieves the maximum amount of allocatable global memory per allocation.

* `size_t MemoryClock() const`:
@@ -82,13 +82,29 @@ Retrieves the device's memory clock frequency in MHz (CUDA back-end) or 0 (OpenCL back-end).
* `size_t MemoryBusWidth() const`:
Retrieves the device's memory bus-width in bits (CUDA back-end) or 0 (OpenCL back-end).


* `bool IsLocalMemoryValid(const size_t local_mem_usage) const`:
Given a requested amount of local on-chip scratchpad memory, this method returns whether or not this is a valid configuration for this particular device.

* `bool IsThreadConfigValid(const std::vector<size_t> &local) const`:
Given a requested OpenCL work-group or CUDA thread-block configuration `local`, this method returns whether or not this is a valid configuration for this particular device.

* `bool IsCPU() const`:
Determines whether this device is of the CPU type.

* `bool IsGPU() const`:
Determines whether this device is of the GPU type.

* `bool IsAMD() const`:
Determines whether this device is of the AMD brand.

* `bool IsNVIDIA() const`:
Determines whether this device is of the NVIDIA brand.

* `bool IsIntel() const`:
Determines whether this device is of the Intel brand.

* `bool IsARM() const`:
Determines whether this device is of the ARM brand.
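
Taken together, the query and validity methods above allow a quick capability check before launching work. A minimal sketch, assuming the OpenCL back-end header `clpp11.h` is on the include path and an OpenCL device is present (the platform/device indices and the use of `Name()` are illustrative assumptions, not part of this commit):

```cpp
#include <iostream>
#include <vector>
#include "clpp11.h"  // CLCudaAPI, OpenCL back-end

int main() {
  auto platform = CLCudaAPI::Platform(size_t{0});
  auto device = CLCudaAPI::Device(platform, size_t{0});

  // Memory sizes are 'unsigned long' as of version 7.0: cl_ulong is 64-bit
  // even on 32-bit systems, where size_t would truncate the value.
  std::cout << "Local memory:  " << device.LocalMemSize() << " bytes\n";
  std::cout << "Global memory: " << device.MemorySize() << " bytes\n";
  std::cout << "Max alloc:     " << device.MaxAllocSize() << " bytes\n";
  std::cout << "GPU: " << device.IsGPU() << ", NVIDIA: " << device.IsNVIDIA() << "\n";

  // Validity checks before committing to a thread configuration:
  const auto local = std::vector<size_t>{16, 16};
  if (!device.IsThreadConfigValid(local) || !device.IsLocalMemoryValid(16 * 1024)) {
    std::cout << "requested configuration is not supported on this device\n";
  }
  return 0;
}
```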

CLCudaAPI::Context
-------------
@@ -226,13 +242,14 @@ Method to set a kernel argument (l-value or r-value). The argument `index` specifies the position.

* `template <typename... Args> void SetArguments(Args&... args)`: As above, but now sets all arguments in one go, starting at index 0. This overwrites any previous arguments (if any). The parameter pack `args` takes any number of arguments of different types, including `CLCudaAPI::Buffer`.

* `size_t LocalMemUsage(const Device &device) const`:
* `unsigned long LocalMemUsage(const Device &device) const`:
Retrieves the amount of on-chip scratchpad memory (local memory in OpenCL, shared memory in CUDA) required by this specific kernel.

* `std::string GetFunctionName() const`:
Retrieves the name of the kernel (OpenCL only).

* `Launch(const Queue &queue, const std::vector<size_t> &global, const std::vector<size_t> &local, Event &event)`:
Launches a kernel onto the specified queue. This kernel launch is asynchronous: the method may return before the device kernel has completed. The total number of threads launched is given by the `global` vector; the number of threads per OpenCL work-group or CUDA thread-block is given by the `local` vector. The elapsed time is recorded into the `event` argument.

* `Launch(const Queue &queue, const std::vector<size_t> &global, const std::vector<size_t> &local, Event &event, std::vector<Event>& waitForEvents)`: As above, but now this kernel is only launched after the other specified events have finished (OpenCL only).

* `Launch(const Queue &queue, const std::vector<size_t> &global, Event &event)`: As above, but now the local size is determined automatically (OpenCL only).
* `Launch(const Queue &queue, const std::vector<size_t> &global, const std::vector<size_t> &local, Event &event, std::vector<Event>& waitForEvents)`: As above, but the kernel is only launched after the other specified events have finished. If `local` is empty, the local work-group size is determined automatically (OpenCL only).
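
The overloads above compose into a profiled, dependency-ordered launch. A hedged sketch for the OpenCL back-end; the `queue`, the compiled `kernel`, and its arguments are assumed to already exist, and since the header in this commit takes an `EventPointer`, the event is passed via `event.pointer()`:

```cpp
// Sketch: 'queue' is a CLCudaAPI::Queue and 'kernel' a compiled CLCudaAPI::Kernel
// whose arguments were already set via SetArguments(...).
auto global = std::vector<size_t>{2048};  // total number of threads
auto local = std::vector<size_t>{};       // empty: local size chosen automatically (OpenCL only)
auto dependencies = std::vector<CLCudaAPI::Event>{};  // events to wait for (may be empty)
auto event = CLCudaAPI::Event();
kernel.Launch(queue, global, local, event.pointer(), dependencies);
event.WaitForCompletion();                // the launch itself is asynchronous
std::cout << "elapsed: " << event.GetElapsedTime() << " ms\n";
```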

118 changes: 65 additions & 53 deletions include/clpp11.h
@@ -12,7 +12,7 @@
// Portability here means that a similar header exists for CUDA with the same classes and
// interfaces. In other words, moving from the OpenCL API to the CUDA API becomes a one-line change.
//
// This is version 6.0 of CLCudaAPI.
// This is version 7.0 of CLCudaAPI.
//
// =================================================================================================
//
@@ -71,37 +71,46 @@ inline void CheckError(const cl_int status) {
class Event {
public:

// Constructor based on the regular OpenCL data-type
explicit Event(const cl_event event): event_(event) { }
// Constructor based on the regular OpenCL data-type: memory management is handled elsewhere
explicit Event(const cl_event event):
event_(new cl_event) {
*event_ = event;
}

// Regular constructor
explicit Event(): event_(nullptr) { }
// Regular constructor with memory management
explicit Event():
event_(new cl_event, [](cl_event* e) {
if (*e) { CheckError(clReleaseEvent(*e)); }
delete e;
}) {
*event_ = nullptr;
}

// Waits for completion of this event
void WaitForCompletion() const {
CheckError(clWaitForEvents(1, &event_));
CheckError(clWaitForEvents(1, &(*event_)));
}

// Retrieves the elapsed time of the last recorded event. Note that no error checking is done on
// the 'clGetEventProfilingInfo' function, since there is a bug in Apple's OpenCL implementation:
// http://stackoverflow.com/questions/26145603/clgeteventprofilinginfo-bug-in-macosx
float GetElapsedTime() const {
WaitForCompletion();
auto bytes = size_t{0};
clGetEventProfilingInfo(event_, CL_PROFILING_COMMAND_START, 0, nullptr, &bytes);
auto time_start = size_t{0};
clGetEventProfilingInfo(event_, CL_PROFILING_COMMAND_START, bytes, &time_start, nullptr);
clGetEventProfilingInfo(event_, CL_PROFILING_COMMAND_END, 0, nullptr, &bytes);
auto time_end = size_t{0};
clGetEventProfilingInfo(event_, CL_PROFILING_COMMAND_END, bytes, &time_end, nullptr);
return (time_end - time_start) * 1.0e-6f;
const auto bytes = sizeof(cl_ulong);
auto time_start = cl_ulong{0};
clGetEventProfilingInfo(*event_, CL_PROFILING_COMMAND_START, bytes, &time_start, nullptr);
auto time_end = cl_ulong{0};
clGetEventProfilingInfo(*event_, CL_PROFILING_COMMAND_END, bytes, &time_end, nullptr);
return static_cast<float>(time_end - time_start) * 1.0e-6f;
}

// Accessor to the private data-member
cl_event& operator()() { return event_; }
cl_event* pointer() { return &event_; }
cl_event& operator()() { return *event_; }
const cl_event& operator()() const { return *event_; }
cl_event* pointer() { return &(*event_); }
const cl_event* pointer() const { return &(*event_); }
private:
cl_event event_;
std::shared_ptr<cl_event> event_;
};
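
The rewritten constructor above hands the raw `cl_event` to a `std::shared_ptr` whose custom deleter both releases the OpenCL event and frees the holder, so copies of an `Event` share one handle and only the last copy releases it. The same ownership pattern can be sketched with a plain stand-in handle instead of a real `cl_event` (all names here are illustrative, not part of the API):

```cpp
#include <memory>

static int release_count = 0;  // counts calls to the stand-in "release"
using Handle = int*;           // stand-in for cl_event

// Mirrors the Event constructor: the deleter releases the handle
// (clReleaseEvent(*e)) and then frees the holder itself (delete e).
int DemoSharedHandle() {
  release_count = 0;
  static int resource = 42;    // stand-in for the driver-side event object
  {
    auto holder = std::shared_ptr<Handle>(new Handle(nullptr), [](Handle* h) {
      if (*h) { ++release_count; }  // release only if a handle was ever set
      delete h;
    });
    *holder = &resource;
    auto copy = holder;        // copied Events share the same underlying handle
  }                            // last owner destroyed: deleter runs exactly once
  return release_count;
}
```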

// Pointer to an OpenCL event
@@ -184,24 +193,32 @@ class Device {
}
size_t MaxWorkGroupSize() const { return GetInfo<size_t>(CL_DEVICE_MAX_WORK_GROUP_SIZE); }
size_t MaxWorkItemDimensions() const {
return GetInfo(CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS);
return static_cast<size_t>(GetInfo<cl_uint>(CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS));
}
std::vector<size_t> MaxWorkItemSizes() const {
return GetInfoVector<size_t>(CL_DEVICE_MAX_WORK_ITEM_SIZES);
}
size_t LocalMemSize() const {
return static_cast<size_t>(GetInfo<cl_ulong>(CL_DEVICE_LOCAL_MEM_SIZE));
unsigned long LocalMemSize() const {
return GetInfo<cl_ulong>(CL_DEVICE_LOCAL_MEM_SIZE);
}
std::string Capabilities() const { return GetInfoString(CL_DEVICE_EXTENSIONS); }
size_t CoreClock() const { return GetInfo(CL_DEVICE_MAX_CLOCK_FREQUENCY); }
size_t ComputeUnits() const { return GetInfo(CL_DEVICE_MAX_COMPUTE_UNITS); }
size_t MemorySize() const { return GetInfo(CL_DEVICE_GLOBAL_MEM_SIZE); }
size_t MaxAllocSize() const { return GetInfo(CL_DEVICE_MAX_MEM_ALLOC_SIZE); }
size_t CoreClock() const {
return static_cast<size_t>(GetInfo<cl_uint>(CL_DEVICE_MAX_CLOCK_FREQUENCY));
}
size_t ComputeUnits() const {
return static_cast<size_t>(GetInfo<cl_uint>(CL_DEVICE_MAX_COMPUTE_UNITS));
}
unsigned long MemorySize() const {
return static_cast<unsigned long>(GetInfo<cl_ulong>(CL_DEVICE_GLOBAL_MEM_SIZE));
}
unsigned long MaxAllocSize() const {
return static_cast<unsigned long>(GetInfo<cl_ulong>(CL_DEVICE_MAX_MEM_ALLOC_SIZE));
}
size_t MemoryClock() const { return 0; } // Not exposed in OpenCL
size_t MemoryBusWidth() const { return 0; } // Not exposed in OpenCL

// Configuration-validity checks
bool IsLocalMemoryValid(const size_t local_mem_usage) const {
bool IsLocalMemoryValid(const cl_ulong local_mem_usage) const {
return (local_mem_usage <= LocalMemSize());
}
bool IsThreadConfigValid(const std::vector<size_t> &local) const {
@@ -218,9 +235,11 @@ class Device {
// Query for a specific type of device or brand
bool IsCPU() const { return Type() == "CPU"; }
bool IsGPU() const { return Type() == "GPU"; }
bool IsAMD() const { return Vendor() == "AMD" || Vendor() == "Advanced Micro Devices, Inc."; }
bool IsAMD() const { return Vendor() == "AMD" || Vendor() == "Advanced Micro Devices, Inc." ||
                            Vendor() == "AuthenticAMD"; }
bool IsNVIDIA() const { return Vendor() == "NVIDIA" || Vendor() == "NVIDIA Corporation"; }
bool IsIntel() const { return Vendor() == "Intel" || Vendor() == "GenuineIntel"; }
bool IsIntel() const { return Vendor() == "INTEL" || Vendor() == "Intel" ||
Vendor() == "GenuineIntel"; }
bool IsARM() const { return Vendor() == "ARM"; }

// Accessor to the private data-member
@@ -237,13 +256,6 @@ class Device {
CheckError(clGetDeviceInfo(device_, info, bytes, &result, nullptr));
return result;
}
size_t GetInfo(const cl_device_info info) const {
auto bytes = size_t{0};
CheckError(clGetDeviceInfo(device_, info, 0, nullptr, &bytes));
auto result = cl_uint(0);
CheckError(clGetDeviceInfo(device_, info, bytes, &result, nullptr));
return static_cast<size_t>(result);
}
template <typename T>
std::vector<T> GetInfoVector(const cl_device_info info) const {
auto bytes = size_t{0};
@@ -593,8 +605,7 @@ class Buffer {

// Retrieves the actual allocated size in bytes
size_t GetSize() const {
auto bytes = size_t{0};
CheckError(clGetMemObjectInfo(*buffer_, CL_MEM_SIZE, 0, nullptr, &bytes));
const auto bytes = sizeof(size_t);
auto result = size_t{0};
CheckError(clGetMemObjectInfo(*buffer_, CL_MEM_SIZE, bytes, &result, nullptr));
return result;
@@ -645,13 +656,22 @@ class Kernel {
}

// Retrieves the amount of local memory used per work-group for this kernel
size_t LocalMemUsage(const Device &device) const {
auto bytes = size_t{0};
unsigned long LocalMemUsage(const Device &device) const {
const auto bytes = sizeof(cl_ulong);
auto query = cl_kernel_work_group_info{CL_KERNEL_LOCAL_MEM_SIZE};
CheckError(clGetKernelWorkGroupInfo(*kernel_, device(), query, 0, nullptr, &bytes));
auto result = size_t{0};
auto result = cl_ulong{0};
CheckError(clGetKernelWorkGroupInfo(*kernel_, device(), query, bytes, &result, nullptr));
return result;
return static_cast<unsigned long>(result);
}

// Retrieves the name of the kernel
std::string GetFunctionName() const {
auto bytes = size_t{0};
CheckError(clGetKernelInfo(*kernel_, CL_KERNEL_FUNCTION_NAME, 0, nullptr, &bytes));
auto result = std::string{};
result.resize(bytes);
CheckError(clGetKernelInfo(*kernel_, CL_KERNEL_FUNCTION_NAME, bytes, &result[0], nullptr));
return std::string{result.c_str()}; // Removes any trailing '\0'-characters
}

// Launches a kernel onto the specified queue
@@ -665,30 +685,22 @@
// As above, but with an event waiting list
void Launch(const Queue &queue, const std::vector<size_t> &global,
const std::vector<size_t> &local, EventPointer event,
std::vector<Event>& waitForEvents) {
if (waitForEvents.size() == 0) { return Launch(queue, global, local, event); }
const std::vector<Event> &waitForEvents) {

// Builds a plain version of the events waiting list
auto waitForEventsPlain = std::vector<cl_event>();
for (auto &waitEvent : waitForEvents) {
waitForEventsPlain.push_back(waitEvent());
if (waitEvent()) { waitForEventsPlain.push_back(waitEvent()); }
}

// Launches the kernel while waiting for other events
CheckError(clEnqueueNDRangeKernel(queue(), *kernel_, static_cast<cl_uint>(global.size()),
nullptr, global.data(), local.data(),
nullptr, global.data(), !local.empty() ? local.data() : nullptr,
static_cast<cl_uint>(waitForEventsPlain.size()),
waitForEventsPlain.data(),
!waitForEventsPlain.empty() ? waitForEventsPlain.data() : nullptr,
event));
}

// As above, but with the default local workgroup size
void Launch(const Queue &queue, const std::vector<size_t> &global, EventPointer event) {
CheckError(clEnqueueNDRangeKernel(queue(), *kernel_, static_cast<cl_uint>(global.size()),
nullptr, global.data(), nullptr,
0, nullptr, event));
}

// Accessor to the private data-member
const cl_kernel& operator()() const { return *kernel_; }
private:
36 changes: 22 additions & 14 deletions include/cupp11.h
@@ -181,20 +181,22 @@ class Device {
GetInfo(CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y),
GetInfo(CU_DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z)};
}
size_t LocalMemSize() const { return GetInfo(CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK); }
unsigned long LocalMemSize() const {
return static_cast<unsigned long>(GetInfo(CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK));
}
std::string Capabilities() const {
auto major = GetInfo(CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR);
auto minor = GetInfo(CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR);
return "SM "+std::to_string(major)+"."+std::to_string(minor);
}
size_t CoreClock() const { return 1e-3*GetInfo(CU_DEVICE_ATTRIBUTE_CLOCK_RATE); }
size_t ComputeUnits() const { return GetInfo(CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT); }
size_t MemorySize() const {
unsigned long MemorySize() const {
auto result = size_t{0};
CheckError(cuDeviceTotalMem(&result, device_));
return result;
return static_cast<unsigned long>(result);
}
size_t MaxAllocSize() const { return MemorySize(); }
unsigned long MaxAllocSize() const { return MemorySize(); }
size_t MemoryClock() const { return 1e-3*GetInfo(CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE); }
size_t MemoryBusWidth() const { return GetInfo(CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH); }

@@ -566,10 +568,15 @@ class Kernel {

// Retrieves the amount of local memory used per work-group for this kernel. Note that this is
// the shared memory in CUDA terminology.
size_t LocalMemUsage(const Device &) const {
unsigned long LocalMemUsage(const Device &) const {
auto result = 0;
CheckError(cuFuncGetAttribute(&result, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, kernel_));
return static_cast<size_t>(result);
return static_cast<unsigned long>(result);
}

// Retrieves the name of the kernel
std::string GetFunctionName() const {
return std::string{"unknown"}; // Not implemented for the CUDA backend
}

// Launches a kernel onto the specified queue
@@ -601,14 +608,15 @@
void Launch(const Queue &queue, const std::vector<size_t> &global,
const std::vector<size_t> &local, EventPointer event,
std::vector<Event>& waitForEvents) {
if (waitForEvents.size() == 0) { return Launch(queue, global, local, event); }
Error("launching with an event waiting list is not implemented for the CUDA back-end");
}

// As above, but with the default local workgroup size
// TODO: Implement this function
void Launch(const Queue &, const std::vector<size_t> &, EventPointer) {
Error("launching with a default workgroup size is not implemented for the CUDA back-end");
if (local.size() == 0) {
Error("launching with a default workgroup size is not implemented for the CUDA back-end");
}
else if (waitForEvents.size() != 0) {
Error("launching with an event waiting list is not implemented for the CUDA back-end");
}
else {
return Launch(queue, global, local, event);
}
}

// Accessors to the private data-members
2 changes: 1 addition & 1 deletion samples/smallest.cc
@@ -51,7 +51,7 @@

int main() {
constexpr auto platform_id = size_t{0};
constexpr auto device_id = size_t{1};
constexpr auto device_id = size_t{0};
auto platform = CLCudaAPI::Platform(platform_id);
auto device = CLCudaAPI::Device(platform, device_id);
auto context = CLCudaAPI::Context(device);
