Cooperative Groups was introduced in CUDA 9 to provide a flexible model for synchronization and communication between groups of threads executing CUDA kernels.  Mark Harris and Kyrylo Perelygin have an excellent introduction to Cooperative Groups on the NVIDIA Developer Blog.

CUDA has always supported barrier synchronization of all the threads in a thread block through the __syncthreads() intrinsic. Cooperative Groups adds the ability to synchronize across thread blocks by synchronizing all the threads in a grid. Grid synchronization is supported on Compute Capability 6.0 (Pascal) and higher GPUs. If you are running on Windows, your GPU must additionally be running in Tesla Compute Cluster (TCC) driver mode.

To demonstrate the utility of grid synchronization, I use it to normalize an input array of integers to the range [-1, 1] by dividing each element of the input by the maximum absolute value of the input array.  The following source code listing contains three different normalization implementations, which we compare below.

Normalize.cu

The code must be compiled to generate relocatable device code and must target GPUs of Compute Capability 6.0 or higher, e.g.:

nvcc -rdc=true -arch=sm_60 -o normalize normalize.cu

Two-Pass Normalization

Without grid synchronization, thread blocks are only synchronized upon completion of a kernel, necessitating a two-kernel approach to normalization: one kernel to find the maximum absolute value, followed by a second kernel to multiply each element by the inverse of that maximum.

The FindMaximum kernel requires one thread per input element. It demonstrates the use of Cooperative Groups' partitioning mechanism to perform a warp-level shuffle reduction that finds the maximum absolute value within each warp, after which a single thread of each warp performs an atomicMax() operation to produce the global maximum.
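The full implementation is in the listing above; the following is a minimal sketch of what the two kernels of the two-pass approach might look like. The FindMaximum name follows the post, but the exact signatures, the name of the second kernel (Normalize), and the assumption that the result buffer is zero-initialized by the host are mine.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Pass 1: one thread per element. A warp-level shuffle reduction finds each
// warp's maximum absolute value; lane 0 of every warp then folds its partial
// result into the global maximum with atomicMax(). *maxAbs is assumed to be
// zero-initialized by the host before launch.
__global__ void FindMaximum(const int *input, int *maxAbs, int n)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (idx < n) ? abs(input[idx]) : 0;

    // Shuffle reduction within the warp.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        val = max(val, warp.shfl_down(val, offset));

    if (warp.thread_rank() == 0)
        atomicMax(maxAbs, val);
}

// Pass 2: scale each element by the inverse of the maximum absolute value.
__global__ void Normalize(const int *input, float *output, const int *maxAbs, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = input[idx] * (1.0f / *maxAbs);
}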

One-Pass Normalization Via Grid Synchronization

Using Cooperative Groups, we can merge the two steps of the normalization process into a single kernel: find the maximum absolute value, synchronize all the threads in the grid so that no thread reads the maximum before every block has finished updating it, and then scale by the inverse of the maximum absolute value.
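A minimal sketch of what such a kernel might look like follows. It uses the grid-stride loops discussed in the next paragraph; the signature and the zero-initialized maxAbs buffer are assumptions on my part.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One-pass normalization: find the maximum absolute value, synchronize the
// whole grid, then scale. *maxAbs is assumed to be zero-initialized by the host.
__global__ void GridSynchronizationNormalize(const int *input, float *output,
                                             int *maxAbs, int n)
{
    cg::grid_group grid = cg::this_grid();
    cg::thread_block_tile<32> warp =
        cg::tiled_partition<32>(cg::this_thread_block());

    // Grid-stride loop: each thread accumulates a local maximum.
    int localMax = 0;
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        localMax = max(localMax, abs(input[i]));

    // Warp-level shuffle reduction, then one atomicMax() per warp.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        localMax = max(localMax, warp.shfl_down(localMax, offset));
    if (warp.thread_rank() == 0)
        atomicMax(maxAbs, localMax);

    // Every block must finish updating maxAbs before any thread reads it.
    grid.sync();

    // Grid-stride loop: scale by the inverse of the maximum absolute value.
    float scale = 1.0f / *maxAbs;
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        output[i] = input[i] * scale;
}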

Note that to enable grid synchronization, kernels must be launched via the host-side cudaLaunchCooperativeKernel API instead of the more familiar <<<...>>> execution configuration syntax. Additionally, grids must be sized appropriately to guarantee that all the blocks in the grid are resident on the GPU's streaming multiprocessors (SMs). An upper bound for the grid size can be determined by programmatically querying the number of blocks that can reside on a single SM via the cudaOccupancyMaxActiveBlocksPerMultiprocessor API call and then multiplying by the number of SMs on the GPU. Given this constraint on grid size, using the same one-thread-per-element strategy as the two-pass normalization approach would also limit the maximum size of the input array. To maintain flexibility, the GridSynchronizationNormalize kernel instead uses grid-stride loops.
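Here is a minimal sketch of what the host-side launch might look like, assuming the kernel signature from the sketch above; the helper name LaunchGridSyncNormalize and the choice of 256 threads per block are mine.

#include <cuda_runtime.h>

// Hypothetical launch helper. d_input, d_output, and d_maxAbs are assumed to be
// device pointers already allocated by the caller; n is the element count.
void LaunchGridSyncNormalize(int *d_input, float *d_output, int *d_maxAbs, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  GridSynchronizationNormalize,
                                                  threadsPerBlock,
                                                  0 /* no dynamic shared memory */);

    // Upper bound on a co-resident grid: blocks per SM times the number of SMs.
    dim3 gridDim(blocksPerSM * prop.multiProcessorCount);
    dim3 blockDim(threadsPerBlock);

    // Arguments are passed as an array of pointers to each kernel parameter.
    void *args[] = { &d_input, &d_output, &d_maxAbs, &n };
    cudaLaunchCooperativeKernel((void *)GridSynchronizationNormalize,
                                gridDim, blockDim, args,
                                0 /* sharedMem */, 0 /* stream */);
}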

Performance

The table below shows the relative performance of the two approaches on a Tesla P100 GPU.

Implementation                    Execution Time (µs)
Two-pass                          46.8 (27.4 + 19.4)
GridSynchronizationNormalize      34.4

GridSynchronizationNormalize is faster than the two-pass approach.  However, it still does more work than necessary: the input is read from global memory twice, once to find the maximum and once to scale.  The advantage of using grid synchronization is that it allows stateful approaches in which data persists in registers and/or shared memory across the synchronization, something that isn't possible in the two-pass implementation.

One-Pass Normalization Via Stateful Grid Synchronization

The GridSynchronizationStatefulNormalize kernel uses shared memory to reduce global memory traffic.  The first part of the kernel finds the maximum absolute value while also explicitly caching the input elements in shared memory.  After the grid synchronization, the scaling step reads the input values directly from shared memory instead of rereading them from global memory as the other kernels do.  A sketch of this approach appears below, and the table that follows shows the performance improvement it delivers.
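A minimal sketch of what the stateful kernel might look like, again with an assumed signature and a zero-initialized maxAbs buffer; the host is assumed to size the dynamic shared memory allocation to hold each block's share of the input.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void GridSynchronizationStatefulNormalize(const int *input, float *output,
                                                     int *maxAbs, int n)
{
    // Dynamic shared memory, sized by the host to hold this block's share of the input.
    extern __shared__ int cache[];

    cg::grid_group grid = cg::this_grid();
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Pass 1: grid-stride loop that caches each element in shared memory
    // while accumulating this thread's maximum absolute value.
    int localMax = 0;
    int slot = block.thread_rank();
    for (int i = grid.thread_rank(); i < n; i += grid.size(), slot += block.size()) {
        int v = input[i];
        cache[slot] = v;
        localMax = max(localMax, abs(v));
    }

    // Warp-level shuffle reduction, then one atomicMax() per warp.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        localMax = max(localMax, warp.shfl_down(localMax, offset));
    if (warp.thread_rank() == 0)
        atomicMax(maxAbs, localMax);

    grid.sync();

    // Pass 2: re-read the cached elements from shared memory rather than global memory.
    float scale = 1.0f / *maxAbs;
    slot = block.thread_rank();
    for (int i = grid.thread_rank(); i < n; i += grid.size(), slot += block.size())
        output[i] = cache[slot] * scale;
}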

Implementation                          Execution Time (µs)
Two-pass                                46.8 (27.4 + 19.4)
GridSynchronizationNormalize            34.4
GridSynchronizationStatefulNormalize    25.4

The improved performance of the stateful approach comes at a price: the input elements must fit within the GPU's aggregate shared memory capacity.  On a Tesla P100, the input array can contain no more than 56 SMs * 64 KB/SM / 4 bytes/element = 917504 elements.  On a Tesla V100, the largest array we can support increases to 80 SMs * 96 KB/SM / 4 bytes/element = 1966080 elements.  These size limitations make the GridSynchronizationStatefulNormalize kernel less scalable than the alternatives.  The stateful approach could be employed as part of an optimized path for smaller data sets, with automatic fallback to the stateless approach for larger data sets, as sketched below.
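A rough sketch of such a dispatch check, using the hypothetical helper name CanUseStatefulPath; it only tests the aggregate-capacity bound described above and ignores per-block shared memory limits and occupancy, so treat it as an upper-bound test rather than a complete implementation.

#include <cuda_runtime.h>

// Returns true when an n-element int array fits within the GPU's aggregate
// shared memory, i.e. when the stateful kernel can be used; otherwise the
// caller should fall back to the stateless GridSynchronizationNormalize kernel.
bool CanUseStatefulPath(int n, int device = 0)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    size_t aggregateSharedMem =
        (size_t)prop.multiProcessorCount * prop.sharedMemPerMultiprocessor;
    return (size_t)n * sizeof(int) <= aggregateSharedMem;
}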

Potential Mistakes When Developing Grid Synchronization Code

One of the strengths of the CUDA C programming model compared to other heterogeneous models like OpenCL, Vulkan, or even the CUDA Driver API is that host and device code are combined in a single source file.  This advantage is especially evident when using CUDA's <<<...>>> kernel launch operator, which facilitates compile-time error checking to ensure that kernels are invoked with the expected number of arguments and that each argument is of the expected type.  No such compile-time checking is performed when using the cudaLaunchCooperativeKernel API call to launch a kernel that uses grid synchronization.  Beware that passing an incorrect number of arguments, or arguments of the wrong type, through the void** args parameter results in undefined runtime behavior, not a compile-time error!  A quick experiment omitting the final argument to GridSynchronizationNormalize resulted in a segmentation fault.
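For illustration, using the variable names from the launch sketch above, both of the following arrays compile without complaint; only the first matches the kernel's four-parameter signature.

// Correct: one pointer per kernel parameter, in declaration order.
void *goodArgs[] = { &d_input, &d_output, &d_maxAbs, &n };

// Also compiles, but omits the final parameter 'n': undefined behavior at
// run time (a segmentation fault in the experiment described above).
void *badArgs[] = { &d_input, &d_output, &d_maxAbs };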

cudaLaunchCooperativeKernel does validate its gridDim, blockDim, and sharedMem arguments to ensure that all blocks of the grid can be resident on the GPU's SMs. If they cannot, it returns the error code cudaErrorCooperativeLaunchTooLarge (82), with the corresponding error string "too many blocks in cooperative launch".
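It is therefore worth checking the return value of every cooperative launch. A minimal sketch, reusing the gridDim, blockDim, and args variables from the launch sketch above:

cudaError_t err = cudaLaunchCooperativeKernel((void *)GridSynchronizationNormalize,
                                              gridDim, blockDim, args,
                                              0 /* sharedMem */, 0 /* stream */);
if (err != cudaSuccess) {
    // Prints "too many blocks in cooperative launch" when the grid is oversized.
    fprintf(stderr, "cooperative launch failed: %s\n", cudaGetErrorString(err));
}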