On NVIDIA CUDA hardware, up to 4 warps (depending on the hardware generation) can issue instructions on a streaming multiprocessor (SM) at a time. Because instructions within a warp execute in order, whenever a warp reaches an unsatisfied dependency (such as waiting for a memory load), the warp is considered stalled. To prevent the SM from idling while a warp is stalled, the scheduler checks all active warps assigned to it and switches to a warp that is ready to execute.

Multiprocessor occupancy is the ratio of active warps assigned to an SM to the maximum number of active warps the SM supports. To hide latency from data dependencies and improve performance, one of the first optimization steps is to increase occupancy on the SM. Occupancy for a kernel can be limited by block size, shared memory usage, and register usage, as well as by the hardware limits of the device. The overall occupancy is determined by the lowest of these limits.
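As a minimal sketch (ignoring the per-resource limits for the moment), the occupancy ratio itself is simply:

```python
# Occupancy = active warps resident on the SM divided by the SM's maximum.
def occupancy(active_warps, max_warps_per_sm):
    return active_warps / max_warps_per_sm

# 32 resident warps on an SM that supports 64 active warps:
print(occupancy(32, 64))  # 0.5
```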

To calculate the theoretical occupancy of a kernel, we can either work through the calculation for each resource by hand, or use one of the occupancy calculators available. NVIDIA provides a calculator in the form of a spreadsheet, available in the tools folder of the CUDA Toolkit. Acceleware provides a handy one on our website, accessible from http://training.acceleware.com/calculator. The calculators allow users to input the hardware they are targeting and the resource usage of their kernel. The calculator displays the active warp limit for each category, as well as charts showing how occupancy changes when certain parameters are adjusted.

The online calculator provided by Acceleware uses the CUDA Toolkit’s standalone occupancy calculation functions from cuda_occupancy.h directly, rather than reimplementing the logic to mimic the behaviour of the worksheet. This means that while the output of our calculator does not exactly match the theoretical values from the worksheet, it more closely reflects the actual behaviour of the CUDA Toolkit. For example, at the time of writing (CUDA 9.2), the documentation and worksheet suggest that occupancy for compute capability 6.0 devices is calculated with a warp allocation granularity of 2, but our calculator reflects the actual scheduling granularity of 4. As another example, warp occupancy does not drop to 0 when the kernel’s shared memory usage exceeds the lower shared memory configuration sizes, because the toolkit ignores the limit rather than failing to launch as the worksheet suggests.
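To illustrate why the allocation granularity matters: resources are allocated in multiples of the granularity, so per-block warp counts are rounded up before the limits are applied. A small sketch of that rounding rule (simplified from the toolkit’s internal logic, which handles more cases than this):

```python
# Round a warp count up to the next multiple of the allocation granularity.
def round_up(warps, granularity):
    return -(-warps // granularity) * granularity  # ceiling division

# A block of 6 warps allocates as 6 warps with granularity 2,
# but as 8 warps with granularity 4, changing the computed limits.
print(round_up(6, 2))  # 6
print(round_up(6, 4))  # 8
```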

Occupancy Example

To determine your occupancy, you will need to set the following parameters:

  1. The compute capability (CC) of the device.  This is determined by the GPU.  For example, a P100 is a CC 6.0 device. You can look up the compute capability of your device at https://developer.nvidia.com/cuda-gpus.
  2. Threads per block.  This is defined by you during the kernel launch as the second argument in the launch syntax. If using a 2 or 3-dimensional block size, multiply the dimensions together to get the total number of threads per block.
  3. Registers per thread.  This is generated at compile time for each kernel.  You can determine this number using the NVIDIA profiler or by compiling with verbose output (--ptxas-options=-v).
  4. Shared memory per block. This parameter is defined by you and is either statically or dynamically assigned.

Assume you have a P100 (CC 6.0) with a kernel that uses:

  • 128 threads per block
  • 12 registers per thread
  • 8192 bytes (8 KB) of shared memory

The occupancy calculator suggests that you have an occupancy of 50%: shared memory limits the SM to 8 resident blocks (64 KB per SM ÷ 8 KB per block), and at 4 warps per block that is 32 of the 64 possible active warps.  If the occupancy in this case is not at a sufficient level to hide latency, you could increase the threads per block to 256, 512, or 1024 to achieve 100% occupancy, as shown on the calculator's curves.
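The arithmetic behind this example can be sketched in a few lines. This is a simplified model assuming the CC 6.0 limits (64 active warps, 32 blocks, and 64 KB of shared memory per SM), and it ignores registers and allocation granularity, which are not the binding limits at 12 registers per thread:

```python
MAX_WARPS_PER_SM = 64       # CC 6.0 limit
MAX_BLOCKS_PER_SM = 32      # CC 6.0 limit
SHARED_MEM_PER_SM = 65536   # 64 KB per SM on a P100
WARP_SIZE = 32

def theoretical_occupancy(threads_per_block, smem_per_block):
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Each resource caps the number of blocks resident on the SM.
    limit_by_warps = MAX_WARPS_PER_SM // warps_per_block
    limit_by_smem = (SHARED_MEM_PER_SM // smem_per_block
                     if smem_per_block else MAX_BLOCKS_PER_SM)
    blocks = min(MAX_BLOCKS_PER_SM, limit_by_warps, limit_by_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

print(theoretical_occupancy(128, 8192))  # 0.5  (the 50% from the example)
print(theoretical_occupancy(256, 8192))  # 1.0  (after raising the block size)
```

With 128 threads per block, shared memory is the binding limit (8 blocks); raising the block size to 256 threads puts 8 warps in each of those 8 blocks, filling all 64 warp slots.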

References

For more information on occupancy, watch our video on optimization techniques at https://www.youtube.com/watch?v=_f31hvfBv4s.

You can also refer to the programming guide at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multiprocessor-level and check out the achieved occupancy experiment at https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm.