We received an interesting question during one of our webinar presentations (GPU Architecture and the CUDA Memory Model):

"What happens if enough shared memory is not available? Does the block stall until it becomes available or it uses global memory meanwhile?"


A couple of things determine what happens with shared memory and its size. You as the programmer declare the size of your shared memory allocation. You can do this statically (at compile time) or dynamically (at run time). If you statically declare a shared memory allocation that exceeds the limit per thread block (48KB), the NVIDIA compiler will generate an error and your program will not compile. If you dynamically request more shared memory than is available, your GPU kernel will not launch and an error code will be returned to your program, although you have to check for that error code yourself. You can also combine multiple static allocations with one dynamic allocation. If the sum of the allocations exceeds the shared memory limit per thread block, the GPU kernel will not launch. In every one of these cases the request fails outright; it neither stalls the block nor spills over into global memory.
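To make the launch-time failure concrete, here is a minimal sketch that mixes one static and one dynamic allocation and checks the launch status. The kernel `scaleKernel` and the helper `launchWithDynamicBytes` are hypothetical names invented for this illustration, and the sketch assumes a device with the default 48KB per-block limit and no opt-in to a larger one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one static array plus one dynamic allocation.
__global__ void scaleKernel(float *data, int n)
{
    __shared__ float staticBuf[256];       // static: 1KB, sized at compile time
    extern __shared__ float dynamicBuf[];  // dynamic: sized at launch time

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread touches only its own slot, so no barrier is needed.
        staticBuf[threadIdx.x] = data[i];
        dynamicBuf[threadIdx.x] = 2.0f * staticBuf[threadIdx.x];
        data[i] = dynamicBuf[threadIdx.x];
    }
}

// Hypothetical helper: launch with the requested dynamic size and return
// the launch status. Launch-configuration errors surface in cudaGetLastError().
cudaError_t launchWithDynamicBytes(float *d_data, int n, size_t dynamicBytes)
{
    scaleKernel<<<(n + 255) / 256, 256, dynamicBytes>>>(d_data, n);
    return cudaGetLastError();
}

int main()
{
    const int n = 1024;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 1KB static + 1KB dynamic: well within the per-block limit.
    cudaError_t small = launchWithDynamicBytes(d_data, n, 256 * sizeof(float));
    printf("small request: %s\n", cudaGetErrorString(small));

    // 1KB static + 64KB dynamic: over the default 48KB limit, so the
    // kernel does not launch and an error code comes back instead.
    cudaError_t large = launchWithDynamicBytes(d_data, n, 64 * 1024);
    printf("large request: %s\n", cudaGetErrorString(large));

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Without the `cudaGetLastError()` call the second launch would fail silently, which is why the status has to be checked explicitly.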

Finally, shared memory is allocated exclusively to a thread block to facilitate zero-overhead context switching. As a result, the total amount of shared memory on the SM limits the number of thread blocks that can be scheduled to run concurrently on that SM, which affects occupancy and, potentially, performance. For more details on occupancy and performance, see our optimization webinar.
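One way to see this effect is to ask the CUDA runtime how many blocks can be resident on an SM as the per-block shared memory request grows. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for that; the kernel `tileKernel` and the helper `residentBlocksPerSm` are hypothetical, and the actual numbers printed depend on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that reverses each block's elements through a
// dynamically sized shared memory tile.
__global__ void tileKernel(float *data)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: how many blocks of tileKernel can be resident on
// one SM for a given block size and dynamic shared memory request?
int residentBlocksPerSm(int threadsPerBlock, size_t dynamicBytes)
{
    int blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, tileKernel, threadsPerBlock, dynamicBytes);
    return blocks;
}

int main()
{
    const size_t sizes[] = {1024, 8 * 1024, 16 * 1024, 32 * 1024, 48 * 1024};
    for (size_t bytes : sizes) {
        // Larger per-block requests leave room for fewer resident blocks.
        printf("%6zu bytes per block -> %d resident blocks per SM\n",
               bytes, residentBlocksPerSm(128, bytes));
    }
    return 0;
}
```

The query only inspects the kernel's resource usage, so nothing is launched here. For small requests another resource (registers, threads, or the block limit per SM) is usually the cap; shared memory becomes the limiter as the request grows.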

As a follow-up to my earlier blog post, my engineering team pointed out that NVIDIA’s Volta architecture (compute capability 7.0) increases the amount of shared memory you can allocate per thread block. With static declarations the maximum allocation is still 48KB; however, you can access the full 96KB on Volta using dynamic allocation. This can be done with a single dynamic allocation of up to 96KB or a combination of static and dynamic allocations (e.g., 48KB static and 48KB dynamic). Additionally, you must explicitly increase the allocation limit using the cudaFuncSetAttribute call prior to your kernel launch as shown:
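A minimal sketch of that opt-in, assuming a Volta-class device: the kernel `bigTileKernel` and the helper `launchWithOptIn` are hypothetical names used only for illustration. On a GPU that cannot provide 96KB per block, the `cudaFuncSetAttribute` call itself is expected to return an error, which the sketch reports instead of launching.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses only dynamically allocated shared memory.
__global__ void bigTileKernel(float *out)
{
    extern __shared__ float tile[];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[blockDim.x - 1];
    }
}

// Hypothetical helper: raise this kernel's dynamic shared memory limit,
// then launch with that much dynamic shared memory.
cudaError_t launchWithOptIn(float *d_out, int dynamicBytes)
{
    cudaError_t err = cudaFuncSetAttribute(
        bigTileKernel,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        dynamicBytes);
    if (err != cudaSuccess) {
        return err;  // the device cannot provide this much per block
    }

    bigTileKernel<<<1, 256, dynamicBytes>>>(d_out);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();
}

int main()
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    // 96KB = 98304 bytes, the Volta maximum described above.
    cudaError_t err = launchWithOptIn(d_out, 96 * 1024);
    printf("96KB dynamic launch: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The attribute is set per kernel function, so each kernel that needs more than 48KB has to be opted in separately before its launch.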