Dan’s previous post, “Grid Synchronization with Cooperative Groups,” demonstrated the new synchronization model introduced in CUDA 9 by normalizing an input array of integers with a single CUDA kernel instead of two.  Essentially, the algorithm requires two steps: finding the element with the maximum absolute value, then dividing all elements by that maximum.  Prior to CUDA 9 these steps had to run as separate kernels, but with Cooperative Groups both can be combined in one kernel with a grid-level synchronization between them.
The single-kernel version ended up faster than the conventional pre-CUDA 9 two-pass approach.  Additional speedup came from explicitly caching the input elements read during the first step in shared memory and re-using them during the second step, although this limited the number of elements that could be handled on a single device.
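For reference, here is a minimal sketch of that single-kernel structure (without the shared-memory caching). This is not Dan’s exact code; it assumes the kernel is launched with cudaLaunchCooperativeKernel so that grid.sync() is valid, and that *maxAbs starts at zero.

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch only: one kernel finds the maximum absolute value, then every thread
// divides by it after a grid-wide synchronization.
__global__ void normalize(float *out, const int *in, int n, int *maxAbs)
{
    cg::grid_group grid = cg::this_grid();

    // Step 1: fold every element into a single device-wide maximum.
    for (size_t i = grid.thread_rank(); i < (size_t)n; i += grid.size())
        atomicMax(maxAbs, abs(in[i]));

    grid.sync();   // all blocks now see the final maximum

    // Step 2: divide each element by that maximum.
    for (size_t i = grid.thread_rank(); i < (size_t)n; i += grid.size())
        out[i] = static_cast<float>(in[i]) / *maxAbs;
}
```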
Cooperative Groups aren’t limited to a single device, though: CUDA 9 also introduced multi-device synchronization, which I experimented with to understand its performance implications.

Download Source Code

Download Multi Device Normalization Results

Implementation

For this experiment, I extended the normalization problem to two GPU devices and wrote seven implementations to compare their performance.  The input array can simply be split across the two GPUs; the primary problem becomes how to coordinate the overall maximum.
The first implementation used a “naïve” two-pass approach: it allocated an integer in global memory on the first GPU to hold the running maximum and had both GPUs update it atomically.  Standard CUDA atomics are only atomic with respect to the GPU they execute on, so the system-scoped atomicMax_system() function was required to avoid race conditions.  The maximum was then passed to the kernel that performed the division.  The goal of this implementation was to establish the performance of system-level atomics.
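A minimal sketch of that first pass is shown below (names are illustrative, not the exact source). Both GPUs run this kernel over their half of the input and fold their values into a single integer residing in GPU 0’s global memory; peer access between the devices is assumed to be enabled.

```
// First pass of Implementation 1 (sketch). systemMax points at an int in GPU 0's
// global memory; atomicMax_system() makes the update atomic across all devices
// (and the host), but every update from GPU 1 crosses the interconnect.
__global__ void maxAbsSystem(const int *in, int n, int *systemMax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax_system(systemMax, abs(in[i]));
}
```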

Implementations 2 and 3 were also two-pass, but allocated an integer in global memory on each device, and each GPU atomically updated its own maximum using the standard atomicMax() function.  Implementation 2 then copied the two maximums back to the host, where the overall maximum was found before being passed to the division kernel.  Implementation 3 instead modified the division kernel so that the two maximums were compared by a single thread in each thread block and the overall maximum stored in shared memory before the division occurred.  Avoiding atomicMax_system() was the goal of both implementations; Implementation 3 had the additional goal of removing the device-to-host transfer of the maximum.
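A sketch of how Implementation 3’s division kernel might look (illustrative names, not the exact source; it assumes peer access is enabled so each device can read the other device’s maximum):

```
// Division kernel of Implementation 3 (sketch): one thread per block combines
// the two per-device maximums into shared memory, then the block divides.
__global__ void divideByMax(float *out, const int *in, int n,
                            const int *maxDev0, const int *maxDev1)
{
    __shared__ int overallMax;
    if (threadIdx.x == 0)
        overallMax = max(*maxDev0, *maxDev1);   // one of these is a peer-to-peer read
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = static_cast<float>(in[i]) / overallMax;
}
```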

Implementation 4 allocated an integer in managed memory for each device and relied on the page-fault mechanism to transfer the maximums back to the host, where the overall maximum was found and then passed to the division kernel.  The objective was to benchmark managed memory page-fault performance.  To improve performance, cudaMemAdvise() was used to tell the managed memory system where the memory should reside, that it would be mostly read, and which devices would access it.
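The hints looked roughly like the following host-side sketch (illustrative names, error checking omitted); one managed integer per device holds that device’s maximum:

```
int *devMaxes[2];                       // one managed int per device (illustrative)
for (int dev = 0; dev < 2; ++dev) {
    cudaMallocManaged(&devMaxes[dev], sizeof(int));
    // Prefer to keep the allocation on the device that writes it...
    cudaMemAdvise(devMaxes[dev], sizeof(int), cudaMemAdviseSetPreferredLocation, dev);
    // ...advise that it will mostly be read after the first pass...
    cudaMemAdvise(devMaxes[dev], sizeof(int), cudaMemAdviseSetReadMostly, dev);
    // ...and declare who will access it so mappings can be set up in advance.
    cudaMemAdvise(devMaxes[dev], sizeof(int), cudaMemAdviseSetAccessedBy, dev);
    cudaMemAdvise(devMaxes[dev], sizeof(int), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
}
```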

Implementation 5 allocated per-device global memory, a per-device warp counter, and per-device host pinned memory to store each device’s maximum.  atomicMax() was used to compute the maximum on each device.  Each warp that updated the device maximum also atomically incremented a counter, and the last warp to execute copied the maximum from global memory into the host pinned memory.  The two host values were then compared on the CPU and the result passed to the division kernel.  The goal was to examine the trade-off between reducing the amount of data explicitly transferred (4 bytes versus a full page) and the cost of the second atomic operation.
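A rough sketch of the idea behind that first pass (simplified: here every warp increments the completion counter, and names such as warpsDone are illustrative):

```
// Per-device first pass of Implementation 5 (sketch). devMax and warpsDone live
// in this device's global memory; pinnedMax points at host pinned memory.
__global__ void maxAbsToPinned(const int *in, int n, int *devMax,
                               unsigned *warpsDone, unsigned totalWarps,
                               int *pinnedMax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(devMax, abs(in[i]));

    __threadfence();                            // make the update visible device-wide

    // Lane 0 of each warp reports completion; the last warp to finish pushes the
    // 4-byte result straight into pinned host memory (instead of a cudaMemcpy).
    if ((threadIdx.x & 31) == 0) {
        if (atomicAdd(warpsDone, 1u) == totalWarps - 1)
            *pinnedMax = atomicAdd(devMax, 0);  // coherent read of the final maximum
    }
}
```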

Implementation 6 performed three passes.  The first kernel computed per-device maximums into per-device global memory.  The second kernel launched only a single thread per device to write the maximum of the two maximums back to device memory.  The third kernel performed the division.  This implementation was written to evaluate whether leaving the maximums in device memory, without ever returning them to the host, resulted in better performance.
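The middle pass is tiny; it might look like the following single-thread kernel (illustrative, assuming peer access is enabled):

```
// Second pass of Implementation 6 (sketch): one thread per device combines the
// two per-device maximums. remoteMax is read over peer-to-peer; overallMax stays
// in this device's global memory for the division kernel to use.
__global__ void combineMax(const int *localMax, const int *remoteMax, int *overallMax)
{
    *overallMax = max(*localMax, *remoteMax);
}
```

It would be launched with a single thread (<<<1, 1>>>) on each device.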
The last implementation used a multi-device kernel launch and explicit caching in shared memory.  Each GPU tracked its maximum in its own global memory.  After a multi-grid synchronization across both devices, the first thread on each device computed the inverse of the larger of the two device maximums and stored it in its own global memory.  After a grid synchronization (across the blocks on that device only), each thread performed the normalization.
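A rough sketch of that kernel’s structure (not the exact source; it assumes a dynamic shared memory allocation of blockDim.x integers, peer access enabled, and a launch via cudaLaunchCooperativeKernelMultiDevice):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Implementation 7 sketch: per-device maximum, multi-device sync, per-device
// inverse, device-local grid sync, then normalization from the shared-memory cache.
__global__ void normalizeMultiDevice(float *out, const int *in, int n,
                                     int *myMax, const int *peerMax, float *invMax)
{
    cg::grid_group grid        = cg::this_grid();
    cg::multi_grid_group multi = cg::this_multi_grid();

    extern __shared__ int cache[];                 // blockDim.x ints per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cache[threadIdx.x] = in[i];                // cache the element for step 2
        atomicMax(myMax, abs(cache[threadIdx.x])); // fold into this device's maximum
    }

    multi.sync();                                  // both devices have finished step 1

    if (grid.thread_rank() == 0)                   // one thread per device
        *invMax = 1.0f / max(*myMax, *peerMax);    // peer maximum read over P2P

    grid.sync();                                   // blocks on this device see invMax

    if (i < n)
        out[i] = cache[threadIdx.x] * *invMax;     // normalize from the cached value
}
```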

Results

Each implementation was timed over five iterations with the NVIDIA Visual Profiler.  The listed timings measure only the execution time from the start of the first associated kernel to the end of the last associated kernel; they do not include the time required to allocate device memory, copy input data to the GPU, copy output back to the host, deallocate device memory, or validate results.  For runs involving multiple devices, the earlier of the two devices’ start times was used as the start time and the later of the two finish times as the finish time.  The hardware was a CentOS 7.4.1708 machine with two NVIDIA Tesla P100s running CUDA driver 387.26.  The number of elements was fixed at 917504, the maximum that a single GPU can handle with the cached single-pass version from Dan’s original post.

 

| Implementation | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Min | Avg | Max |
|---|---|---|---|---|---|---|---|---|
| Baseline 1 (Single GPU Two Pass) | 51.135 | 50.688 | 50.336 | 50.335 | 50.527 | 50.335 | 50.6042 | 51.135 |
| Baseline 2 (Single GPU One Pass Cached) | 25.44 | 23.808 | 23.808 | 24.095 | 24.159 | 23.808 | 24.262 | 25.44 |
| Implementation 1 (Two Pass Atomic System) | 14780.94 | 14728.58 | 14740.34 | 14716.08 | 14719.84 | 14716.08 | 14737.16 | 14780.94 |
| Implementation 2 (Two Pass Host Max) | 169.925 | 156.799 | 154.327 | 106.705 | 120.042 | 106.705 | 141.5596 | 169.925 |
| Implementation 3 (Two Pass Threadblock Max) | 187.967 | 228.412 | 188.862 | 189.013 | 184.785 | 184.785 | 195.8078 | 228.412 |
| Implementation 4 (Two Pass Managed Max) | 364.668 | 284.729 | 268.286 | 248.958 | 246.542 | 246.542 | 282.6366 | 364.668 |
| Implementation 5 (Two Pass Double Atomic) | 148.646 | 129.312 | 135.996 | 126.774 | 124.081 | 124.081 | 132.9618 | 148.646 |
| Implementation 6 (Three Pass) | 158.312 | 180.294 | 150.403 | 132.255 | 126.399 | 126.399 | 149.5326 | 180.294 |
| Implementation 7 (One Pass Cached) | 112.511 | 65.454 | 154.399 | 66.886 | 103.074 | 65.454 | 100.4648 | 154.399 |

Time (usec) to Normalize 917504 elements on two NVIDIA Tesla P100s

 

[Chart: Time to Normalize]

As expected, Implementation 1 (Two Pass Atomic System) performed extremely slowly and had to be omitted from the chart to avoid scaling issues.  Of the four remaining two-pass implementations, managed memory performed the worst, even with cudaMemAdvise() hints.  Profiling showed only a small number of CPU and GPU page faults, yet the other three implementations, which were fault-free, performed noticeably faster.  Computing the maximum of the two maximums on the host (Implementation 2) ended up faster than using a third kernel to compute the maximum on the GPU (Implementation 6).

The fastest multi-GPU implementation was the multi-device kernel launch with explicit caching (Implementation 7), which would not have been possible prior to CUDA 9.  The results for Implementation 7 also varied quite noticeably: some iterations ran twice as fast as others.  According to the profiler, the key difference between these runs was an occasional delay between when the kernel was launched on the first device and on the second.  For example, in the first iteration (112 usec) there was roughly a 50 usec delay before the kernel launched on the second GPU, while in the second iteration (65 usec) both GPUs launched the kernel nearly simultaneously.

Lessons Learned

The new Cooperative Groups in CUDA 9 add programming flexibility by permitting synchronization across groups of thread blocks; however, moving to synchronization across multiple devices can incur non-trivial performance costs from peer-to-peer memory access and multi-device synchronization.

  • Avoid system-level atomics if possible.
  • Multi-device kernel performance can vary because the kernel does not always launch on all devices simultaneously.
  • The grid size, block size, and shared memory size must be identical for all devices in a multi-device kernel launch (see the launch sketch after this list).
    • Even though the grid size is the same on each device, each GPU operates on its own grid of that size (e.g. launching a grid of 100 blocks on 2 devices results in 200 total blocks, not 50 blocks per device).
  • Managed memory is easy to use because the system takes care of data migration between devices, but explicit transfers still perform faster, especially when the data being transferred is much smaller than a page.
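As an illustration of the launch-related points above, a multi-device launch might be set up like this (a hypothetical sketch: streams, kernelArgs, and the 100-block grid are placeholders, not the values used in these experiments):

```
// Every entry must use the same grid, block, and shared memory size.
// With gridDim = 100 on 2 devices, 200 blocks run in total (100 per device).
cudaLaunchParams params[2];
for (int dev = 0; dev < 2; ++dev) {
    params[dev].func      = (void *)normalizeMultiDevice;
    params[dev].gridDim   = dim3(100);
    params[dev].blockDim  = dim3(256);
    params[dev].sharedMem = 256 * sizeof(int);     // matches the extern __shared__ cache
    params[dev].stream    = streams[dev];          // per-device non-default stream
    params[dev].args      = kernelArgs[dev];       // per-device kernel argument pointers
}
cudaLaunchCooperativeKernelMultiDevice(params, 2);
```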