Efficient CUDA Debugging: Memory Initialization and Thread Synchronization with NVIDIA Compute Sanitizer


NVIDIA Compute Sanitizer (NCS) is a powerful tool that can save you time and effort while improving the reliability and performance of your CUDA applications. 

In our previous post, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, we explored efficient debugging in the realm of parallel programming. We discussed how debugging code in the CUDA environment can be both challenging and time-consuming, especially when dealing with thousands of threads, and how NCS can help with this process. 

This post continues our exploration of efficient CUDA debugging. It highlights a few more NCS tools and walks through several examples. 

NVIDIA Compute Sanitizer

NCS is a suite of tools that can perform different types of checks on the functional correctness of your code. There are four main tools in NCS:

  • Memcheck for memory access error and leak detection
  • Racecheck, a shared memory data access hazard detection tool
  • Initcheck, an uninitialized device global memory access detection tool
  • Synccheck for thread synchronization hazard detection

In addition to these tools, NCS capabilities include:

  • An API to enable the creation of sanitizing and tracing tools that target CUDA applications
  • Integration with NVIDIA Tools Extension (NVTX) 
  • Coredump support for use with CUDA-GDB
  • Suppression features for managing the output of the tool

This post focuses on debugging code and catching bugs related to uninitialized device arrays using initcheck, and to synchronization using synccheck. See Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer for details about using memcheck to discover memory leaks and racecheck to find race conditions.

Initialization checking

NCS Initcheck helps developers identify and resolve uninitialized memory access errors in CUDA code. Uninitialized memory access can lead to unpredictable behavior and incorrect results in CUDA applications. 

NCS Initcheck can detect uninitialized memory accesses to global memory in device code and provides detailed information about the location and timing of the access, as well as the stack trace of the accessing thread. This helps to reveal the root cause of the issue and resolve the problem.

To provide an example, the code below benefits from initialization checking.


#include <cstdio>
#include <cstdlib>

#define THREADS 32
#define BLOCKS 2

__global__ void addToVector(float *v) {
  int tx = threadIdx.x + blockDim.x * blockIdx.x;
  v[tx] += tx;
}

int main(int argc, char **argv) {
  float *d_vec = NULL;
  float *h_vec = NULL;

  h_vec = (float *)malloc(BLOCKS * THREADS * sizeof(float));
  cudaMalloc((void **)&d_vec, sizeof(float) * BLOCKS * THREADS);
  cudaMemset(d_vec, 0, BLOCKS * THREADS); // Zero the array

  addToVector<<<BLOCKS, THREADS>>>(d_vec);

  cudaMemcpy(h_vec, d_vec, BLOCKS * THREADS * sizeof(float), cudaMemcpyDeviceToHost);
  printf("After : Vector 0, 1 .. N-1: %f %f .. %f\n",
         h_vec[0], h_vec[1], h_vec[BLOCKS * THREADS - 1]);

  cudaFree(d_vec);
  free(h_vec);
  return 0;
}


This code contains a CUDA kernel called addToVector that performs a simple add of a value to each element in a vector, with the results written back to the same element. At ‌first glance, it looks fine: allocate the vector on the device with cudaMalloc, then zero it with cudaMemset, then perform calculations in the kernel. It even prints out the correct answer:

$ nvcc -lineinfo initcheck_example.cu -o initcheck_example
$ ./initcheck_example
After : Vector 0, 1 .. N-1: 0.000000 1.000000 .. 63.000000

But the code contains a small mistake. (Twenty points if you can spot it.)

Use the NCS initcheck tool to check whether any of the accesses to the vector in global memory on the device are trying to read uninitialized values.

$ compute-sanitizer --tool initcheck ./initcheck_example
========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x70 in /home/pgraham/Code/BlogExamples/initcheck_example.cu:8:addToVector(float *)
=========     by thread (16,0,0) in block (0,0,0)

. . .

========= Uninitialized __global__ memory read of size 4 bytes
=========     at 0x70 in /home/pgraham/Code/BlogExamples/initcheck_example.cu:8:addToVector(float *)
=========     by thread (17,0,0) in block (0,0,0)
. . . 
After : Vector 0, 1 .. N-1: 0.000000 1.000000 .. 63.000000
========= ERROR SUMMARY: 48 errors

This should print a lot of information (the output shown has been edited for brevity), but something is not right. A large quantity of the output is backtrace information, which can be hidden using the --show-backtrace no option:

$ compute-sanitizer --tool initcheck --show-backtrace no ./initcheck_example

Looking at the output, you can see 48 errors in total. The report shows that they are all of the type, Uninitialized __global__ memory read of size 4 bytes.

Each message refers to an attempt to read something from global device memory, and that something had a size of 4 bytes. A reasonable guess would be that the errors refer to attempts to access elements of the vector, made up of floats that are 4 bytes each.

Looking at the first error, the next part of the message indicates which thread, and which thread block, caused the error. In this case, it was thread 16 in block 0. As the kernel is set up so each thread accesses a different element of the vector, element 17 of the vector, d_vec[16], was uninitialized.

In your output, you may see a different thread as the first one causing an error. The GPU can schedule warps (groups of 32 threads) in whatever order it sees fit. But check through the rest of the output, and convince yourself that the lowest element in the vector causing an error was element 17 (thread 16 from block 0).

Next, look at the line of code that initialized (or should have initialized) the array:

cudaMemset(d_vec, 0, BLOCKS * THREADS); // Zero the array

Checking the definition of cudaMemset, it takes three arguments: the pointer to the device memory you want to set (d_vec in this case), the value to which each byte in that memory region should be set (0 in this case), and the number of bytes to set (BLOCKS * THREADS). 

Now the problem becomes clear. The vector contains 64 elements (BLOCKS * THREADS), but each element is a 4-byte float, so the entire vector is 256 bytes long. cudaMemset was initializing only the first 64 bytes (the first 16 elements), leaving the remaining 192 bytes (equivalent to 48 elements) uninitialized. These 48 elements correspond to the 48 errors.

This ties in with the observation that element 17 (thread 16, block 0) was the first to cause an error. Bingo, problem found. 

To fix the problem, change the cudaMemset call: 

cudaMemset(d_vec, 0, sizeof(float) * BLOCKS * THREADS);

And check to make sure that the sanitizer is happy.

Checking unused memory

Another feature of the initcheck tool is identifying allocated device memory that hasn't been accessed by the end of the application. In some programs this is deliberate (using a large static buffer to handle a range of potential problem sizes, for example). But when unused memory is more likely the symptom of a bug, initcheck can flag it, as shown below. 


#define N 10

__global__ void initArray(float *array, float value) {
  int threadGlobalID = threadIdx.x + blockIdx.x * blockDim.x;
  if (threadGlobalID < N)
    array[threadGlobalID] = value;
}

int main() {
  float *array;

  const int numThreadsPerBlock = 4;
  const int numBlocks = 2;

  cudaMalloc((void **)&array, sizeof(float) * N);

  initArray<<<numBlocks, numThreadsPerBlock>>>(array, 3.0f);
  cudaDeviceSynchronize();

  return 0;
}

This very basic code reveals the potential error: it initializes an array, but the number of threads and the number of blocks are hard-coded. The execution configuration will launch a grid of eight threads, while the dataset has 10 elements, so the last two elements will go unwritten.

Check this using the track-unused-memory option. Note that the required syntax depends on the CUDA version in use. For versions before 12.3, supply the argument yes:

--track-unused-memory yes

Beginning with version 12.3, it is not necessary to supply an argument, as shown below:

$ nvcc -o unused -lineinfo unused.cu
$ compute-sanitizer --tool initcheck --track-unused-memory ./unused
=========  Unused memory in allocation 0x7fe0a7200000 of size 40 bytes
=========     Not written 8 bytes at offset 0x20 (0x7fe0a7200020)
=========     20% of allocation were unused.
========= ERROR SUMMARY: 1 error

Clearly, track-unused-memory indicates that the array of 40 bytes (10 x 4-byte floats) includes 8 bytes that were not written to. Use the array address (the first long 0x… number) and the offset (0x20, which is 32 in decimal, so 32 bytes or 8 floats along) to see which bytes were unused. As expected, the ninth and tenth floats in the array were never written.

To fix this, use N to define numBlocks:

const int numBlocks = (N + numThreadsPerBlock - 1) / numThreadsPerBlock;

Note that --track-unused-memory is designed to work for device memory assigned with cudaMalloc. The feature doesn’t work for unified memory (cudaMallocManaged allocated memory, for example).

Synchronization checking

The capability to synchronize threads at a variety of levels (beyond just block and warp) is a powerful CUDA feature, enabled by the Cooperative Groups programming model. Cooperative Groups provides device code APIs for defining, partitioning, and synchronizing groups of threads, giving much more flexibility and control compared to the standard __syncthreads function, which synchronizes all the threads in a block. For more details, see Cooperative Groups: Flexible CUDA Thread Programming.

Yet, this capability comes with greater opportunities to introduce bugs. This is where NCS synccheck can help to identify and resolve synchronization errors in CUDA code. synccheck can identify whether a CUDA application is correctly using synchronization primitives and their Cooperative Groups API counterparts.

One interesting use of synchronization is the application of a mask to a warp of threads. Set up the warp so that some threads are true and others are false, enabling each thread to individually perform different operations depending on that property. For more details, see Using CUDA Warp-Level Primitives.

A useful function to help with this is __ballot_sync defined as:

unsigned int __ballot_sync(unsigned int mask, int predicate);

mask is an initial mask, typically created with all bits set to 1, representing that all threads in the warp are initially active. predicate is a condition evaluated by each thread, where predicate evaluates to either true (non-zero) or false (zero) for each thread.

The ballot function evaluates the predicate for each thread in the warp and returns a mask representing each thread's outcome. It also provides a synchronization point: all threads in the warp must reach the __ballot_sync before any of them can proceed further. 

For example, set up a mask where even threads in the warp are true and odd threads are false:

__ballot_sync(0xffffffff, threadID % 2 == 0);

The initial mask 0xffffffff is a hexadecimal representation, and evaluates to 11111111111111111111111111111111 in binary. This ensures that all 32 threads are involved in the ballot. 

The outcome of the ballot is the mask 0x55555555, which in binary is 01010101010101010101010101010101. Bit N of the result corresponds to thread N, so the even threads (thread IDs 0, 2, 4, …) have their bits set to true, and the odd threads are set to false. 

The ballot is often used in conjunction with a __syncwarp, which can synchronize threads in a warp, based on the mask provided.

The following example uses both __ballot_sync and __syncwarp:

#include <cstdio>

static constexpr int NumThreads = 32;

__global__ void sumValues(int *sum_out) {
    // Shared memory must be declared in device code
    __shared__ int smem[NumThreads];

    int threadID = threadIdx.x;

    unsigned int mask = __ballot_sync(0xffffffff, threadID < (NumThreads / 2));

    if (threadID <= (NumThreads / 2)) {
        smem[threadID] = threadID;

        __syncwarp(mask);

        if (threadID == 0) {
            *sum_out = 0;
            for (int i = 0; i < (NumThreads / 2); ++i)
                *sum_out += smem[i];
        }
    }
}

int main() {
    int *sum_out = nullptr;

    cudaMallocManaged((void **)&sum_out, sizeof(int));

    sumValues<<<1, NumThreads>>>(sum_out);
    cudaDeviceSynchronize();

    printf("Sum out = %d\n", *sum_out);

    cudaFree(sum_out);
    return 0;
}

Before reading further, take a look at the code and try and work out what it is doing given your understanding of the ballot and the syncwarp functionality. See if you can spot what’s wrong. (Fifty points for this one—it’s more challenging.)

The purpose of this code is for each thread to assign a single value to shared memory, and then to sum all the values to get one answer. However, this is applied to only half the available threads. A single warp of 32 threads is set up through the execution configuration to execute the kernel sumValues.

In that kernel, create a mask using __ballot_sync with threadID < NumThreads/2 as the predicate, which will evaluate to true for the first half of the warp, where threadID is less than 16.

Source: NVIDIA