
Debugging applications that launch GPU kernels is more challenging than debugging CPU-only applications. Typical debuggers like gdb cannot debug GPU kernels or CUDA C application code. Nevertheless, several tools available on Cori GPU can aid in debugging GPU code. Most interactive command-line debuggers, such as cuda-gdb, require the additional srun flag --pty in order to function properly (please see the Cori GPU usage page for more details).


cuda-gdb is NVIDIA's official debugger for GPU code. It behaves much like GNU gdb but adds many enhancements for debugging GPU kernels. It is designed to debug non-MPI applications. For MPI-parallelized applications, NVIDIA recommends using debuggers provided by other vendors, which are summarized below. On Cori GPU nodes, cuda-gdb is provided by any of the cuda modules.
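
As a minimal sketch of how such a session might be started, the wrapper below launches cuda-gdb through srun with the --pty flag mentioned above. The function name `debug_on_gpu_node` and the executable name `my_gpu_app` are hypothetical. The sketch assumes that it is called from inside a Cori GPU job allocation, that a cuda module is already loaded, and that the code was built with debug symbols (for nvcc, `-g -G`).

```shell
# Hypothetical helper: start an interactive cuda-gdb session on a GPU node.
# Assumes it is called from inside a Cori GPU allocation with a cuda module loaded.
debug_on_gpu_node() {
    # Fail early with a clear message when Slurm's srun is not available.
    if ! command -v srun >/dev/null 2>&1; then
        echo "srun not found: run this from within a Cori GPU job allocation" >&2
        return 1
    fi
    # --pty attaches a pseudo-terminal so the debugger prompt is interactive.
    srun -n 1 --pty cuda-gdb "$@"
}

# Example (my_gpu_app is a placeholder executable name):
#   debug_on_gpu_node ./my_gpu_app
```

The guard only checks that srun exists; it does not verify that the current allocation actually contains a GPU node.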


Arm DDT (formerly "Allinea DDT") is a debugger for parallel codes which run on both CPUs and GPUs. Documentation for using Arm DDT to debug parallel CPU codes at NERSC is provided here. The same debugger can also be used to debug parallel GPU codes on Cori GPU nodes. Documentation regarding how to use Arm DDT to debug GPU codes is provided here.


TotalView is another debugger for parallel GPU applications which is available at NERSC. Documentation regarding how to debug parallel CPU-only applications at NERSC is provided here. Documentation regarding how to debug parallel GPU codes with TotalView is provided here.

How do I know if my code ran on the GPU?

While it is usually clear that a code has run at all, it is sometimes less clear whether the code ran on the CPU or the GPU. One way this ambiguity can arise is if one includes GPU offloading directives in the code, but does not use the appropriate compiler flag to enable those directives.

There are several ways to determine whether your code actually ran on the GPU. It is often more useful to recognize when a code did not run on the GPU (especially when one expects that it should), so each check below also shows what a CPU-only run looks like:

  • Run the code through an NVIDIA profiler such as Nsight Compute or nvprof. If a code runs on the GPU, both profilers print a summary after the code finishes. An nvprof summary, for example, looks like:

    ==39359== Profiling application: ./laplace2d_acc
    ==39359== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   53.75%  264.48ms      1000  264.48us  254.33us  292.35us  main_96_gpu
                       43.03%  211.73ms      1000  211.73us  210.27us  214.01us  main_109_gpu
                        1.37%  6.7473ms      1004  6.7200us  1.2790us  1.3656ms  [CUDA memcpy HtoD]
                        1.33%  6.5500ms      1005  6.5170us  1.4070us  1.2849ms  [CUDA memcpy DtoH]

    An Nsight Compute profile would look something like:

    An Nsight Compute profile of a code which does not run on the GPU at all will print the following message:

    user@cgpu12:~/tests> srun -n 1 nv-nsight-cu-cli ./a.out
    ==PROF== ERROR: Target application terminated before first instrumented API call.
    srun: error: cgpu12: task 0: Exited with exit code 255
    srun: Terminating job step 123456.5

    and an nvprof profile would print:

    user@cgpu12:~/tests> srun -n 1 nvprof ./a.out
    ======== Warning: No profile data collected.

  • Check compiler reports. If a code containing OpenACC directives is compiled with the PGI compiler but without the flags that enable those directives, the compiler prints no accelerator report, even when -Minfo=accel is given:

    user@cori02:~> pgcc -I../common -Minfo=accel -o laplace2d_acc laplace2d.c

    Passing the appropriate OpenACC flags to the compiler (in this case -acc -ta=nvidia) produces a report for each offloaded loop, including a note that the compiler generated Tesla code:

    user@cori02:~> pgcc -I../common -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
         86, Generating copy(A[:][:])
             Generating create(Anew[:][:])
         93, Loop is parallelizable
         96, Loop is parallelizable
             Generating Tesla code
             93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
             96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
            100, Generating implicit reduction(max:error)
        106, Loop is parallelizable
        109, Loop is parallelizable
             Generating Tesla code
            106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
            109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
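
The two checks above can be automated by saving the profiler and compiler output to files and searching for the marker strings shown in the listings. The sketch below is an illustrative helper, not a NERSC-provided tool. The function names and log file names are hypothetical, and the marker strings are simply those that appear in the sample output above, so they may differ between tool versions.

```shell
# Hypothetical helpers; the marker strings are copied from the sample output above.

# Classify a saved profiler log and print "gpu", "no-gpu", or "unknown".
# "gpu" is detected from the nvprof summary header only; the failure messages
# of both nvprof and Nsight Compute map to "no-gpu"; anything else is "unknown".
classify_profile_log() {
    log="$1"
    if grep -q 'GPU activities:' "$log"; then
        echo "gpu"
    elif grep -q -e 'No profile data collected' \
                 -e 'terminated before first instrumented API call' "$log"; then
        echo "no-gpu"
    else
        echo "unknown"
    fi
}

# Succeed (exit status 0) if a saved PGI -Minfo=accel report mentions
# GPU code generation, and fail (non-zero status) otherwise.
compiler_generated_gpu_code() {
    grep -q 'Generating Tesla code' "$1"
}

# Example usage (capture both stdout and stderr so nothing is missed):
#   srun -n 1 nvprof ./laplace2d_acc > nvprof.log 2>&1
#   classify_profile_log nvprof.log
#   pgcc -I../common -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c > build.log 2>&1
#   compiler_generated_gpu_code build.log && echo "GPU code was generated"
```

An "unknown" result means only that none of the known markers were found, so the log should be inspected by hand in that case.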