Debugging applications which launch GPU kernels is more challenging than debugging CPU-only applications. Typical debuggers like gdb cannot debug GPU kernels or CUDA C application code. Nevertheless, several tools are available on Cori GPU which can aid in debugging GPU code. Most interactive command-line debuggers like cuda-gdb require the additional --pty flag in order to function properly (please see the Cori GPU usage page for more details).
cuda-gdb is NVIDIA's official debugger for GPU code. It behaves much like GNU gdb, but has many enhancements for debugging GPU kernels. It is designed to debug non-MPI applications. For MPI-parallelized applications, NVIDIA recommends using debuggers provided by other vendors, which are summarized below. On Cori GPU nodes, cuda-gdb is provided by any of the
Arm DDT (formerly "Allinea DDT") is a debugger for parallel codes which run on both CPUs and GPUs. Documentation for using Arm DDT to debug parallel CPU codes at NERSC is provided here. The same debugger can also be used to debug parallel GPU codes on Cori GPU nodes. Documentation regarding how to use Arm DDT to debug GPU codes is provided here.
TotalView is another debugger for parallel GPU applications which is available at NERSC. Documentation regarding how to debug parallel CPU-only applications at NERSC is provided here. Documentation regarding how to debug parallel GPU codes with TotalView is provided here.
How do I know if my code ran on the GPU?
While it is usually clear that a code has run at all, it is sometimes less clear whether the code ran on the CPU or the GPU. One way this ambiguity can arise is if one includes GPU offloading directives in the code, but does not use the appropriate compiler flag to enable those directives.
There are several ways to determine if your code actually ran on the GPU. It may be more useful to know when a code does not run on the GPU (especially when one expects that it should):
An nvprof profile of a code which ran on the GPU lists its GPU activities:
==39359== Profiling application: ./laplace2d_acc
==39359== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   53.75%  264.48ms      1000  264.48us  254.33us  292.35us  main_96_gpu
                   43.03%  211.73ms      1000  211.73us  210.27us  214.01us  main_109_gpu
                    1.37%  6.7473ms      1004  6.7200us  1.2790us  1.3656ms  [CUDA memcpy HtoD]
                    1.33%  6.5500ms      1005  6.5170us  1.4070us  1.2849ms  [CUDA memcpy DtoH]
An Nsight Compute profile would look something like:
An Nsight Compute profile of a code which does not run on the GPU at all will print the following message:
user@cgpu12:~/tests> srun -n 1 nv-nsight-cu-cli ./a.out
==PROF== ERROR: Target application terminated before first instrumented API call.
srun: error: cgpu12: task 0: Exited with exit code 255
srun: Terminating job step 123456.5
user@cgpu12:~/tests>
and an nvprof profile would print:
user@cgpu12:~/tests> srun -n 1 nvprof ./a.out
======== Warning: No profile data collected.
user@cgpu12:~/tests>
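The profiler messages quoted above can also be checked automatically, for example in a batch script. The following is a minimal sketch, assuming the profiler output was saved to a file first; the function name classify_profile_log, the labels it prints, and the file name profile.log are hypothetical and chosen only for illustration. It matches only the strings shown in the examples above, so other profiler versions may need different patterns.

```shell
#!/bin/sh
# Classify a saved profiler log using the messages quoted above.
# Prints "gpu", "no-gpu", or "unknown". The function name and the
# labels are hypothetical, chosen only for this sketch.
classify_profile_log() {
    log_file="$1"
    if grep -q "GPU activities:" "$log_file"; then
        # The nvprof summary contains a GPU activities section.
        echo "gpu"
    elif grep -q -e "No profile data collected" \
                 -e "Target application terminated before first instrumented API call" \
                 "$log_file"; then
        # nvprof or Nsight Compute reported that nothing ran on the GPU.
        echo "no-gpu"
    else
        echo "unknown"
    fi
}

# Example use (profile.log is an assumed file name; both output
# streams are captured so the profiler messages are not missed):
#   srun -n 1 nvprof ./a.out > profile.log 2>&1
#   classify_profile_log profile.log
```

A result of "unknown" should be treated as a prompt to read the log by hand rather than as evidence either way.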
Check compiler reports. If one writes a code with OpenACC directives and compiles it with the PGI compiler, but omits the flags which tell the compiler to honor those directives, the compiler prints no accelerator report:
user@cori02:~> pgcc -I../common -Minfo=accel -o laplace2d_acc laplace2d.c
user@cori02:~>
Passing the appropriate OpenACC flags to the compiler (in this case -acc -ta=nvidia) results in more output, including a note that the compiler generated Tesla code:
user@cori02:~> pgcc -I../common -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     86, Generating copy(A[:][:])
         Generating create(Anew[:][:])
     93, Loop is parallelizable
     96, Loop is parallelizable
         Generating Tesla code
         93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
         96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
    100, Generating implicit reduction(max:error)
    106, Loop is parallelizable
    109, Loop is parallelizable
         Generating Tesla code
        106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
        109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
user@cori02:~>
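The compiler report can be checked the same way at build time. The sketch below counts the "Generating Tesla code" lines in a saved report, which in the example above number one per offloaded loop nest. The function name count_gpu_kernels and the file name build.log are hypothetical, and the pattern is specific to the PGI output shown here; other compilers or targets will word their reports differently.

```shell
#!/bin/sh
# Count the "Generating Tesla code" lines in a saved compiler report.
# The function name is hypothetical, chosen only for this sketch.
count_gpu_kernels() {
    # grep -c prints the number of matching lines, and exits with a
    # non-zero status when that number is 0; "|| true" keeps the
    # function usable in scripts that run under "set -e".
    grep -c "Generating Tesla code" "$1" || true
}

# Example use (build.log is an assumed file name; both output streams
# are captured so the report is not missed):
#   pgcc -I../common -acc -ta=nvidia -Minfo=accel \
#        -o laplace2d_acc laplace2d.c > build.log 2>&1
#   count_gpu_kernels build.log
```

A count of 0 for a code which contains OpenACC directives suggests that the flags enabling those directives were left out.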