The Cori GPU nodes provide a few tools for profiling GPU code.

Nsight


Nsight is NVIDIA's new profiling suite, which will replace nvprof after CUDA 10. It measures much of the same information as nvprof but organizes it differently. Nsight is divided into two separate tools: Nsight Systems and Nsight Compute.

Nsight Systems

Nsight Systems is a relatively low overhead profiling tool which provides a broad description of a GPU application's performance. It is generally the best tool to start with when profiling a new application. Nsight Systems is available on Cori using the nsight-systems module. Older versions of Nsight Systems are bundled together with the CUDA SDK, and may be available via the cuda modules.

A common workflow when profiling an application with nvprof is to simply run srun ... nvprof <application> <args>, which provides a summary of CUDA API and GPU kernel activity in the application. One can achieve approximately the same output as nvprof ./my_code.exe using Nsight Systems by running

srun <args> nsys profile --stats=true -t nvtx,cuda <code> <args>

The -t nvtx,cuda flag instructs Nsight to trace only NVTX and CUDA activity, and to ignore most other activity. (If the application has no NVTX markers, one can use simply -t cuda.) The --stats=true flag tells Nsight to print a summary of the application's performance to STDOUT, similar to the output from nvprof. The Nsight output resembles the following:

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   90.8      8972535373         112      80111923.0            2387       365592608  cudaDeviceSynchronize                                                           
    6.2       611000099         791        772440.1             385       299035623  cudaFree                                                                        
    2.7       268607547         745        360547.0            1428       245707260  cudaMalloc                                                                      
    0.1        12623512          56        225419.9           10369          813038  cudaMemcpy                                                                      
    0.1         8696988         108         80527.7            4748         7362867  cudaLaunchKernel                                                                
    0.0         4813263         566          8504.0            6091          132788  cudaMemcpy2D                                                                    
    0.0         3653793         735          4971.1            2602           68339  cudaMemset                                                                      
    0.0           30464          18          1692.4             907           11455  cudaEventDestroy                                                                
    0.0           17824          18           990.2             434            7087  cudaEventCreateWithFlags                                                        
    0.0            7115           3          2371.7            2157            2544  cuInit                                                                          

Generating CUDA Kernel Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                                                                                                                                                                                                                                                                                         
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                                                         
   73.5      6590057118          36     183057142.2        63000409       365588974  void kronmult6_xbatched<double>(int, double const* const*, int, double**, double**, double**, int)                                                                                                                                                                                                                                           
   26.3      2357802925          36      65494525.7        26495997       103555789  void stage_inputs_kronmult_kernel<double>(double const*, double*, int, int)                                                                                                                                                                                                                                                                  
    0.3        23804445          36        661234.6          119999         1512978  void prepare_kronmult_kernel<double>(int const*, double* const*, int, double*, double*, double*, double**, double**, double**, double**, int, int, int, int, int, int, int)                                                                                                                                                                  

Generating CUDA Memory Operation Statistics...
CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   51.9         4952754         161         30762.4            1280          530651  [CUDA memcpy HtoD]                                                              
   32.1         3065729           9        340636.6          327772          359420  [CUDA memcpy DtoH]                                                              
   10.8         1026448         735          1396.5            1119            9089  [CUDA memset]                                                                   
    5.2          496228         452          1097.8            1055            1728  [CUDA memcpy DtoD]                                                              

CUDA Memory Operation Statistics (KiB)

              Total      Operations              Average            Minimum              Maximum  Name                                                                            
-------------------  --------------  -------------------  -----------------  -------------------  --------------------------------------------------------------------------------
          54253.937             735               73.815              0.281             2853.352  [CUDA memset]                                                                   
          27030.883             161              167.894              0.109             2853.352  [CUDA memcpy HtoD]                                                              
          25680.164               9             2853.352           2853.352             2853.352  [CUDA memcpy DtoH]                                                              
            226.000             452                0.500              0.500                0.500  [CUDA memcpy DtoD]                                                              

Nsight Systems will save the profiling output to a .qdrep file in the present working directory. One can then view the .qdrep profiling database via the Nsight Systems GUI. Adding the --stats=true flag to nsys profile causes Nsight Systems to automatically convert the .qdrep file into a .sqlite file. As with nvvp, the Nsight Systems GUI is best viewed from a Cori login node using NoMachine. Inside a NoMachine session, one can launch the Nsight Systems GUI with the nsight-sys command. From the GUI, one can import the .qdrep file to view the profiling output. An example of this output is shown below:

nsight-sys screenshot

NVIDIA documentation about Nsight Systems is here.

Reduce Nsight Systems profile database size by disabling CPU sampling

Although Nsight Systems adds relatively little overhead to an application during profiling, it can nevertheless generate large profiling databases (several hundred MB or even several GB) if an application runs for a long time (minutes or more). One can reduce the size of the profiling database by disabling CPU sampling via the -s none flag to nsys profile.
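
As a minimal sketch, the -s none flag can be combined with the tracing options shown earlier. The report name my_profile, the application ./my_code.exe, and the task count -n 1 are placeholders chosen for illustration. The snippet only assembles and prints the command so that it can be inspected before it is run inside a job allocation.

```shell
# Placeholders (assumptions): report name and application path.
report="my_profile"
app="./my_code.exe"

# -s none disables CPU sampling; -t nvtx,cuda traces only NVTX and CUDA activity.
cmd="srun -n 1 nsys profile -s none -t nvtx,cuda -o $report $app"

# Print the command for inspection; to execute it, run: eval "$cmd"
echo "$cmd"
```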

Nsight Compute

The Nsight Compute tool enables deep dives into GPU code performance. It is often useful after an initial analysis has been done with Nsight Systems, and key GPU kernels have been identified as being critical to achieving high overall application performance. It is far less useful if the overall performance characteristics of the application are not known.

Nsight Compute is available on Cori GPU via the nsight-compute module. As with Nsight Systems, an older version is typically included with the CUDA SDK installation, via the cuda modules. The command line interface to Nsight Compute is nv-nsight-cu-cli, and the GUI is accessible via nv-nsight-cu; starting in version 2020.1, these commands have been simplified to ncu for the CLI and ncu-ui for the GUI. As with Nsight Systems, it is strongly recommended to use NoMachine when using the Nsight Compute GUI.

A typical Nsight Compute performance collection invocation has a similar form to nvprof; one simply prefixes the application name with ncu (or nv-nsight-cu-cli in older versions of Nsight Compute), along with the desired Nsight Compute flags:

srun <srun args> ncu -o <filename> <other Nsight args> <code> <args>

Nsight Compute can add large overhead during application profiling

Unlike Nsight Systems, which generally adds relatively low overhead to an application's runtime, Nsight Compute can increase an application's runtime by orders of magnitude, due to the large amount of performance data it collects from GPU kernels. It is therefore strongly recommended to limit the scope of an Nsight Compute performance collection using the tips described in this document.

A few Nsight Compute flags, summarized below, can reduce the overhead added to an application's runtime. All of these flags are documented in more detail in the Nsight Compute documentation.

  • -k <expr>: this instructs Nsight Compute to profile kernels whose names are matched by the regular expression <expr>; all non-matching kernels are ignored;
  • -s <num1> -c <num2>: these flags instruct Nsight Compute to skip the first <num1> GPU kernel launches, and to only profile <num2> kernels after that. These flags are useful when an application launches the same GPU kernels many times, e.g., in an iterative solver or in a time-stepping routine.
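
The flags above can be combined in a single invocation, as in the following sketch. The kernel expression my_kernel, the counts 10 and 5, the report name my_profile, and the application ./my_code.exe are all hypothetical values. How -k matches kernel names may differ between Nsight Compute versions, so it is worth checking ncu --help for the installed version. The snippet prints the command rather than running it.

```shell
# Hypothetical values chosen for illustration only.
kernel_filter="my_kernel"   # expression selecting the kernels to profile
skip=10                     # number of kernel launches to skip first
count=5                     # number of kernel launches to profile afterwards

# Assemble the restricted-scope collection command.
cmd="srun -n 1 ncu -k $kernel_filter -s $skip -c $count -o my_profile ./my_code.exe"

# Print the command for inspection; to execute it, run: eval "$cmd"
echo "$cmd"
```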

A new feature in Nsight Compute 2020.1 is an automated GPU roofline tool. Nsight Compute generates a GPU roofline plot in the GUI automatically, but only if all performance metrics are collected on the kernel; this can be accomplished by adding --set full to the list of Nsight Compute arguments during application profiling. Users should note that this flag increases profiling overhead by a very large amount, so it is recommended to limit the number of kernels profiled by using the flags summarized above. Below is a depiction of the roofline plot generated in Nsight Compute:

Nsight Compute roofline screenshot
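
A roofline collection restricted to a single launch of one kernel might look like the following sketch. The kernel expression my_kernel, the report name roofline, and the application ./my_code.exe are hypothetical. Limiting the collection to one launch with -c 1 is one way to keep the overhead of --set full manageable, not a requirement of the tool.

```shell
# Hypothetical kernel expression; replace with a kernel identified via Nsight Systems.
kernel_filter="my_kernel"

# --set full collects all metrics (needed for the roofline plot);
# -k and -c restrict the collection to one launch of the selected kernel.
cmd="srun -n 1 ncu --set full -k $kernel_filter -c 1 -o roofline ./my_code.exe"

# Print the command for inspection; to execute it, run: eval "$cmd"
echo "$cmd"
```

The resulting report can then be opened in the Nsight Compute GUI to view the roofline plot.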

NVIDIA documentation about Nsight Compute is here.

nvprof


nvprof has been CUDA's standard profiling tool for several years. It is easy to use: one simply inserts the word nvprof in front of the application in the srun command, and it profiles the code and generates a report:

user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0

Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 nvprof ./nvgraph_SpectralClustering
==152717== NVPROF is profiling process 152717, command: ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0

Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
==152717== Profiling application: ./nvgraph_SpectralClustering
==152717== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   26.12%  22.309ms      6162  3.6200us  3.3270us  5.4400us  void nrm2_kernel<float, float, float, int=1, int=0, int=128, int=0>(cublasNrm2Params<float, float>)
                   16.07%  13.722ms      8850  1.5500us  1.5040us  2.1120us  [CUDA memcpy DtoH]
                   13.07%  11.161ms      8620  1.2940us  1.2160us  1.9520us  void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>)
                   11.51%  9.8280ms      5752  1.7080us  1.6310us  13.600us  void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
                   10.79%  9.2173ms      2877  3.2030us  3.1040us  13.344us  void csrMv_kernel<float, float, float, int=128, int=1, int=2>(cusparseCsrMvParams<float, float, float>)
                   10.28%  8.7775ms      5752  1.5250us  1.4710us  2.4960us  void reduce_1Block_kernel<float, float, float, int=128, int=7>(float*, int, float*)
                    5.54%  4.7290ms      3295  1.4350us  1.3760us  1.8240us  [CUDA memcpy HtoD]
                    4.23%  3.6102ms      3079  1.1720us  1.1200us  1.4400us  void scal_kernel_val<float, float, int=0>(cublasScalParamsVal<float, float>)
                    1.06%  908.79us       206  4.4110us  3.5190us  8.1920us  volta_sgemm_128x32_nn
                    0.57%  487.16us       413  1.1790us  1.1520us  1.6640us  [CUDA memcpy DtoD]

nvprof is part of the CUDA SDK (available via the cuda modules on the Cori GPU nodes), but after CUDA 10 it will be replaced by Nsight, NVIDIA's new profiling suite.

Documentation about nvprof is here.

nvvp


nvvp is the profiling GUI which accompanies nvprof. It is used for displaying profiling information collected by nvprof in a GUI. Since X11 window forwarding via SSH is typically slow, one will enjoy a much better nvvp experience by running it on a Cori login node using NoMachine. Example output of nvvp is shown below:

nvvp screenshot

Combining MPI + GPU profiling

While MPI + CPU profilers are common, MPI + CPU + GPU profilers are less common. However, there are a few tools available on Cori GPU which can accomplish this task.

Combining SLURM_TASK_PID with Nsight or nvprof to profile MPI + GPU programs

By default, Nsight and nvprof profile only one task at a time; if one profiles a GPU code which has multiple tasks (e.g., multiple MPI ranks), Nsight and nvprof can save one profile per task if used with the -o flag. However, each file must have a unique filename, or else all tasks will attempt to write to the same file, typically resulting in an unusable profiling result.

One can use the SLURM_TASK_PID environment variable to save the profiling result from each task to a unique file. The nvvp GUI is able to combine nvprof profiles from multiple tasks into a single result. To do this, one must invoke a bash shell inside the srun statement in order to return a unique value for the SLURM_TASK_PID for each task:

user@cgpu02:~> srun -n 2 -c 2 bash -c 'echo $SLURM_TASK_PID'

Note that, without the bash -c statement, each task will return the same value of SLURM_TASK_PID, because the variable is expanded by the shell that launches srun rather than by the shell of each task:

user@cgpu02:~> srun -n 2 -c 2 echo $SLURM_TASK_PID
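
The difference between the two commands comes down to which shell expands the variable, and it can be reproduced without srun. In the sketch below, the values 111 and 222 are arbitrary stand-ins: 111 plays the role of whatever the launching shell holds, and 222 plays the role of the value a single task would see. It is best tried in a scratch shell, since it sets SLURM_TASK_PID.

```shell
# 111 stands in for the value held by the launching shell (arbitrary number).
export SLURM_TASK_PID=111

# Double quotes: the launching shell expands the variable before the child starts,
# so the child bash only ever sees the already-substituted text.
early=$(SLURM_TASK_PID=222 bash -c "echo $SLURM_TASK_PID")

# Single quotes: the text reaches the child bash unexpanded,
# so the child expands it using its own environment (222 here).
late=$(SLURM_TASK_PID=222 bash -c 'echo $SLURM_TASK_PID')

echo "early=$early late=$late"   # prints: early=111 late=222
```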

The easiest way to use SLURM_TASK_PID to produce a unique profiling database per MPI rank is to use it inside a shell script which is then invoked by srun, e.g.:

user@cgpu02:~> cat

suffix=$(bash -c 'echo $SLURM_TASK_PID')
nsys profile -o result-${suffix} ./main.ex
user@cgpu02:~> srun -n 2 -c 2 --cpu-bind=cores ./
Collecting data...
Collecting data...
(code runs)
Processing events...
Processing events...
Capturing symbol files...
Capturing symbol files...
Saving intermediate "/path/to/result-37352.qdstrm" file to disk...
Saving intermediate "/path/to/result-37353.qdstrm" file to disk...

Importing [===============================================================100%]
Importing [===============================================================100%]
Saved report file to "/path/to/result-37352.qdrep"
Saved report file to "/path/to/result-37353.qdrep"
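
The same idea can be packaged as a small helper inside the wrapper script, as in the following sketch. The function name report_name is hypothetical, and falling back to the shell's own PID when SLURM_TASK_PID is unset is an assumption added here so that the helper can be tried outside of srun. Because srun starts the script once per task, the script can read the variable directly.

```shell
#!/bin/bash
# report_name PREFIX: print "PREFIX-<id>", where <id> is SLURM_TASK_PID.
# Assumption: outside of srun the variable is unset, so fall back to $$.
report_name() {
    printf '%s-%s\n' "$1" "${SLURM_TASK_PID:-$$}"
}

# Inside a script launched by srun, each task could then run, for example:
#   nsys profile -o "$(report_name result)" ./main.ex
#   nvprof -o "$(report_name result).nvvp" ./main.ex
echo "this task would write: $(report_name result)"
```

The nvprof line illustrates how the same naming scheme could produce one file per task for later import into nvvp; the .nvvp extension is a naming choice, not a requirement.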