Profiling¶
The Cori GPU nodes provide a few tools for profiling GPU code.
Nsight¶
Nsight is NVIDIA's new profiling suite which will replace nvprof after CUDA 10. It measures much of the same information as nvprof, but organizes information in different ways. Nsight is divided into two separate tools: Nsight Compute, and Nsight Systems.
Nsight Systems¶
Nsight Systems is a relatively low-overhead profiling tool which provides a broad description of a GPU application's performance. It is generally the best tool to start with when profiling a new application. Nsight Systems is available on Cori via the `nsight-systems` module. Older versions of Nsight Systems are bundled with the CUDA SDK, and may be available via the `cuda` modules.
A common workflow when profiling an application with nvprof is to simply run `srun ... nvprof <application> <args>`, which provides a summary of CUDA API and GPU kernel activity in the application. One can obtain approximately the same output as `nvprof ./my_code.exe` using Nsight Systems by running:

```
srun <args> nsys profile --stats=true -t nvtx,cuda <code> <args>
```
The `-t nvtx,cuda` flag instructs Nsight to trace only NVTX and CUDA activity, and to ignore most other activity. (If the application has no NVTX markers, one can use simply `-t cuda`.) The `--stats=true` flag tells Nsight to print a summary of the application's performance to STDOUT, similar to the output from nvprof. The Nsight output resembles the following:
```
Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%) Total Time Calls Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
90.8 8972535373 112 80111923.0 2387 365592608 cudaDeviceSynchronize
6.2 611000099 791 772440.1 385 299035623 cudaFree
2.7 268607547 745 360547.0 1428 245707260 cudaMalloc
0.1 12623512 56 225419.9 10369 813038 cudaMemcpy
0.1 8696988 108 80527.7 4748 7362867 cudaLaunchKernel
0.0 4813263 566 8504.0 6091 132788 cudaMemcpy2D
0.0 3653793 735 4971.1 2602 68339 cudaMemset
0.0 30464 18 1692.4 907 11455 cudaEventDestroy
0.0 17824 18 990.2 434 7087 cudaEventCreateWithFlags
0.0 7115 3 2371.7 2157 2544 cuInit

Generating CUDA Kernel Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%) Total Time Instances Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------------------------------------------
73.5 6590057118 36 183057142.2 63000409 365588974 void kronmult6_xbatched<double>(int, double const* const*, int, double**, double**, double**, int)
26.3 2357802925 36 65494525.7 26495997 103555789 void stage_inputs_kronmult_kernel<double>(double const*, double*, int, int)
0.3 23804445 36 661234.6 119999 1512978 void prepare_kronmult_kernel<double>(int const*, double* const*, int, double*, double*, double*, double**, double**, double**, double**, int, int, int, int, int, int, int)

Generating CUDA Memory Operation Statistics...
CUDA Memory Operation Statistics (nanoseconds)

Time(%) Total Time Operations Average Minimum Maximum Name
------- -------------- ---------- -------------- -------------- -------------- --------------------------------------------------------------------------------
51.9 4952754 161 30762.4 1280 530651 [CUDA memcpy HtoD]
32.1 3065729 9 340636.6 327772 359420 [CUDA memcpy DtoH]
10.8 1026448 735 1396.5 1119 9089 [CUDA memset]
5.2 496228 452 1097.8 1055 1728 [CUDA memcpy DtoD]

CUDA Memory Operation Statistics (KiB)

Total Operations Average Minimum Maximum Name
------------------- -------------- ------------------- ----------------- ------------------- --------------------------------------------------------------------------------
54253.937 735 73.815 0.281 2853.352 [CUDA memset]
27030.883 161 167.894 0.109 2853.352 [CUDA memcpy HtoD]
25680.164 9 2853.352 2853.352 2853.352 [CUDA memcpy DtoH]
226.000 452 0.500 0.500 0.500 [CUDA memcpy DtoD]
```
Nsight Systems will save the profiling output to a `.qdrep` file in the present working directory. One can then view the `.qdrep` profiling database via the Nsight Systems GUI. Adding the `--stats=true` flag to `nsys profile` causes Nsight Systems to automatically convert the `.qdrep` file into a `.sqlite` file. As with nvvp, the Nsight Systems GUI is best viewed from a Cori login node using NoMachine. Inside a NoMachine session, one can launch the Nsight Systems GUI with the `nsight-sys` command. From the GUI, one can import the `.qdrep` file to view the profiling output. An example of this output is shown below:
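Because `--stats=true` also produces a `.sqlite` export, one can run ad-hoc queries against the profiling database outside the GUI. The sketch below sums GPU time per kernel name; note that the table and column names used here (`CUPTI_ACTIVITY_KIND_KERNEL`, `StringIds`, `demangledName`) are assumptions based on recent Nsight Systems exports, and the schema varies between versions, so inspect your own file first with `sqlite3 report.sqlite .schema`.

```python
import sqlite3

def kernel_time_summary(db_path):
    """Return (kernel name, total GPU nanoseconds) pairs, busiest kernel first.

    Schema assumption: kernel records live in CUPTI_ACTIVITY_KIND_KERNEL and
    kernel names in StringIds; adjust the query to match your export's schema.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        # "start"/"end" are quoted because END is an SQL keyword.
        'SELECT s.value, SUM(k."end" - k."start") AS total_ns '
        "FROM CUPTI_ACTIVITY_KIND_KERNEL AS k "
        "JOIN StringIds AS s ON s.id = k.demangledName "
        "GROUP BY s.value ORDER BY total_ns DESC"
    ).fetchall()
    con.close()
    return rows
```

This reproduces (in miniature) the per-kernel totals that `--stats=true` prints, but lets one slice the data arbitrarily, e.g., per device or per stream.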
NVIDIA documentation about Nsight Systems is here.
Reduce Nsight Systems profile database size by disabling CPU sampling

Although Nsight Systems adds relatively little overhead to an application during profiling, it can nevertheless generate large profiling databases (several hundred MB, or even GB) if an application runs for a long time (minutes or more). One can reduce the size of the profiling database by disabling CPU sampling via the `-s none` flag to `nsys profile`.
Nsight Compute¶
The Nsight Compute tool enables deep dives into GPU code performance. It is often useful after an initial analysis has been done with Nsight Systems, and key GPU kernels have been identified as being critical to achieving high overall application performance. It is far less useful if the overall performance characteristics of the application are not known.
Nsight Compute is available on Cori GPU via the `nsight-compute` module. Similarly to Nsight Systems, an older version is typically included with the CUDA SDK installation, via the `cuda` modules. The command-line interface to Nsight Compute is `nv-nsight-cu-cli`, and the GUI is accessible via `nv-nsight-cu`; starting in version 2020.1, these commands have been simplified to `ncu` for the CLI and `ncu-ui` for the GUI. As with Nsight Systems, it is strongly recommended to use NoMachine when using the Nsight Compute GUI.
A typical Nsight Compute performance collection invocation has a similar form to nvprof's; one simply prefixes the application name with `ncu` (or `nv-nsight-cu-cli` in older versions of Nsight Compute), along with the desired Nsight Compute flags:

```
srun <srun args> ncu -o <filename> <other Nsight args> <code> <args>
```
Nsight Compute can add large overhead during application profiling
Unlike Nsight Systems, which generally adds relatively low overhead to an application's runtime, Nsight Compute can increase an application's runtime by orders of magnitude, due to the large amount of performance data it collects from GPU kernels. It is therefore strongly recommended to limit the scope of an Nsight Compute performance collection using the tips described in this document.
A few flags to Nsight Compute can reduce the overhead added to an application's runtime; these are summarized below. All of these flags are documented in more detail in the Nsight Compute documentation.

- `-k <expr>`: instructs Nsight Compute to profile only kernels whose names are matched by the regular expression `<expr>`; all non-matching kernels are ignored.
- `-s <num1> -c <num2>`: instructs Nsight Compute to skip the first `<num1>` GPU kernel launches, and to profile only `<num2>` kernels after that. These flags are useful when an application launches the same GPU kernels many times, e.g., in an iterative solver or in a time-stepping routine.
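As a concrete sketch of how these flags combine (the kernel-name pattern `kronmult`, the launch counts, and the output filename are placeholders for your own application, not values from any real code):

```shell
# Skip the first 10 launches of kernels whose names match "kronmult",
# then profile the next 5 matching launches, writing the report to the
# file named by -o:
srun -n 1 ncu -k kronmult -s 10 -c 5 -o kronmult_profile ./my_code.exe
```

Limiting the collection this way is usually the difference between a profiling run that finishes in minutes and one that takes hours.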
A new feature in Nsight Compute 2020.1 is an automated GPU roofline tool. Nsight Compute generates a GPU roofline plot in the GUI automatically, but only if all performance metrics are collected for the kernel; this can be accomplished by adding `--set full` to the list of Nsight Compute arguments during application profiling. Users should note that this flag increases profiling overhead by a very large amount, so it is recommended to limit the number of kernels profiled by using the flags summarized above. Below is a depiction of the roofline plot generated in Nsight Compute:
NVIDIA documentation about Nsight Compute is here.
nvprof¶
nvprof has been CUDA's standard profiling tool for several years. It is easy to use: one simply inserts the word `nvprof` in front of the application in the `srun` command, and it will profile the code and generate a report:
```
user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0
Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
Done!
user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 nvprof ./nvgraph_SpectralClustering
==152717== NVPROF is profiling process 152717, command: ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0
Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
Done!
==152717== Profiling application: ./nvgraph_SpectralClustering
==152717== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 26.12% 22.309ms 6162 3.6200us 3.3270us 5.4400us void nrm2_kernel<float, float, float, int=1, int=0, int=128, int=0>(cublasNrm2Params<float, float>)
16.07% 13.722ms 8850 1.5500us 1.5040us 2.1120us [CUDA memcpy DtoH]
13.07% 11.161ms 8620 1.2940us 1.2160us 1.9520us void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>)
11.51% 9.8280ms 5752 1.7080us 1.6310us 13.600us void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
10.79% 9.2173ms 2877 3.2030us 3.1040us 13.344us void csrMv_kernel<float, float, float, int=128, int=1, int=2>(cusparseCsrMvParams<float, float, float>)
10.28% 8.7775ms 5752 1.5250us 1.4710us 2.4960us void reduce_1Block_kernel<float, float, float, int=128, int=7>(float*, int, float*)
5.54% 4.7290ms 3295 1.4350us 1.3760us 1.8240us [CUDA memcpy HtoD]
4.23% 3.6102ms 3079 1.1720us 1.1200us 1.4400us void scal_kernel_val<float, float, int=0>(cublasScalParamsVal<float, float>)
1.06% 908.79us 206 4.4110us 3.5190us 8.1920us volta_sgemm_128x32_nn
0.57% 487.16us 413 1.1790us 1.1520us 1.6640us [CUDA memcpy DtoD]
...
```
nvprof is part of the CUDA SDK (available via the `cuda` modules on the Cori GPU nodes), but after CUDA 10 it will be replaced by Nsight Compute, NVIDIA's new profiling tool.
Documentation about nvprof is here.
nvvp¶
nvvp is the profiling GUI which accompanies nvprof. It is used for displaying profiling information collected by nvprof in a graphical interface. Since X11 window forwarding via SSH is typically slow, one will enjoy a much better nvvp experience by running it on a Cori login node using NoMachine. Example output of nvvp is shown below:
Combining MPI + GPU profiling¶
While MPI + CPU profilers are common, MPI + CPU + GPU profilers are less common. However, there are a few tools available on Cori GPU which can accomplish this task.
Combining `SLURM_TASK_PID` with Nsight or nvprof to profile MPI + GPU programs¶
By default, Nsight and nvprof profile only one task at a time; if one profiles a GPU code which has multiple tasks (e.g., multiple MPI ranks), Nsight and nvprof can save one profile per task if used with the `-o` flag. However, each file must have a unique filename, or else all tasks will attempt to write to the same file, typically resulting in an unusable profiling result.
One can use the `SLURM_TASK_PID` environment variable to save the profiling result from each task to a unique file. The nvvp GUI is able to combine nvprof profiles from multiple tasks into a single result. To do this, one must invoke a bash shell inside the `srun` statement in order to return a unique value of `SLURM_TASK_PID` for each task:
```
user@cgpu02:~> srun -n 2 -c 2 bash -c 'echo $SLURM_TASK_PID'
36032
36031
```
Note that, without the `bash -c` statement, each task will return the same value of `SLURM_TASK_PID`, because the variable is expanded by the shell from which `srun` is invoked, before the tasks are launched:
```
user@cgpu02:~> srun -n 2 -c 2 echo $SLURM_TASK_PID
35765
35765
```
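This expansion-timing pitfall can be demonstrated without Slurm at all, using an ordinary shell variable (`FOO` here is purely illustrative):

```shell
# With single quotes, $FOO reaches the child shell unexpanded, so the
# child sees the value it set itself:
bash -c 'FOO=child; echo $FOO'     # prints "child"

# With double quotes, the *current* shell expands $FOO before the child
# ever runs -- the same thing that happens to $SLURM_TASK_PID when it is
# passed to srun without bash -c:
FOO=parent
bash -c "FOO=child; echo $FOO"     # prints "parent"
```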
The easiest way to use `SLURM_TASK_PID` to produce a unique profiling database per MPI rank is to use it inside a shell script which is then invoked by `srun`, e.g.:
```
user@cgpu02:~> cat profile.sh
#!/bin/bash
suffix=$(bash -c 'echo $SLURM_TASK_PID')
nsys profile -o result-${suffix} ./main.ex
user@cgpu02:~> srun -n 2 -c 2 --cpu-bind=cores ./profile.sh
Collecting data...
Collecting data...
...
(code runs)
...
Processing events...
Processing events...
Capturing symbol files...
Capturing symbol files...
Saving intermediate "/path/to/result-37352.qdstrm" file to disk...
Saving intermediate "/path/to/result-37353.qdstrm" file to disk...
Importing [===============================================================100%]
Importing [===============================================================100%]
Saved report file to "/path/to/result-37352.qdrep"
Saved report file to "/path/to/result-37353.qdrep"
```
HPCToolkit¶
The HPCToolkit suite of profiling tools is available on Cori GPU for profiling MPI + GPU codes. It can be used by loading the hpctoolkit
module after loading the cgpu
module.