# Hierarchical Roofline Performance Analysis for Deep Learning Applications

Charlene Yang, Yunsong Wang, Steven Farrell
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, CA USA
{cjyang, yunsongwang, sfarrell}@lbl.gov

Thorsten Kurth
NVIDIA Corporation, Santa Clara, CA USA
tkurth@nvidia.com

Samuel Williams
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA USA
swwilliams@lbl.gov

Abstract: This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses the extension of the Empirical Roofline Toolkit for broader support of a range of data precisions and Tensor Core support and introduces an Nsight Compute based method to accurately collect application performance information. This methodology allows for automated machine characterization and application characterization for Roofline analysis across the entire memory hierarchy on NVIDIA GPUs, and it is validated by a complex deep learning application used for climate image segmentation. We use two versions of the code, in TensorFlow and PyTorch respectively, to demonstrate the use and effectiveness of this methodology. We highlight how the application utilizes the compute and memory capabilities on the GPU and how the implementation and performance differ in two deep learning frameworks.

Index Terms: Roofline Model, Performance Analysis, Memory Hierarchy, NVIDIA GPUs, Deep Learning, Image Segmentation

## I. INTRODUCTION

The Roofline model [1] is an intuitive performance model that can offer valuable insights into application performance, performance bottlenecks, and possible optimization opportunities. Its ability to extract the key computational characteristics and abstract away the complexity of modern computer architectures has made it popular in recent years in both traditional high-performance computing (HPC) and machine learning.
Roofline is a throughput-oriented model centered around the interplay of computational capabilities, memory bandwidth, and data locality. Data locality is expressed as the arithmetic intensity (AI), the reuse of data once it is being loaded from memory, and it is commonly calculated as the ratio of the floating-point operations performed to the data movement, i.e. FLOPs per byte. The sustained performance (GFLOP/s) is then bound by two terms: $$GFLOP/s \leq min \begin{cases} Peak \ GFLOP/s \\ Peak \ GB/s \times Arithmetic \ Intensity \end{cases} \tag{1}$$ The Roofline model conventionally only focuses on one level in the memory hierarchy, but this has been extended in recent years to the full memory system to help understand cache reuse and data locality and provide additional insights into code performance. To facilitate the Roofline study, many tools and workflows have sprung to life, for example, the Empirical Roofline Toolkit (ERT) developed at the Lawrence Berkeley National Laboratory, for more accurate machine characterization [2], [3], and other tools, methodologies, and workflows for more streamlined application performance data collection in [4]-[7]. A range of studies have also been conducted on the application of Roofline in both traditional HPC [7]-[12] and Machine Learning [7], [13], [14], and the extension and refinement of the model to other related topics such as instruction Roofline [15], time-based Roofline [14], Roofline scaling trajectories [16], performance portability analysis based on Roofline [3], and power and energy Roofline [17], [18]. Deep learning has become one of the most dominant tools in areas such as pattern recognition, object detection, image segmentation, and language processing [19], [20], and its training or inference process usually takes a long time and requires significant computational resources. 
To tackle this problem, many innovative methods have been proposed [21], [22] to scale up such applications, and in this paper, we will focus on the Roofline-based performance modeling to analyze and examine how well various deep learning frameworks are utilizing the different aspects of the computer architecture, especially NVIDIA GPUs. We will propose a practical methodology for collecting necessary performance data to conduct hierarchical Roofline analysis on NVIDIA GPUs. There are two components to this methodology, machine characterization using the Empirical Roofline Toolkit (ERT) [2] and application characterization using Nsight Compute [23]. We will discuss the extension of ERT for support on multiple data precisions and Tensor Core operations, and the Nsight Compute metrics used to measure application performance such as the run time, sustained throughput, and data movement across the entire memory hierarchy. This methodology then will be validated by a state-of-the-art deep learning application, DeepCAM [21], in climate image segmentation, to demonstrate its effectiveness in application analysis. Two versions of the code will be examined, in TensorFlow and PyTorch respectively, and some insights will be highlighted on how deep learning applications, in general, utilize the compute/memory capabilities on NVIDIA GPUs and how the two deep learning frameworks, TensorFlow and PyTorch, can differ in implementation and performance. #### II. METHODOLOGIES In this section, we will discuss the extension work done on the Empirical Roofline Toolkit (ERT) in order to support multiple data precisions (such as FP16) and Tensor Core operations on NVIDIA GPUs, and the set of metrics in Nsight Compute that can be used to measure application performance such as run time, sustained throughput and data movement at different levels of the memory hierarchy. 
These two components together comprise the complete data collection methodology for machine and application characterization in a hierarchical Roofline analysis on NVIDIA GPUs.

## A. ERT Extensions for Machine Characterization

The Empirical Roofline Toolkit (ERT) [2] is developed and maintained by the Lawrence Berkeley National Laboratory. It consists of micro-kernels that are finely tuned to test the various aspects of computer architecture such as memory bandwidth and compute throughput. Compared to theoretical values or marketing numbers from vendors, this provides a more accurate understanding of the architecture's capability in real programming environments with real power, thermal constraints, and programming models.

ERT is essentially a Python script that wraps around a range of micro-kernels written in C++ and parallelized with various programming models on different architectures. For example, OpenMP and MPI are used on Intel CPUs, CUDA is used on NVIDIA GPUs, and more micro-kernels are currently being added to support AMD architectures, IBM Power processors, and Intel GPUs. These micro-kernels are specifically tuned to test different aspects of the architecture and provide an upper bound for real-life applications on them, i.e. if such kernels cannot reach a certain performance, there is almost no hope for large complex applications in real life to achieve it.

Fig. 1: Roofline graph generated by empirical results for V100 GPU including the new reduced-precision and tensor core ceilings.

Prior to this paper, ERT only supported double-precision (FP64) performance characterization. In this section, we detail how we have extended it to support single-precision (FP32) and half-precision (FP16), as well as Tensor Core operations on NVIDIA GPUs. The resultant Roofline ceilings are shown in Fig. 1, with 7.7 TFLOP/s for FP64, 15.2 TFLOP/s for FP32, 29.2 TFLOP/s for FP16 on the CUDA core, and 103.7 TFLOP/s on the Tensor Core, on V100 GPUs.
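As a minimal sketch of how these ceilings enter the bound in Equation 1, the Python snippet below evaluates the attainable performance for a given arithmetic intensity. The compute ceilings are the V100 values quoted above; the function names and the 800 GB/s bandwidth are illustrative assumptions, not measurements from this paper.

```python
# Compute ceilings (TFLOP/s) measured by ERT on V100, as quoted in the text.
V100_COMPUTE_CEILINGS_TFLOPS = {
    "FP64": 7.7,
    "FP32": 15.2,
    "FP16": 29.2,
    "TensorCore": 103.7,
}


def attainable_gflops(arithmetic_intensity, peak_gflops, peak_gbs):
    """Roofline bound (Equation 1): min(peak compute, bandwidth x AI)."""
    if arithmetic_intensity < 0 or peak_gflops < 0 or peak_gbs < 0:
        raise ValueError("inputs must be non-negative")
    return min(peak_gflops, peak_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, peak_gbs):
    """Arithmetic intensity (FLOPs/byte) where the two bounds intersect."""
    return peak_gflops / peak_gbs


if __name__ == "__main__":
    # ASSUMPTION: 800 GB/s is a placeholder bandwidth for illustration only.
    placeholder_bw_gbs = 800.0
    for name, tflops in V100_COMPUTE_CEILINGS_TFLOPS.items():
        peak = tflops * 1000.0  # TFLOP/s -> GFLOP/s
        bound = attainable_gflops(2.0, peak, placeholder_bw_gbs)
        print(f"{name}: bound at AI=2 is {bound:.1f} GFLOP/s, "
              f"ridge point {ridge_point(peak, placeholder_bw_gbs):.2f} FLOPs/byte")
```

For a hierarchical Roofline, the same function would be called once per memory level with that level's measured bandwidth.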
1) Single-Precision (FP32) and Half-Precision (FP16): The original ERT is written in C and only supports double precision (FP64) measurements. While this can be easily extended to single-precision (FP32) by replacing 'double' with 'float' in the code, it requires work to support half-precision (FP16). For maintainability and future extensibility purposes, we have rewritten ERT in C++ and leverage C++ templates to support multiple data types.

TABLE I: FP16 Performance on CUDA Core on V100 GPUs

| Version | Implementation                | Performance (TFLOP/s) |
|---------|-------------------------------|-----------------------|
| v1      | naive                         | 15.421                |
| v2      | replace half with half2       | 20.142                |
| v3      | uint32_t for indexing         | 28.152                |
| v4      | inline intermediate variables | 28.376                |
| v5      | uint32_t only                 | 29.182                |

For FP32, we have easily obtained 15.2 TFLOP/s peak performance, which is within 5% of the advertised 15.7 TFLOP/s performance [24]. For FP16 (on the CUDA Core), some performance tuning is required as detailed in Tab. I. The naive implementation (v1) simply passes half as the data type to the templated functions, which resulted in a performance similar to that of FP32, 15.4 TFLOP/s. This is because V100s do not support FP16 directly on the CUDA Core [24] and each FP16 operation is essentially executed as an FP32 operation (i.e. going through the same pipeline). To perform FP16 operations efficiently on the CUDA Core (even though utilizing the Tensor Core would be a good option), the vector type half2 can be used to pack two FP16 values together into one FP32 register so that they are executed in one FP32 instruction. In ERT, we have implemented this using intrinsic functions and obtained an improved performance of 20.1 TFLOP/s (v2) in Tab. I. In real life, it is not feasible to implement large-scale applications in intrinsics, but our implementation is an attempt to push the Roofline ceiling as high as we possibly can.

The remaining three versions, v3-v5 in Tab. I, are a series of optimizations that have proved to be beneficial to the development of ERT and are expected to be largely helpful to real-life applications and their performance tuning as well. Out of the three, replacing *uint64\_t* indexing variables with the *uint32\_t* data type has proven to bring the most performance gain, from 20.1 TFLOP/s to 28.2 TFLOP/s. This is due to the fact that V100s only support INT32 integer operations on the hardware level and that there is constant type conversion between *uint64\_t* and *uint32\_t* in the second version of ERT (v2). With the inlining of intermediate variables in v4 and the conversion of all integers to *uint32\_t* in v5, the FP16 CUDA Core performance of ERT has been brought on par with the theoretical peak, at 29.2 TFLOP/s in Fig. 1.

2) Tensor Core: NVIDIA Tensor Cores are designed to accelerate matrix-matrix multiplication operations, which represent the mathematical nature of many deep learning workloads, for example, convolutional neural networks (CNNs). They operate on $4 \times 4$ matrices and can perform the following matrix multiplication and accumulation extremely efficiently:

$$D = A \times B + C \tag{2}$$

where A and B are matrices in FP16, and C and D are matrices in either FP16 or FP32. V100 has 80 SMs and 8 Tensor Cores per SM, and at 1.312 GHz clock frequency, its theoretical Tensor Core peak can be calculated as

$$80 \times 8 \times 1.312 \times 4^3 \times 2 = 107.479 \text{ TFLOP/s} \tag{3}$$

To stress test the Tensor Cores on V100, we have implemented ERT based on general matrix-matrix multiplications (GEMMs), where $\alpha$ and $\beta$ are constant coefficients:

$$D = \alpha * A \times B + \beta * C \tag{4}$$

In general, there are two ways to program on Tensor Core: using the WMMA (Warp Matrix Multiply Accumulate) API in CUDA [25], or using libraries such as cuBLAS [26] and cuDNN [27].
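The arithmetic behind Equation 3 can be checked with a short Python sketch. The function and parameter names are hypothetical; the reading of the $4^3$ term as 64 fused multiply-adds per Tensor Core per cycle, each counted as two FLOPs, is our interpretation of the equation.

```python
def tensor_core_peak_tflops(num_sms=80, tensor_cores_per_sm=8,
                            clock_ghz=1.312, tile_dim=4):
    """Theoretical Tensor Core peak following Equation 3."""
    # ASSUMPTION: 4^3 is read as FMAs per Tensor Core per cycle.
    fma_per_cycle = tile_dim ** 3
    flops_per_cycle = 2 * fma_per_cycle  # each FMA counts as two FLOPs
    gflops = num_sms * tensor_cores_per_sm * clock_ghz * flops_per_cycle
    return gflops / 1000.0  # GFLOP/s -> TFLOP/s


def fraction_of_peak(measured_tflops, peak_tflops):
    """Share of the theoretical peak reached by a measured ceiling."""
    return measured_tflops / peak_tflops


if __name__ == "__main__":
    peak = tensor_core_peak_tflops()
    print(f"theoretical peak: {peak:.3f} TFLOP/s")
    # 103.7 TFLOP/s is the empirical Tensor Core ceiling shown in Fig. 1.
    print(f"empirical ceiling: {100 * fraction_of_peak(103.7, peak):.1f}% of peak")
```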
The *nvcuda::wmma* namespace in CUDA provides specialized matrix load, multiply, accumulate and store operations and allows for direct programming on Tensor Cores. The cuBLAS and cuDNN libraries, on the other hand, shield users from low-level CUDA programming and provide a versatile, highly tuned, high-level user API for GEMM and other operations.

For a given GEMM in Equation 4 with matrix size $M \times N$ for A, $N \times K$ for B, and $M \times K$ for C and D, if M = N = K, the total number of FLOPs performed in this kernel can be calculated as $M^3 \times 2$. This estimate excludes the multiplications by the constant coefficients $\alpha$ and $\beta$, which are usually performed on the CUDA Core rather than the Tensor Core and are negligible. With the run time t, we can then estimate the FLOP/s performance of the kernel as $(M^3 \times 2)/t$ for a given matrix size in Fig. 2. It is clear that as the matrix size increases, so does the performance of both the wmma and cuBLAS approaches. At the largest size, M = N = K = 32768, we have obtained 103.7 TFLOP/s (96.5% of the theoretical peak) from the cuBLAS approach and 58 TFLOP/s (54%) from the wmma approach. This gap is largely due to the optimizations in cuBLAS such as the use of shared memory, data padding (to avoid bank conflicts in shared memory), and highly tuned thread block size, tile size, and other parameters. For the rest of this paper, we will use 103.7 TFLOP/s as the Tensor Core peak; however, the 58 TFLOP/s performance provides an empirical upper bound for users who program in *wmma* on the Tensor Core.

Fig. 2: Tensor Core Performance as a function of matrix size for cuBLAS and hand-optimized WMMA implementations of matrix multiplication.

#### B. Nsight Compute Metrics for Application Characterization

The application characterization methodology for Roofline analysis on NVIDIA GPUs has been evolving as the developer toolchain changes. The first proposed methodology
was based on nvprof [28] in [7], and then an Nsight Compute [29] based methodology was developed in [30] and briefly presented in [31]. In this paper, we will discuss in detail how the Nsight Compute metrics can be used for hierarchical Roofline analysis on NVIDIA GPUs and demonstrate its effectiveness in analyzing deep learning applications.

The Nsight profiling toolkit is replacing nvprof as the new performance tool suite for NVIDIA GPU developers. It consists of three components, Nsight Systems, Nsight Compute, and Nsight Graphics, with the first two being most relevant to scientific application and machine learning application development. Nsight Systems can provide a system-wide visualization of application performance and help users identify issues such as insufficient parallelism on the GPU, unnecessary device-host data transfers, and inefficient kernel synchronization, while Nsight Compute dives a bit deeper and allows for the collection of more detailed performance metrics such as warp issue statistics, instruction pipeline utilization, and memory access patterns.

The two generations of developer tools, nvprof and Nsight Compute, have a few major differences.

- nvprof uses CUPTI [32] while Nsight Compute is based on PerfWorks [33], a new framework for performance metric collection.
- The metrics in Nsight Compute are more nuanced than in nvprof, with some nvprof metrics broken down into several finer-grained metrics in Nsight Compute.
- The naming and organizing convention in Nsight Compute is more structured as well, with components such as unit, subunit, interface, counter name, rollup metric and submetric, used to distinguish different metrics.
- Kernel replay when multiple metrics are being collected, and profiling overhead, are more optimized in Nsight Compute, to provide faster and more accurate hardware and software counter measurements.
To construct a hierarchical Roofline on NVIDIA GPUs, we need to collect the following quantities: kernel run time, the total number of FLOPs performed in each kernel, and the number of bytes being read and written at each level of the memory hierarchy. With Nsight Compute, we can use this command to collect the metrics listed in Tab. II.

    nv-nsight-cu-cli --metrics metric ./application

1) Kernel Run Time: As shown in TABLE II, we use the metric sm\_\_cycles\_elapsed.avg to obtain the total number of elapsed cycles and its submetric per\_second to get the rate (number of cycles per second), in order to calculate the kernel execution time:

$$time = cycles/rate \tag{5}$$

2) FLOPs: To count the number of FLOPs performed in the kernel, Nsight Compute doesn't provide a unified metric like flop\_count\_dp in nvprof. Instead, for each floating-point precision (FP64, FP32 and FP16), it splits the measurement into three metrics based on the instruction type: addition, multiplication, and fused multiply-add (FMA). Note that each FMA is considered two FLOPs, so the total number of FLOPs can be calculated as $add + 2 \times fma + mul$ for each data precision. Also, one can tell from the naming of the metrics that only non-predicated threads are counted in these FLOPs, i.e. masked operations are not included.

For Tensor Core, we count the number of warp instructions by using the sm\_\_inst\_executed\_pipe\_tensor.sum metric, and the total number of Tensor Core FLOPs is

$$FLOP_{tc} = Inst_{tc} \times 512 \tag{6}$$

3) Bytes: Metrics are listed in TABLE II for measuring the data movement on each level of the memory hierarchy.
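The post-processing implied by Equations 5 and 6 and the FLOP-counting rule can be sketched in Python as follows. This is an illustrative helper, not part of Nsight Compute or ERT: the function names are hypothetical, and the counter values in the usage example are made up.

```python
TENSOR_FLOPS_PER_WARP_INST = 512  # factor used in Equation 6


def kernel_time_seconds(cycles_elapsed_avg, cycles_per_second):
    """Equation 5: time = cycles / rate."""
    if cycles_per_second <= 0:
        raise ValueError("cycle rate must be positive")
    return cycles_elapsed_avg / cycles_per_second


def cuda_core_flops(add, mul, fma):
    """FLOPs for one precision: add + 2 x fma + mul (an FMA is two FLOPs)."""
    return add + 2 * fma + mul


def tensor_core_flops(tensor_warp_insts):
    """Equation 6: Tensor Core FLOPs from the warp instruction count."""
    return tensor_warp_insts * TENSOR_FLOPS_PER_WARP_INST


def roofline_point(total_flops, bytes_moved, time_s):
    """Return (arithmetic intensity in FLOPs/byte, performance in GFLOP/s)."""
    if time_s <= 0:
        raise ValueError("run time must be positive")
    # A kernel that moves no bytes at this level has unbounded intensity.
    ai = total_flops / bytes_moved if bytes_moved else float("inf")
    return ai, total_flops / time_s / 1e9


if __name__ == "__main__":
    # HYPOTHETICAL counter values for a single kernel, for illustration only.
    t = kernel_time_seconds(1.312e6, 1.312e9)
    flops = cuda_core_flops(add=1e6, mul=1e6, fma=4e6) + tensor_core_flops(2e6)
    for level, nbytes in {"L1": 8e8, "L2": 4e8, "HBM": 1e8}.items():
        ai, gflops = roofline_point(flops, nbytes, t)
        print(f"{level}: AI = {ai:.2f} FLOPs/byte, {gflops:.1f} GFLOP/s")
```

One (AI, GFLOP/s) pair per memory level gives the triplet of points plotted for each kernel on a hierarchical Roofline chart.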
TABLE II: Nsight Compute metrics for hierarchical Roofline

| Quantity          | Metrics |
|-------------------|---------|
| Time              | sm__cycles_elapsed.avg<br>sm__cycles_elapsed.avg.per_second |
| FP64 FLOPs        | sm__sass_thread_inst_executed_op_dadd_pred_on.sum<br>sm__sass_thread_inst_executed_op_dmul_pred_on.sum<br>sm__sass_thread_inst_executed_op_dfma_pred_on.sum |
| FP32 FLOPs        | sm__sass_thread_inst_executed_op_fadd_pred_on.sum<br>sm__sass_thread_inst_executed_op_fmul_pred_on.sum<br>sm__sass_thread_inst_executed_op_ffma_pred_on.sum |
| FP16 FLOPs        | sm__sass_thread_inst_executed_op_hadd_pred_on.sum<br>sm__sass_thread_inst_executed_op_hmul_pred_on.sum<br>sm__sass_thread_inst_executed_op_hfma_pred_on.sum |
| Tensor Core FLOPs | sm__inst_executed_pipe_tensor.sum |
| L1 Cache          | l1tex__t_bytes.sum |
| L2 Cache          | lts__t_bytes.sum |
| HBM               | dram__bytes.sum |

For device memory (or HBM), L2 cache, and L1 cache, the latest Nsight Compute provides a unified byte metric for each of them to facilitate measurement. Note that shared memory transactions are not included in the current L1 metric.

Due to profiling overhead, it is recommended to restrict the number of kernels profiled by Nsight Compute at a time. These metrics can also be collected in separate runs, as long as the execution of the application is deterministic. Also, note that as of 2020.1.0, Nsight Compute serializes multistream execution, so certain performance gains due to kernel overlapping may be overlooked; however, the performance analysis in this paper is still insightful in understanding application performance on a kernel level.

## III. EXPERIMENTAL SETUP

### A. Hardware and Software Configuration

Results presented in this paper are obtained from the Cori supercomputer, and in particular its GPU partition, at the National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory (LBNL). The GPU partition is primarily deployed for GPU porting, benchmarking, and testing efforts in the NERSC Exascale Science Application Program (NESAP). Each node contains two Intel Xeon Gold 6148 Skylake CPUs, 384GiB DDR4 memory, and 8 NVIDIA V100 GPUs. Each GPU has 16GiB of HBM2 memory and 80 SMs, and GPUs on a node are connected to each other in a 'hybrid cube-mesh' topology.

On the software side, we have used the TensorFlow 1 and PyTorch implementations of the climate image segmentation code in [34], and CUDA 10.2.89, cuDNN 7.6.5, Nsight Compute 2020.1.0, Python 3.7, PyTorch 1.5.0, and TensorFlow 1.15.0 for this study.

## B. DeepCAM Benchmark

DeepCAM [34] is a deep learning benchmark extracted from the 2018 Gordon Bell winning project [21], used for detection, classification, and localization of extreme weather patterns in climate images. It has two different implementations, in TensorFlow and PyTorch respectively, with the PyTorch version being selected for the MLPerf [35] HPC benchmark suite. In this paper, we will compare the performance of these two implementations using the methodology presented in Sec. II-B. To ensure a fair comparison, we have tuned the parameters to be as close as possible, for example, the number of layers in the encoder-decoder architecture, layer parameters, optimization algorithms, step rates, batch size, usage of batch norm, and Automatic Mixed Precision (AMP) settings.

The DeepCAM model is a deep neural network for semantic segmentation with an encoder-decoder architecture based on DeepLabv3+ [36]. The encoder is a ResNet-50 network with atrous spatial pyramid pooling.
The decoder is a nine-layer network with convolutional and de-convolutional layers and two skip connections from the input and middle of the encoder.

To profile the code, the profile-from-start option is disabled in Nsight Compute and we use CuPy [37] to explicitly restrict the profiling region to include the iteration loop only. To have relatively stable run time behavior during profiling, we also set up a warm-up loop with 5 iterations before the target profiling loop. We collect only one metric during each execution to minimize the profiling overhead. However, because the metrics then come from separate executions, the TensorFlow runtime auto-tuning can make different algorithmic choices from run to run. To remove this uncertainty, NVIDIA TensorFlow Determinism [38] is employed. If not otherwise stated, the default setting for the TensorFlow DeepCAM implementation is with AMP enabled, and for PyTorch DeepCAM with AMP optimization level O1. The source code and full raw results are available at [34].

#### IV. RESULTS

In this section, we will first apply the Nsight Compute methodology in Sec. II-B on the DeepCAM benchmark and discuss its performance implications. On the following Roofline charts, each kernel is represented by a triplet of open circles (blue for L1, red for L2 and green for HBM), and the circle size is proportional to the kernel's run time. Note that we preset a minimum circle size to make all kernels visible on the plot, and that the real run time difference between large and small kernels can be more significant. In addition, there could be many invocations of the same kernel, and the data presented on these Roofline charts is the aggregation of all invocations of the same kernel. One should expect blue, red, and green circles near the L1, L2, and HBM ceilings respectively to show high memory utilization. Triplets of circles close to each other present a "streaming" data access pattern and indicate poor cache locality.
Circles toward the top right corner show superior performance over the others.

In the following subsections, we will discuss how performance is different in the forward and backward pass in both TensorFlow and PyTorch implementations, and the performance impact of the NVIDIA Automatic Mixed Precision package and the zero-AI kernels. Note that the backward pass for TensorFlow DeepCAM includes both gradient calculation and gradient update, whereas the PyTorch DeepCAM backward pass only includes gradient calculation (with its 'optimizer' being the gradient update step).

## A. The TensorFlow version of DeepCAM

Fig. 3: Hierarchical Roofline of the TensorFlow DeepCAM in the forward pass with default configurations. The dominant kernel (with three largest circles) has very high Tensor Core utilization and consumes 33% of the overall run time.

Fig. 3 shows the hierarchical Roofline of the TensorFlow version of DeepCAM in its forward pass. The main computational kernel, represented by the three large circles under the Tensor Core ceiling, has very high Tensor Core utilization, whereas many of the other kernels either do not use Tensor Core or are bandwidth bound. This major kernel's L1 circle (in blue) slightly overlaps with its L2 circle (in red), indicating a relatively low L1 cache locality; however, the large gap between its L2 and HBM circles demonstrates that L2 cache misses rarely happened and that the kernel benefits from high L2 data locality. As for the rest of the kernels, their L1, L2, and HBM circles are generally close to each other, implying poor data locality across all levels of the memory hierarchy ("streaming" operations).

Fig. 4: Hierarchical Roofline of the TensorFlow DeepCAM in the backward pass with default configurations. There are more compute-intensive kernels than in the forward pass. Collectively they constitute 41.9% of the run time and attain near peak Tensor Core performance.

Fig.
4 shows the corresponding backward pass of the TensorFlow DeepCAM. Instead of the single major kernel seen in the forward pass, two very time-consuming kernels are found in the backward pass calculation. These two kernels both require longer run time than the major kernel in the forward pass (note the circle sizes), which implies that the backward pass has more compute-intensive kernels than the forward pass and is generally more time-consuming. Compared to the few kernels using Tensor Core in the forward pass, more kernels benefit from the Tensor Core pipeline in the backward pass, since they sit above the half-precision peak. Another observation is that more kernel invocations are involved in the backward pass than in the forward pass. Overall, we can conclude that in either the forward or backward pass, the main computational kernels are compute-bound and are highly optimized for the underlying architecture.

## B. The PyTorch version of DeepCAM

Compared to the TensorFlow result (Fig. 3), no dominant kernels (kernel run time significantly larger than the others) can be found in the PyTorch forward pass (Fig. 5). The most time-consuming kernel is located slightly below the single-precision performance peak, and based on the distance between its symbols for the different memory levels, it has better cache utilization than the dominant kernel in TensorFlow (even though it runs on the CUDA Core). In addition, similar to TensorFlow, a large number of trivial kernels are HBM-bound in the PyTorch implementation of DeepCAM.

Fig. 5: Hierarchical Roofline of the PyTorch DeepCAM in the forward pass with default configurations. No single kernel requires significantly longer run time than the others (no extremely large circles).

Fig. 6: Hierarchical Roofline of the PyTorch DeepCAM in its backward pass with default configurations. One can observe the highly compute intensive, but low performing kernel.

Fig.
6 shows the PyTorch DeepCAM performance in the backward pass, with default configurations. Surprisingly, the most time-consuming kernel does not utilize Tensor Core and delivers only about 1 TFLOP/s performance. However, this implementation's overall run time is still lower than that of the TensorFlow case, as seen from the size of the circles, thanks to optimizations in other kernels or in the overall execution of kernels.

Compared to TensorFlow, PyTorch has more flexibility when profiling the model, and the 'optimizer' step can be easily separated from the gradient calculation step in the backpropagation. The optimization step mainly updates model parameters with newly calculated gradients and is usually low on arithmetic intensity. Fig. 7 confirms this, where all the 'optimizer' kernels are memory-bound and have a much lower FLOP/s performance than some of the kernels in Fig. 5 or Fig. 6. It should be noted that there are 2709 kernel invocations involved in this process, even though there are only a few circles visible. These kernel invocations have very similar arithmetic intensity and performance, and are thus overlapping.

Fig. 7: Hierarchical Roofline of the PyTorch DeepCAM in its 'optimizer' step. The gradient update step consists of numerous streaming operations and has poor arithmetic intensity and FLOP/s performance.

Fig. 8: Hierarchical Roofline of our FP16 implementation of DeepCAM in TensorFlow (backward pass). AMP (shown in Fig. 4) can deliver the same performance without manual type conversion.

### C. Automatic Mixed Precision

The Automatic Mixed Precision (AMP) package developed at NVIDIA is dedicated to accelerating deep learning processes by partially converting single-precision data to half-precision to reduce data movement and improve computational throughput. It allows for automatic type conversion of certain model parameters and also implements schemes such as loss scaling to ensure numerical correctness and accuracy.
We have implemented an FP16 version of DeepCAM in TensorFlow manually, by picking out the appropriate variables by hand and typecasting them explicitly. Fig. 8 shows that the backward pass performance of this implementation is very close to that of the FP32 DeepCAM with AMP enabled (shown in Fig. 4), demonstrating that even without knowledge of the implementation details of the network, the AMP package can effectively apply type conversion and leverage lower-precision operations for performance.

AMP provides implementations for both TensorFlow and PyTorch, and for PyTorch, there are more detailed optimization levels, rather than just on or off. According to the AMP documentation [39], the O0 level for PyTorch is used to establish a stable baseline for the automatic mixed-precision acceleration; O1 follows a conservative type conversion in which numerical properties are highly preserved; O2, however, implements a more aggressive FP32 to FP16 conversion, and extra care needs to be taken regarding model convergence. Our default setting is O1, and the backward pass performance of the PyTorch DeepCAM with this setting is shown in Fig. 6. From the O0 optimization level in Fig. 9 to O1 in Fig. 6, kernel run time has been largely reduced and many kernels have been moved to execute on the Tensor Core, providing a much higher computational throughput and demonstrating the effectiveness of the O1 optimization level.

Fig. 9: Hierarchical Roofline of the PyTorch DeepCAM in its backward pass with AMP O0.

### D. Zero-AI kernels

Compared to traditional HPC applications where users usually have full control of kernel invocations, high-level Python-based deep learning frameworks tend to implicitly invoke many subsidiary kernels, either for data conversion or device-host transfer purposes. TABLE III shows the ratio of these kernel invocations to the total number of invocations.
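As a small illustration of how such ratios are obtained, the Python sketch below classifies a kernel as zero-AI when its measured FLOP count is zero and computes the zero-AI share of invocations. The helper names are hypothetical; the invocation counts are the TensorFlow DeepCAM values reported in TABLE III.

```python
def is_zero_ai(total_flops):
    """A kernel is zero-AI when it performs no floating-point operations."""
    return total_flops == 0


def zero_ai_share(zero_ai_invocations, non_zero_ai_invocations):
    """Percentage of kernel invocations that are zero-AI."""
    total = zero_ai_invocations + non_zero_ai_invocations
    if total == 0:
        raise ValueError("no kernel invocations recorded")
    return 100.0 * zero_ai_invocations / total


if __name__ == "__main__":
    # (zero-AI, non zero-AI) invocation counts for TensorFlow DeepCAM, TABLE III.
    tensorflow_counts = {"forward": (304, 252), "backward": (1833, 2740)}
    for phase, (zero, non_zero) in tensorflow_counts.items():
        print(f"{phase}: {zero_ai_share(zero, non_zero):.1f}% zero-AI invocations")
```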
Around 40-50% of the invocations are for such zero-AI kernels, where no floating-point operation is performed. This may not affect the overall performance much if these kernels are perfectly overlapped with other kernel executions, but that is very hard to achieve in practice. As hardware constantly evolves, new computer architectures tend to provide higher and higher FLOP/s performance and bandwidth, but with less progressive improvement on kernel launch overhead. To avoid becoming overhead-bound, it is recommended that these deep learning applications avoid such "implicit" zero-AI kernels as much as possible by fusing them or overlapping them with the non-zero-AI kernels.

TABLE III: Zero-AI kernel invocations in TensorFlow DeepCAM and PyTorch DeepCAM

| TensorFlow DeepCAM | Forward     | Backward<sup>a</sup> | Total |
|--------------------|-------------|----------------------|-------|
| zero-AI            | 304 (54.7%) | 1833 (40.1%)         | 2137  |
| non zero-AI        | 252 (45.3%) | 2740 (59.9%)         | 2992  |
| Total              | 556 (100%)  | 4573 (100%)          | 5129  |

| PyTorch DeepCAM | Forward     | Backward    | Optimizer | Total |
|-----------------|-------------|-------------|-----------|-------|
| zero-AI         | 437 (54.8%) | 609 (38.7%) | 0 (0%)    | 1046  |

<sup>a</sup> This includes both gradient calculation and update, i.e. the backward pass and optimizer in the PyTorch case.

## E. Overall Performance

Despite minor differences in implementation (even though we have tried to make an apples-to-apples comparison), the two codes, TensorFlow DeepCAM and PyTorch DeepCAM, have achieved similar run time and convergence performance. The previous subsections presented a deep analysis of these two implementations on hierarchical Roofline, and it is discovered that TensorFlow tends to utilize Tensor Core more, compared to PyTorch, as seen by the locations of the most time-consuming kernels in Fig. 3-6.
The two frameworks have similar cache utilization patterns at the L1, L2 and HBM levels, with PyTorch having slightly more high-AI kernels scattered in the range between 100 FLOPs/Byte and 1000 FLOPs/Byte in Fig. 5 and Fig. 6. Overall, similar numbers of kernels are launched in TensorFlow DeepCAM and PyTorch DeepCAM, with TensorFlow using more than twice as many zero-AI kernels as PyTorch, 2137 versus 1046 in TABLE III. These zero-AI kernels may have been launched over multiple streams and overlapped with computational kernels; however, reducing them could further reduce the launch overhead and overall run time. These kernels are mostly used for converting data from one precision to another, or for rearranging data layout. They may be fused or done on the host (asynchronously to the GPU computation) in order to save run time. Another note is that the NVIDIA AMP package has proven to be very effective, as shown by the comparison of Fig. 4 and Fig. 8 for TensorFlow, and of Fig. 6 and Fig. 9 for PyTorch.

## V. CONCLUSIONS

In this paper, we first revisited the need for mixed-precision performance analysis and extended ERT to incorporate single-precision, half-precision, and Tensor Core performance measurements. Then, building on the previous nvprof-based hierarchical Roofline methodology, we established a new Nsight Compute methodology to collect Roofline data on NVIDIA GPUs. In the third part of this paper, we applied this new methodology to a representative real-life deep learning benchmark, DeepCAM, with its two implementations in TensorFlow and PyTorch. Results show that this new methodology is very effective in analyzing and better understanding the performance of deep learning applications. Useful performance insights are discussed, for example, the computational characteristics of different stages of the training process, and the performance impact of the automatic mixed precision (AMP) package and of zero-AI kernels.
This should be largely helpful to deep learning programmers and framework developers, as it captures data locality within each level of the cache hierarchy, demonstrates overall hardware utilization, and indicates potential optimization efforts (e.g., eliminating zero-AI kernels to minimize kernel launch latency and improve the overall FLOP rate). In the future, we would like to extend the current Nsight Compute methodology to incorporate cross-node performance analysis. New methodologies for alternate architectures and mixed-precision performance ceilings in Roofline will be investigated as well.

#### ACKNOWLEDGEMENTS

This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center (NERSC) which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We thank NVIDIA Corporation for their willingness to answer our myriad questions on Nsight metrics.

#### REFERENCES

- [1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for floating-point programs and multicore architectures," Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), Tech. Rep., 2009.
- [2] "Empirical Roofline Toolkit (ERT)," accessed: 2020-08-01. [Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/
- [3] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability," in 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE, 2018, pp. 14–23.
- [4] NERSC Roofline Model Documentation. [Online]. Available: https://docs.nersc.gov/development/performance-debugging-tools/roofline/
- [5] C. Yang, B. Friesen, T. Kurth, B. Cook, and S. Williams, "Toward Automated Application Profiling on Cray Systems," in *Cray User Group Conference (CUG)*, 2018.
- [6] J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker, Y. Wang, C. Yang, and S. Williams, "Timemory: Modular Performance Analysis for HPC," in *International Conference on High Performance Computing*. Springer, 2020, pp. 434–452.
- [7] C. Yang, T. Kurth, and S. Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System," *Concurrency and Computation: Practice and Experience*, p. e5547, 2019. [Online]. Available: https://doi.org/10.1002/cpe.5547
- [8] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth, M. Lobet, T. Malas, J.-L. Vay, and H. Vincenti, "Applying the roofline performance model to the intel xeon phi knights landing processor," in *International Conference on High Performance Computing*. Springer, 2016, pp. 339–353.
- [9] T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization," in *International Conference on High Performance Computing*. Springer, 2018, pp. 226–245.
- [10] M. Del Ben, C. Yang, S. Louie, and J. Deslippe, "Accelerating Large-Scale GW Calculations on Hybrid GPU-CPU Systems," *Bulletin of the American Physical Society*, vol. 65, 2020.
- [11] R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, "A Case Study For Performance Portability Using OpenMP 4.5," in *International Workshop on Accelerator Programming Using Directives*. Springer, 2018, pp. 75–95.
- [12] C. Yang. 8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks. [Online]. Available: https://arxiv.org/abs/2008.11326
- [13] M. H. Javed, K. Z. Ibrahim, and X. Lu, "Performance analysis of deep learning workloads using roofline trajectories," *CCF Transactions on High Performance Computing*, vol. 1, no. 3, pp. 224–239, 2019.
- [14] Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, "Time-Based Roofline for Deep Learning Performance Analysis," in 2020 IEEE/ACM Deep Learning on Supercomputers Workshop, 2020. [Online]. Available: https://arxiv.org/abs/2009.04598
- [15] N. Ding and S. Williams, "An Instruction Roofline Model for GPUs," in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE, 2019, pp. 7–18.
- [16] K. Z. Ibrahim, S. Williams, and L. Oliker, "Performance Analysis of GPU Programming Models Using the Roofline Scaling Trajectories," in International Symposium on Benchmarking, Measuring and Optimization. Springer, 2019, pp. 3–19.
- [17] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, "A Roofline Model of Energy," in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013, pp. 661–672.
- [18] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, "Exploring GPU Performance, Power And Energy-Efficiency Bounds with Cache-aware Roofline Modeling," in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017, pp. 259–268.
- [19] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
- [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [21] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica et al., "Exascale deep learning for climate analytics," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 649–660.
- [22] W. Joubert, D. Weighill, D. Kainer, S. Climer, A. Justice, K. Fagnan, and D. Jacobson, "Attacking the opioid epidemic: Determining the epistatic and pleiotropic genetic architectures for chronic pain and opioid addiction," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 717–730.
- [23] "Nsight compute cli metric comparison," accessed: 2020-08-01. [Online]. Available: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-cd
- [24] T. NVIDIA, "V100 gpu architecture. the world's most advanced data center gpu. version wp-08608-001\_v1.1," NVIDIA. Aug, p. 108, 2017.
- [25] CUDA C++ wmma API. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
- [26] CUDA cuBLAS Library. [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html
- [27] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cudnn: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
- [28] "Profiler user's guide," accessed: 2020-08-01. [Online]. Available: https://docs.nvidia.com/cuda/profiler-users-guide/
- [29] "Nvidia developer tools overview," accessed: 2020-08-01. [Online]. Available: https://developer.nvidia.com/tools-overview
- [30] "Roofline Methodology on NVIDIA GPUs." [Online]. Available: https://gitlab.com/NERSC/roofline-on-nvidia-gpus
- [31] C. Yang. Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs. [Online]. Available: https://arxiv.org/abs/2009.02449
- [32] NVIDIA CUPTI API reference guide. [Online]. Available: https://docs.nvidia.com/cupti/Cupti/
- [33] PerfWorks measurement library for Nsight Compute. [Online]. Available: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-structure
- [34] Deep Learning Climate Segmentation Benchmark. [Online]. Available: https://github.com/cyanguwa/DeepLearningProfiling
- [35] MLPerf Benchmark. [Online]. Available: https://mlperf.org
- [36] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in *Proceedings of the European Conference on Computer Vision (ECCV)*, September 2018.
- [37] R. Okuta, Y. Unno, D. Nishino, S. Hido, and C. Loomis, "Cupy: A numpy-compatible library for nvidia gpu calculations," in *Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS)*, 2017.
- [38] Deterministic Profiling for TensorFlow. [Online]. Available: https://github.com/NVIDIA/tensorflow-determinism
- [39] "apex.amp," accessed: 2020-08-01. [Online]. Available: https://nvidia.github.io/apex/amp