Abstract
Graphics processing units (GPUs) have been the target of a significant body of recent real-time research, but research is often hampered by the “black box” nature of GPU hardware and software. Now that one GPU manufacturer, AMD, has embraced an open-source software stack, one may expect an increased amount of real-time research to use AMD GPUs. Reality, however, is more complicated. Without understanding where internal details may differ, researchers have no basis for assuming that observations made using NVIDIA GPUs will continue to hold for AMD GPUs. Additionally, the openness of AMD’s software does not mean that their scheduling behavior is obvious, especially due to sparse, scattered documentation. In this paper, we gather the disparate pieces of documentation into a single coherent source that provides an end-to-end description of how compute work is scheduled on AMD GPUs. In doing so, we start with a concrete demonstration of how incorrect management triggers extreme worst-case behavior in shared AMD GPUs. Subsequently, we explain the internal scheduling rules for AMD GPUs, how they led to the “worst practices,” and how to correctly manage some of the most performance-critical factors in AMD GPU sharing.
Notes
We originally learned many of these details in a private conversation with an AMD engineer, to whom we are extremely grateful. This simplified our search for corresponding information in the publicly available material.
NVIDIA introduced partitioning support, known as MIG (multi-instance GPU), in its most recent top-end GPUs (NVIDIA Corporation 2020). MIG is arguably more powerful than AMD’s CU masking, as, unlike CU masks, MIG also allows memory partitioning. Unfortunately, it is unclear if or when MIG will be supported in NVIDIA’s consumer-oriented GPUs.
Our test framework, including code for GPU kernels, is available at https://github.com/yalue/hip_plugin_framework. Scripts and data specific to this paper are available at https://github.com/yalue/rtns2021_figures.
The material in Sect. 4 does not entirely explain this particular anomaly, but the improvement is likely due to a competitor’s presence improving the performance of block-dispatching hardware. Two factors support this assumption. First, MM256 launches four times the number of blocks as MM1024, meaning that speeding up block launches provides a stronger benefit to MM256. For example, the presence of the MM1024 competitor may help keep some hardware components active, but, as we shall see in Sect. 4, it will cause minimal additional contention for resources against MM256. Second, even though MM256’s times are faster in this case than in isolation, it still is not as fast as MM1024 in isolation. This indicates that MM1024 still has some advantage arising from its block configuration, as it is otherwise identical to MM256.
The competitor’s response times are barely impacted by sharing one CU with the measured task. For brevity, we chose to exclude these measurements from Table 1.
This is largely defined in platform/commandqueue.hpp in ROCclr’s source code.
We do not cover memory-transfer requests further in this paper, but they follow the same queuing structure as kernel launches. Ultimately, memory transfers are dispatched to hardware “DMA engines” (Bauman et al. 2019) rather than asynchronous compute engines.
This and related behavior can be observed by examining ROCclr’s source code. For example, the acquireQueue function in https://github.com/ROCm-Developer-Tools/ROCclr/blob/master/device/rocm/rocdevice.cpp implements the functionality for selecting a single HSA queue from the pool of available queues.
When ROCm was first introduced, compute-specific code for AMD GPUs was instead in a separate amdkfd driver. It was merged into the amdgpu driver in version 4.20 of the Linux kernel. While not particularly relevant to this paper’s content, this distinction may be useful when consulting some of the older reference material we cite.
For example, the AqlQueue constructor and related functions in the ROCR-Runtime library’s core/runtime/amd_aql_queue.cpp source file are responsible for much of this low-level logic.
As of Linux 5.14.0-rc3, source code for runlist construction is mostly contained in drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c in the Linux source tree.
This is described in a comment in drivers/gpu/drm/amd/include/kgd_kfd_interface.h in the Linux 5.14.0-rc3 source tree.
This can be confirmed in Linux 5.14 sources by observing where num_pipe_per_mec and num_queues_per_pipe are set in drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c. Note that ACEs are typically called “pipes” in AMD’s source code (Bridgman 2016).
This is configured by the sched_policy parameter to the amdgpu driver, defined in drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c.
This behavior can be observed in the acquireQueue function defined in ROCclr’s device/rocm/rocdevice.cpp source file.
Our only source for this claim remains private correspondence, which indicated that hardware enforces this rule using a “baton-passing” mechanism between the SEs. Despite the lack of additional external support for this claim, it is well-supported by our experiments, e.g., in Table 1 and Fig. 9.
Unfortunately, we also learned this from private conversation and were not able to find corroborating published material. Nonetheless, this claim is supported by the observations in Fig. 7b.
In the source tree for Linux 5.14.0, this is found in the mqd_symmetrically_map_cu_mask function in drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c.
Specifically, the GPU_MAX_HW_QUEUES variable.
References
Aaltonen S (2017) Optimizing GPU occupancy and resource usage with large thread groups. https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
Amert T, Otterness N, Anderson JH et al (2017) GPU scheduling on the NVIDIA TX2: hidden details revealed. In: IEEE real-time systems symposium (RTSS)
Bauman P, Chalmers N, Curtis N et al (2019) Introduction to AMD GPU programming with HIP. Presentation at Oak Ridge National Laboratory. https://www.olcf.ornl.gov/calendar/intro-to-amd-gpu-programming-with-hip/
Bridgman (2016) amdgpu questions. Phoronix Forums. https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/856534-amdgpu-questions?p=857850#post857850. Accessed 2020
Capodieci N, Cavicchioli R, Bertogna M et al (2018) Deadline-based scheduling for GPU with preemption support. In: IEEE real-time systems symposium (RTSS)
Fujii Y, Azumi T, Nishio N et al (2013) Exploring microcontrollers in GPUs. In: Asia-pacific workshop on systems (APSys)
HSA Foundation (2018a) HSA programmer’s reference manual: HSAIL virtual ISA and programming model, compiler writer, and object format (BRIG), version 1.2. http://hsa.glossner.org/wp-content/uploads/2021/02/HSA-PRM-1.2.pdf
HSA Foundation (2018b) HSA runtime programmer’s reference manual, version 1.2. http://hsa.glossner.org/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf
Jain S, Baek I, Wang S et al (2019) Fractional GPUs: software-based compute and memory bandwidth reservation for GPUs. In: IEEE real-time and embedded technology and applications symposium (RTAS)
Jia Z, Maggioni M, Staiger B et al (2018) Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. CoRR abs/1804.06826. http://arxiv.org/abs/1804.06826
Jia Z, Maggioni M, Smith J et al (2019) Dissecting the NVIDIA Turing T4 GPU via microbenchmarking. CoRR abs/1903.07486. http://arxiv.org/abs/1903.07486
Kato S, Lakshmanan K, Rajkumar R et al (2011) TimeGraph: GPU scheduling for real-time multi-tasking environments. In: USENIX ATC
Larabel M (2020) The AMD Radeon graphics driver makes up roughly 10.5% of the Linux kernel. Phoronix.com. https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.9-AMDGPU-Stats
Mei X, Chu X (2016) Dissecting GPU memory hierarchy through microbenchmarking. IEEE Trans Parallel Distrib Syst 28(1):72–86
NVIDIA Corporation (2020) NVIDIA multi-instance GPU user guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
Olmedo IS, Capodieci N, Martínez JL et al (2020) Dissecting the CUDA scheduling hierarchy: a performance and predictability perspective. In: IEEE real-time and embedded technology and applications symposium (RTAS)
Otterness N, Anderson JH (2020) AMD GPUs as an alternative to NVIDIA for supporting real-time workloads. In: Euromicro conference on real-time systems (ECRTS)
Peres M (2013) Reverse engineering power management on NVIDIA GPUs-anatomy of an autonomic-ready system. In: Workshop on operating systems platforms for embedded real-time applications (OSPERT)
Puthoor S, Tang X, Gross J et al (2018) Oversubscribed command queues in GPUs. In: ACM workshop on general purpose GPUs (GPGPU)
Sorensen T, Evrard H, Donaldson AF (2018) GPU schedulers: how fair is fair enough? In: International conference on concurrency theory (CONCUR), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Yang M, Amert T, Yang K et al (2018) Making OpenVX really ‘real time’. In: IEEE real-time systems symposium (RTSS)
Acknowledgements
Work was supported by NSF Grants CNS 1563845, CNS 1717589, CPS 1837337, CPS 2038855, and CPS 2038960, ARO Grant W911NF-20-1-0237, and ONR Grant N00014-20-1-2698.
Appendices
Appendix 1: Detailed kernel-launch behavior
Section 4.1 describes ROCm’s kernel-launch behavior in human-readable terms, but researchers attempting to build modifications on top of existing ROCm code may benefit from a more detailed description, including more specific source-code references. This appendix assumes that the reader is already familiar with the content discussed in Sect. 4. While reading this appendix is not necessary to understand the main body of our paper, we have elected to include it as a reference for researchers hoping to modify or work with AMD’s ROCm software stack.
Figure 10 summarizes the kernel-launch process in greater detail. Each component of the flowchart in Fig. 10 contains four pieces of information: the name of a function in ROCm’s source code, a (brief) comment describing the purpose of the function, the ROCm component in which the function is defined, and the specific source-code file within the component containing the function’s definition.
As shown in Fig. 2, there are several ROCm components, all of which are open source. The source code for each relevant component is available online: HIP,Footnote 20 ROCclr,Footnote 21 the HSA runtime,Footnote 22 and the low-level driver interface (ROCT-Thunk-Interface).Footnote 23 This paper is based on ROCm version 4.2, but its content continues to apply to ROCm 4.3 (current at the time of submission) and has remained relatively stable since ROCm version 3.7.
Figure 10 follows the process outlined in Fig. 3, but with a greater level of detail. As discussed in Sect. 4.1, kernel-launch requests are first enqueued in a userspace C++ HostQueue object, and eventually converted into an AQL packet and inserted into an HSA queue. One new detail shown in Fig. 10 is that the conversion from a HostQueue entry into an AQL packet is carried out by an asynchronous thread, i.e., a thread other than the one that called hipLaunchKernelGGL. While unsurprising, given the asynchronous behavior expected when launching kernels, this may be an important detail when designing real-time systems, as such a thread may block any other thread waiting for kernels to complete.
Curious readers may notice that Fig. 10 does not cover queue creation in detail. Instead, we provide a deeper explanation of queue creation, including the code responsible for assigning queues to GPU hardware, in Appendix 2. Note that the Stream::Create block occurs in both flowcharts, giving an indication of where, if necessary, queue creation will take place in the overall kernel-launch process.
Appendix 2: Detailed queue-handling behavior
As with Appendix 1, this appendix is intended for readers already familiar with the material in Sects. 4.1 and 4.2. Figure 11 presents a more detailed view of the ROCm source code required to create queues and assign them to GPU hardware. Once again, this information is intended for researchers hoping to work with AMD’s code, especially those hoping to modify AMD’s GPU queue management at either the user level or within the driver.
As discussed in the main body of the paper, kernel launches on AMD GPUs do not require driver intervention, so Fig. 10, in Appendix 1, did not include any driver code. This is unfortunately not the case for the more complicated logic behind creating queues in the first place, meaning that Fig. 11 must also include portions of AMD’s driver code, located within the Linux kernel. In addition to labeling every individual component in Fig. 11 with its respective ROCm component, we enclose the driver portions of the flowchart in a dashed rectangle. Unfortunately, the full “File” path for the driver components of the flowchart is too long to fit cleanly in the flowchart, so we instead note here that all of the paths given in the “amdgpu Driver” boxes are located under the drivers/gpu/drm/amd directory in the Linux 5.14 source tree.
When casually observing ROCm’s code, it may initially be difficult to discern where HSA queues are created. The key point is the createVirtualDevice function, called when ROCm starts a new thread to process a HIP stream’s kernel launches. The “virtual device” is, in reality, a C++ interface granting access to a GPU; many virtual devices may be associated with a single underlying GPU.
An interesting characteristic of Fig. 11 is the presence of the function names with the _cpsch postfix in the driver code. The “cpsch” postfix stands for Command Processor Scheduling, referring to the use of HWS (discussed in Sect. 4.2). Alternative versions of these functions (postfixed _nocpsch) can also be found in kernel code, and are used when HWS is disabled. Internally, the driver is able to alternate between different versions of such functions by calling them indirectly, using a list of function pointers.
Similarly to how some functions in Fig. 11 depend on whether or not HWS is enabled, some functions may change depending on GPU architecture. These functions are likewise called indirectly, via lists of function pointers. We only include one such architecture-specific function in Fig. 11 (the init_mqd function), but several more are used for still-lower-level details, such as populating the contents of the runlist packet or the unmap-queues request.
Cite this article
Otterness, N., Anderson, J.H. Exploring AMD GPU scheduling details by experimenting with “worst practices”. Real-Time Syst 58, 105–133 (2022). https://doi.org/10.1007/s11241-022-09381-y