Exploring AMD GPU scheduling details by experimenting with “worst practices”

Abstract

Graphics processing units (GPUs) have been the target of a significant body of recent real-time research, but this research is often hampered by the “black box” nature of GPU hardware and software. Now that one GPU manufacturer, AMD, has embraced an open-source software stack, one may expect an increased amount of real-time research to use AMD GPUs. Reality, however, is more complicated. Without understanding where internal details may differ, researchers have no basis for assuming that observations made using NVIDIA GPUs will continue to hold for AMD GPUs. Additionally, the openness of AMD’s software does not mean that its scheduling behavior is obvious, especially given the sparse, scattered documentation. In this paper, we gather the disparate pieces of documentation into a single coherent source that provides an end-to-end description of how compute work is scheduled on AMD GPUs. To make matters concrete, we begin with a demonstration of how incorrect management triggers extreme worst-case behavior on shared AMD GPUs. We then explain the internal scheduling rules for AMD GPUs, how they led to these “worst practices,” and how to correctly manage some of the most performance-critical factors in AMD GPU sharing.

Notes

  1. We originally learned many of these details in a private conversation with an AMD engineer, to whom we are extremely grateful. This simplified our search for corresponding information in the publicly available material.

  2. NVIDIA introduced partitioning support, known as MIG (multi-instance GPU), in its most recent top-end GPUs (NVIDIA Corporation 2020). MIG is arguably more powerful than AMD’s CU masking, as, unlike CU masks, MIG also allows memory partitioning. Unfortunately, it is unclear if or when MIG will be supported in NVIDIA’s consumer-oriented GPUs.

  3. Our test framework, including code for GPU kernels, is available at https://github.com/yalue/hip_plugin_framework. Scripts and data specific to this paper are available at https://github.com/yalue/rtns2021_figures.

  4. The material in Sect. 4 does not entirely explain this particular anomaly, but the improvement is likely due to a competitor’s presence improving the performance of block-dispatching hardware. Two factors support this assumption. First, MM256 launches four times the number of blocks as MM1024, meaning that speeding up block launches provides a stronger benefit to MM256. For example, the presence of the MM1024 competitor may help keep some hardware components active, but, as we shall see in Sect. 4, it will cause minimal additional contention for resources against MM256. Second, even though MM256’s times are faster in this case than in isolation, it is still not as fast as MM1024 in isolation. This indicates that MM1024 still has some advantage arising from its block configuration, as it is otherwise identical to MM256.

  5. The competitor’s response times are barely impacted by sharing one CU with the measured task. For brevity, we chose to exclude these measurements from Table 1.

  6. This is largely defined in platform/commandqueue.hpp in ROCclr’s source code.

  7. We do not cover memory-transfer requests further in this paper, but they follow the same queuing structure as kernel launches. Ultimately, memory transfers are dispatched to hardware “DMA engines” (Bauman et al. 2019) rather than asynchronous compute engines.

  8. This and related behavior can be observed by examining ROCclr’s source code. For example, the acquireQueue function in https://github.com/ROCm-Developer-Tools/ROCclr/blob/master/device/rocm/rocdevice.cpp implements the functionality for selecting a single HSA queue from the pool of available queues.

  9. When ROCm was first introduced, compute-specific code for AMD GPUs was instead in a separate amdkfd driver. It was merged into the amdgpu driver in version 4.20 of the Linux kernel. While not particularly relevant to this paper’s content, this distinction may be useful when consulting some of the older reference material we cite.

  10. For example, the AqlQueue constructor and related functions in the ROCR-Runtime library’s core/runtime/amd_aql_queue.cpp source file are responsible for much of this low-level logic.

  11. As of Linux 5.14.0-rc3, source code for runlist construction is mostly contained in drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c in the Linux source tree.

  12. This is described in a comment in drivers/gpu/drm/amd/include/kgd_kfd_interface.h in the Linux 5.14.0-rc3 source tree.

  13. This can be confirmed in Linux 5.14 sources by observing where num_pipe_per_mec and num_queues_per_pipe are set in drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c. Note that ACEs are typically called “pipes” in AMD’s source code (Bridgman 2016).

  14. This is configured by the sched_policy parameter to the amdgpu driver, defined in drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c.

  15. This behavior can be observed in the acquireQueue function defined in ROCclr’s device/rocm/rocdevice.cpp source file.

  16. Our only source for this claim remains private correspondence, which indicated that hardware enforces this rule using a “baton-passing” mechanism between the SEs. Despite the lack of additional external support for this claim, it is certainly well supported by our experiments, e.g., Table 1 and Fig. 9.

  17. Unfortunately, we also learned this from private conversation and were not able to find corroborating published material. Nonetheless, this claim is supported by the observations in Fig. 7b.

  18. In the source tree for Linux 5.14.0, this is found in the mqd_symmetrically_map_cu_mask function in drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c.

  19. Specifically, the GPU_MAX_HW_QUEUES variable.

  20. https://github.com/ROCm-Developer-Tools/HIP/tree/rocm-4.2.0.

  21. https://github.com/ROCm-Developer-Tools/ROCclr/tree/rocm-4.2.0.

  22. https://github.com/RadeonOpenCompute/ROCR-Runtime/tree/rocm-4.2.0.

  23. https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/tree/roc-4.2.x.

References

Acknowledgements

Work was supported by NSF Grants CNS 1563845, CNS 1717589, CPS 1837337, CPS 2038855, and CPS 2038960, ARO Grant W911NF-20-1-0237, and ONR Grant N00014-20-1-2698.

Author information

Corresponding author

Correspondence to Nathan Otterness.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Detailed kernel-launch behavior

Section 4.1 attempts to describe ROCm’s kernel-launch behavior in human-readable terms, but researchers building modifications on top of existing ROCm code may benefit from a more detailed description, including specific source-code references. This appendix assumes that the reader is already familiar with the content discussed in Sect. 4. While this material is not necessary for understanding the main body of our paper, we include it as a reference for researchers hoping to modify or work with AMD’s ROCm software stack.

Figure 10 summarizes the kernel-launch process in greater detail. Each component of the flowchart in Fig. 10 contains four pieces of information: the name of a function in ROCm’s source code, a (brief) comment describing the purpose of the function, the ROCm component in which the function is defined, and the specific source-code file within the component containing the function’s definition.

As shown in Fig. 2, there are several ROCm components, all of which are open source. The source code for each relevant component is available online: HIP (Footnote 20), ROCclr (Footnote 21), the HSA runtime (Footnote 22), and the low-level driver interface, ROCT-Thunk-Interface (Footnote 23). This paper is based on ROCm version 4.2, but its description continues to apply to ROCm 4.3 (current at the time of submission), and the relevant behavior has remained relatively stable since ROCm version 3.7.

Figure 10 follows the process outlined in Fig. 3, but with a greater level of detail. As discussed in Sect. 4.1, kernel-launch requests are first enqueued in a userspace C++ HostQueue object, and eventually converted into an AQL packet and inserted into an HSA queue. One new detail shown in Fig. 10 is that the conversion from a HostQueue entry into an AQL packet is carried out by an asynchronous thread, i.e., a thread other than the one that called hipLaunchKernelGGL. While unsurprising, given the asynchronous behavior expected when launching kernels, this may be an important detail when designing real-time systems, as such a thread may block any other thread waiting for kernels to complete.
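
To make this concrete, the following minimal sketch (our own illustration, not code taken from ROCm; the kernel and variable names are hypothetical) launches a single kernel from a HIP stream. The comments mark where, per the description above, the request is recorded in the userspace HostQueue, asynchronously converted into an AQL packet by a worker thread, and finally waited upon by the calling thread.

    #include <hip/hip_runtime.h>

    // Trivial kernel; its body is irrelevant to the launch path being illustrated.
    __global__ void ScaleKernel(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
      const int n = 1 << 20;
      float *device_buffer = nullptr;
      hipMalloc((void **) &device_buffer, n * sizeof(float));
      hipStream_t stream;
      hipStreamCreate(&stream);

      // Returns once the request has been recorded in the stream's userspace
      // HostQueue; a separate ROCclr worker thread later converts the entry
      // into an AQL packet and writes it into the stream's HSA queue.
      hipLaunchKernelGGL(ScaleKernel, dim3((n + 255) / 256), dim3(256), 0,
        stream, device_buffer, n);

      // The calling thread only blocks here; its progress therefore depends
      // on the asynchronous worker thread as well as on the GPU itself.
      hipStreamSynchronize(stream);

      hipStreamDestroy(stream);
      hipFree(device_buffer);
      return 0;
    }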

Curious readers may notice that Fig. 10 does not cover queue creation in detail. Instead, we provide a deeper explanation of queue creation, including the code responsible for assigning queues to GPU hardware, in Appendix 2. Note that the Stream::Create block occurs in both flowcharts, giving an indication of where, if necessary, queue creation will take place in the overall kernel-launch process.

Fig. 10: Overview of ROCm source code involved in HIP kernel launches

Appendix 2: Detailed queue-handling behavior

As with Appendix 1, this appendix is intended for readers already familiar with the material in Sects. 4.1 and 4.2. Figure 11 presents a more detailed view of the ROCm source code required to create queues and assign them to GPU hardware. Once again, this information is intended for researchers hoping to work with AMD’s code, especially those hoping to modify AMD’s GPU queue management at either the user level or within the driver.

As discussed in the main body of the paper, kernel launches on AMD GPUs do not require driver intervention, so Fig. 10 in Appendix 1 did not include any driver code. This is unfortunately not the case for the more complicated logic behind creating queues in the first place, meaning that Fig. 11 must also include portions of AMD’s driver code, located within the Linux kernel. In addition to labeling each box in Fig. 11 with its respective ROCm component, we enclose the driver portions of the flowchart in a dashed rectangle. Unfortunately, the full “File” path for the driver components of the flowchart is too long to fit cleanly in the flowchart, so we instead note here that all of the paths given in the “amdgpu Driver” boxes are located under the drivers/gpu/drm/amd directory in the Linux 5.14 source tree.

When casually reading ROCm’s code, one may initially find it difficult to discern where HSA queues are created. The key entry point is the createVirtualDevice function, called when ROCm starts a new thread to process a HIP stream’s kernel launches. The “virtual device” is, in reality, a C++ interface granting access to a GPU; many virtual devices may be associated with a single underlying GPU.
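
As a hedged illustration of the user-facing side of this process, the sketch below (our own code, not ROCm’s; the stream count is arbitrary) simply creates several HIP streams. Per the description above, each stream is ultimately serviced by a worker thread whose setup reaches createVirtualDevice and acquires an HSA queue via the acquireQueue path referenced in footnotes 8 and 15; as we understand it, the GPU_MAX_HW_QUEUES variable from footnote 19 caps the underlying queue pool, so streams beyond that count end up sharing HSA queues.

    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    int main(void) {
      // Arbitrary count, chosen only for illustration; whether each stream
      // receives a dedicated HSA queue depends on the pool size configured
      // via GPU_MAX_HW_QUEUES (footnote 19).
      const int kStreamCount = 8;
      std::vector<hipStream_t> streams(kStreamCount);
      for (int i = 0; i < kStreamCount; i++) {
        // Each stream is serviced by a worker thread whose setup calls
        // createVirtualDevice and acquires an HSA queue through acquireQueue
        // (footnotes 8 and 15).
        if (hipStreamCreate(&streams[i]) != hipSuccess) {
          fprintf(stderr, "Failed to create stream %d\n", i);
          return 1;
        }
      }
      // ...launch kernels into the individual streams here...
      for (int i = 0; i < kStreamCount; i++) hipStreamDestroy(streams[i]);
      return 0;
    }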

An interesting characteristic of Fig. 11 is the presence of function names with the _cpsch suffix in the driver code. The “cpsch” suffix stands for Command Processor Scheduling, referring to the use of HWS (discussed in Sect. 4.2). Alternative versions of these functions (suffixed _nocpsch) can also be found in the kernel code and are used when HWS is disabled. Internally, the driver alternates between the different versions of such functions by calling them indirectly, through a list of function pointers.
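
The following simplified sketch (written in C++ but mirroring the driver’s C idiom; all names in it are hypothetical rather than taken from the kernel sources) illustrates this dispatch pattern: callers invoke queue operations through a table of function pointers, and the table is selected once based on whether HWS is in use, which in the real driver follows the sched_policy parameter mentioned in footnote 14.

    #include <cstdio>

    // Hypothetical operation table, standing in for the driver's per-mode
    // lists of function pointers.
    struct QueueManagerOps {
      int (*create_queue)(int queue_id);
      int (*destroy_queue)(int queue_id);
    };

    // "cpsch" variants: queue changes go through the HWS firmware scheduler.
    static int create_queue_cpsch(int id) {
      printf("queue %d: added to runlist for HWS\n", id);
      return 0;
    }
    static int destroy_queue_cpsch(int id) {
      printf("queue %d: removed from runlist\n", id);
      return 0;
    }

    // "nocpsch" variants: the driver maps queues to hardware directly.
    static int create_queue_nocpsch(int id) {
      printf("queue %d: mapped to hardware directly\n", id);
      return 0;
    }
    static int destroy_queue_nocpsch(int id) {
      printf("queue %d: unmapped directly\n", id);
      return 0;
    }

    static const QueueManagerOps cpsch_ops = {create_queue_cpsch, destroy_queue_cpsch};
    static const QueueManagerOps nocpsch_ops = {create_queue_nocpsch, destroy_queue_nocpsch};

    int main(void) {
      // In the real driver this choice follows the sched_policy module
      // parameter (footnote 14); here it is simply a constant.
      const bool hws_enabled = true;
      const QueueManagerOps *ops = hws_enabled ? &cpsch_ops : &nocpsch_ops;
      ops->create_queue(0);   // Callers never name a _cpsch/_nocpsch variant directly.
      ops->destroy_queue(0);
      return 0;
    }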

Just as some functions in Fig. 11 depend on whether or not HWS is enabled, other functions change depending on the GPU architecture. These functions are likewise called indirectly, via lists of function pointers. We only include one such architecture-specific function in Fig. 11 (the init_mqd function), but several more are used for still-lower-level details, such as populating the contents of the runlist packet or the unmap-queues request.

Fig. 11: Overview of ROCm source code involved in creating GPU queues

Cite this article

Otterness, N., Anderson, J.H. Exploring AMD GPU scheduling details by experimenting with “worst practices”. Real-Time Syst 58, 105–133 (2022). https://doi.org/10.1007/s11241-022-09381-y
