DOI: 10.1145/3508352.3549431
research-article

Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

Published: 22 December 2022

Abstract

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. Studying SVM implementations is difficult, however, because no open and flexible system exists for exploring trade-offs between different SVM implementations, and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system that enables thorough study of SVM in heterogeneous computing environments with discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis of the overheads of three SVM architectures and derive several insights that highlight the costs and benefits of each architecture. From these insights, we propose a flowchart for choosing the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers with a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce translation latency by 6.86x compared to an open-source FPGA shell.


Cited By

  • (2024) VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 541-557. DOI: 10.1145/3694715.3695957. Online publication date: 4-Nov-2024


          Published In

          ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design
          October 2022
          1467 pages
          ISBN:9781450392174
          DOI:10.1145/3508352

          Sponsors

          In-Cooperation

          • IEEE-EDS: Electronic Devices Society
          • IEEE CAS
          • IEEE CEDA

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. FPGA
          2. memory management
          3. shared-virtual memory
          4. virtualization

          Qualifiers

          • Research-article

          Conference

          ICCAD '22: IEEE/ACM International Conference on Computer-Aided Design
          October 30 - November 3, 2022
          San Diego, California

          Acceptance Rates

          Overall Acceptance Rate 457 of 1,762 submissions, 26%

          Article Metrics

          • Downloads (last 12 months): 46
          • Downloads (last 6 weeks): 7
          Reflects downloads up to 28 Feb 2025

