Qilin: Enabling Performance Analysis and Optimization of Shared-Virtual Memory Systems with FPGA Accelerators

Authors: Edward Richter, Deming Chen

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

Article No.: 23, Pages 1 - 9

https://doi.org/10.1145/3508352.3549431

Published: 22 December 2022

Abstract

While the tight integration of components in heterogeneous systems has increased the popularity of the Shared-Virtual Memory (SVM) system programming model, the overhead of SVM can significantly impact end-to-end application performance. However, studying SVM implementations is difficult, as there is no open and flexible system to explore trade-offs between different SVM implementations and the SVM design space is not clearly defined. To this end, we present Qilin, the first open-source system that enables thorough study of SVM in heterogeneous computing environments for discrete accelerators. Qilin is a transparent and flexible system built on top of an open-source FPGA shell, which allows researchers to alter components of the underlying SVM implementation to understand how SVM design decisions impact performance. Using Qilin, we perform an extensive quantitative analysis of the overheads of three SVM architectures, and generate several insights which highlight the costs and benefits of each architecture. From these insights, we propose a flowchart for choosing the best SVM implementation given the application characteristics and the SVM capabilities of the system. Qilin also provides application developers with a flexible SVM shell for high-performance virtualized applications. Optimizations enabled by Qilin can reduce the latency of translations by 6.86x compared to an open-source FPGA shell.

Cited By

Guo K, Li D, Luo B, Shen Y, Peng K, Luo N, Dai S, Liang C, Song J, Yang H, Zhang X, Mi Z, Witchel E, Arpaci-Dusseau A, Rossbach C, Keeton K. (2024). VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 541-557. DOI: 10.1145/3694715.3695957. Online publication date: 4-Nov-2024.
https://dl.acm.org/doi/10.1145/3694715.3695957

Published In

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

October 2022, 1467 pages

ISBN: 9781450392174

DOI: 10.1145/3508352

Conference Chair: Tulika Mitra (National University of Singapore)

Program Chairs: Evangeline Young (The Chinese University of Hong Kong), Jinjun Xiong (University at Buffalo (UB))

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

In-Cooperation

IEEE-EDS: Electronic Devices Society
IEEE CAS
IEEE CEDA

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICCAD '22: IEEE/ACM International Conference on Computer-Aided Design

Sponsor: SIGDA

October 30 - November 3, 2022

San Diego, California

Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%

Article Metrics

Total Citations: 1
Total Downloads: 195
Downloads (Last 12 months): 46
Downloads (Last 6 weeks): 7

Reflects downloads up to 28 Feb 2025
