HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs

Published: 11 February 2022

Abstract

To achieve high performance with FPGA-equipped heterogeneous compute systems, it is crucial to co-optimize data placement and compute scheduling to maximize data reuse and bandwidth utilization for both on- and off-chip memory accesses. However, optimizing the data placement for FPGA accelerators is a complex task. One must acquire in-depth knowledge of the target FPGA device and its associated memory system in order to apply a set of advanced optimizations. Even with the latest high-level synthesis (HLS) tools, programmers often have to insert many low-level vendor-specific pragmas and substantially restructure the algorithmic code so that the right data are accessed at the right loop level using the right communication schemes. These code changes can significantly compromise the composability and portability of the original program. To address these challenges, we propose HeteroFlow, an FPGA accelerator programming model that decouples the algorithm specification from optimizations related to orchestrating the placement of data across a customized memory hierarchy. Specifically, we introduce a new primitive named .to(), which provides a unified programming interface for specifying data placement optimizations at different levels of granularity: (1) coarse-grained data placement between host and accelerator, (2) medium-grained kernel-level data placement within an accelerator, and (3) fine-grained data placement within a kernel. We build HeteroFlow on top of the open-source HeteroCL DSL and compilation framework. Experimental results on a set of realistic benchmarks show that programs written in HeteroFlow can match the performance of extensively optimized manual HLS designs with far fewer lines of code.
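
The decoupling described in the abstract can be illustrated with a small toy model. The sketch below is not the HeteroFlow or HeteroCL API: the `ToySchedule` class, the `Placement` record, the granularity labels, and the destination strings are all hypothetical names chosen for illustration. It only shows the general idea that the algorithm is written once, while data placement decisions at the three granularities are recorded separately through a single `.to()`-style call.

```python
from dataclasses import dataclass, field


@dataclass
class Placement:
    """One recorded placement decision (hypothetical record type)."""
    tensor: str        # name of the data being placed
    dest: str          # where it should live or flow to
    granularity: str   # which of the three levels this decision targets


@dataclass
class ToySchedule:
    """Toy stand-in for a schedule object; not the real HeteroFlow API."""
    placements: list = field(default_factory=list)

    def to(self, tensor, dest, granularity):
        # Record the decision only; the algorithm code is never edited.
        self.placements.append(Placement(tensor, dest, granularity))
        return self


def algorithm(image):
    """Algorithm specification, written with no placement details."""
    halved = [x // 2 for x in image]     # first stage
    return [x + 1 for x in halved]       # second stage


# Placement is specified separately, at three levels of granularity.
s = ToySchedule()
s.to("image", "accelerator off-chip memory", "host-accelerator")  # coarse
s.to("halved", "FIFO to second stage", "inter-kernel")            # medium
s.to("x", "on-chip registers", "intra-kernel")                    # fine

result = algorithm([2, 4, 6])
```

In this toy, changing or removing a `.to()` call alters only the recorded placement list and leaves `algorithm` untouched, which is the property the abstract refers to as decoupling.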

Supplementary Material

MP4 File (FPGA22-fp217a.mp4)
Presentation video - full version

Cited By

  • (2025) Latency Insensitivity Testing for Dataflow HLS Designs. Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 199-210. DOI: 10.1145/3706628.3708872. Online publication date: 27-Feb-2025
  • (2025) ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines. Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 92-102. DOI: 10.1145/3706628.3708870. Online publication date: 27-Feb-2025
  • (2024) Allo: A Programming Model for Composable Accelerator Design. Proceedings of the ACM on Programming Languages 8:PLDI, pp. 593-620. DOI: 10.1145/3656401. Online publication date: 20-Jun-2024

      Published In

      FPGA '22: Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2022
      211 pages
      ISBN: 9781450391498
      DOI: 10.1145/3490422
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. decoupled data placement
      2. dsl
      3. programming model

      Qualifiers

      • Research-article

      Conference

      FPGA '22

      Acceptance Rates

      Overall Acceptance Rate 125 of 627 submissions, 20%
