DOI: 10.1145/3289602.3293910

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Published: 20 February 2019

Abstract

Driven by the pursuit of higher compute performance under strict power constraints, there is an increasing need to deploy applications on heterogeneous hardware architectures with accelerators such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they remain very difficult to program, especially when they include FPGAs. As a result, their use has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and an FPGA-targeted compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization: compute, data types, and memory architectures. HeteroCL further captures the interdependence among these customization techniques, allowing programmers to explore performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.
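The idea of decoupling the algorithm from data-type customization can be illustrated with a small plain-Python sketch. This is not HeteroCL's API: the names `to_fixed`, `dot`, and the `cast` hook are hypothetical and chosen only for illustration. The algorithm (a dot product) is written once, and different fixed-point precisions are applied from the outside to observe the accuracy side of the trade-off.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest value representable with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def dot(a, b, cast=lambda v: v):
    """Algorithm specification: a plain dot product.

    `cast` is a hypothetical hook standing in for a data-type
    customization; the loop itself never changes.
    """
    acc = cast(0.0)
    for x, y in zip(a, b):
        acc = cast(acc + cast(x) * cast(y))
    return acc


a = [0.1, 0.25, -0.7, 0.9]
b = [0.5, -0.3, 0.2, 0.05]

# Floating-point reference: no customization applied.
reference = dot(a, b)

# Apply several fixed-point precisions without touching `dot`.
errors = {}
for frac_bits in (4, 8, 12):
    approx = dot(a, b, cast=lambda v, n=frac_bits: to_fixed(v, n))
    errors[frac_bits] = abs(approx - reference)
    print(f"{frac_bits:2d} fractional bits: result={approx:.6f}, "
          f"abs error={errors[frac_bits]:.6f}")
```

In this sketch only accuracy is measured; in a real FPGA flow the chosen bit width would also affect area and performance, which is the interdependence the abstract refers to.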

Published In

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2019
360 pages
ISBN:9781450361378
DOI:10.1145/3289602
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler
  2. domain-specific language
  3. fpga
  4. hardware accelerator
  5. high-level synthesis
  6. multi-paradigm programming
  7. python
  8. reconfigurable computing
  9. spatial architecture
  10. stencil
  11. systolic array

Qualifiers

  • Research-article

Funding Sources

  • CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA
  • NSF/Intel CAPA Award
  • DARPA Young Faculty Award

Conference

FPGA '19

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%
