HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Authors:

Zhiru Zhang

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 242 - 251

https://doi.org/10.1145/3289602.3293910

Published: 20 February 2019

Abstract

In the pursuit of higher compute performance under strict power constraints, there is an increasing need to deploy applications on heterogeneous hardware architectures with accelerators such as GPUs and FPGAs. Although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially when they include FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and an FPGA-targeted compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization: compute, data types, and memory architectures. HeteroCL further captures the interdependence among these customization techniques, allowing programmers to explore performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.
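The decoupling described in the abstract can be illustrated with a toy sketch in plain Python. It does not use the HeteroCL API: the names `dot_product`, `to_fixed`, and `explore` are hypothetical, and simple fixed-point rounding stands in for data type customization. The algorithm is written once, and the number of fractional bits is chosen separately to observe the effect on accuracy.

```python
def dot_product(a, b, quantize=lambda x: x):
    """Algorithm specification: written once, unaware of the number format."""
    acc = 0.0
    for x, y in zip(a, b):
        # Every intermediate value passes through the chosen number format.
        acc = quantize(acc + quantize(x) * quantize(y))
    return acc


def to_fixed(frac_bits):
    """Hypothetical data type customization.

    Returns a function that rounds a value to the nearest multiple of
    2**-frac_bits, mimicking a fixed-point type with that many fractional bits.
    """
    scale = 2 ** frac_bits
    return lambda x: round(x * scale) / scale


def explore(a, b, bit_widths):
    """Sweep bit widths without touching the algorithm code."""
    reference = dot_product(a, b)  # floating-point baseline
    return [
        (bits, abs(dot_product(a, b, to_fixed(bits)) - reference))
        for bits in bit_widths
    ]


if __name__ == "__main__":
    a = [0.1, 0.25, 0.7, 0.33]
    b = [0.9, 0.5, 0.2, 0.61]
    for bits, err in explore(a, b, [2, 4, 8, 16]):
        print(f"{bits:2d} fractional bits -> absolute error {err:.6f}")
```

In this sketch only the argument passed to `explore` changes between design points; `dot_product` stays intact, which is the property the abstract attributes to keeping algorithm code separate from customization.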

Published In

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2019

360 pages

ISBN: 9781450361378

DOI: 10.1145/3289602

General Chair: Kia Bazargan, Univ. of Minnesota, USA

Program Chair: Stephen Neuendorffer, Xilinx, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA
NSF/Intel CAPA Award
DARPA Young Faculty Award

Conference

FPGA '19: The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Sponsor: SIGDA

February 24 - 26, 2019

Seaside, CA, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%
