DOI:10.1145/3352460.3358292

μIR - An intermediate representation for transforming and optimizing the microarchitecture of application accelerators

Published: 12 October 2019

Abstract

Creating high-quality application-specific accelerators requires iterative changes to both algorithm behavior and microarchitecture, which is a tedious and error-prone process. High-Level Synthesis (HLS) tools [5, 10] generate RTL for application accelerators from annotated software. Unfortunately, the generated RTL is challenging to change and optimize. The primary limitation of HLS is that functionality and microarchitecture are conflated in a single language (such as C++). Making changes to the accelerator design may require code restructuring, and microarchitecture optimizations are tied to program correctness.
We propose a generalized intermediate representation for describing accelerator microarchitecture, μIR, and an associated pass framework, μopt. μIR represents the accelerator as a concurrent structural graph in which the components roughly correspond to microarchitecture-level hardware blocks (e.g., function units, network, memory banks). There are two important benefits: (i) it decouples microarchitecture optimizations from algorithm/program optimizations, and (ii) it decouples microarchitecture optimizations from RTL generation. Computer architects express their ideas as a set of iterative transformations of the μIR graph that successively refine the accelerator architecture. The μIR graph is then translated to Chisel while maintaining the execution model and cycle-level performance characteristics. In this paper, we study three broad classes of optimizations: Timing (e.g., pipeline re-timing), Spatial (e.g., compute tiling), and Higher-order Ops (e.g., tensor function units), which deliver between 1.5× and 8× improvement in performance and an overall 5× to 20× speedup compared to a 1 GHz ARM A9. We evaluate the quality of the autogenerated accelerators on an Arria 10 FPGA and under ASIC UMC 28nm technology.
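
To make the idea of refining a structural graph through passes more concrete, the following is a minimal sketch in Python. It is not the μIR or μopt API: the `Graph` class, the `tile_compute` pass, the `run_passes` helper, and the node names (`bank`, `mac`, `out`) are all hypothetical and chosen only for illustration. The sketch assumes that a spatial pass such as compute tiling can be modeled as replacing one function unit with several parallel copies that inherit its connections.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy structural graph: nodes are hardware blocks, edges are connections."""
    nodes: dict = field(default_factory=dict)  # name -> kind, e.g. "function_unit"
    edges: set = field(default_factory=set)    # (source name, destination name)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def connect(self, src, dst):
        self.edges.add((src, dst))


def tile_compute(graph, unit, factor):
    """Hypothetical spatial pass: replace `unit` with `factor` parallel copies.

    Each copy inherits every incoming and outgoing connection of the original,
    so the rest of the graph is left unchanged.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    kind = graph.nodes[unit]
    copies = [f"{unit}_{i}" for i in range(factor)]
    tiled = Graph()
    for name, node_kind in graph.nodes.items():
        if name != unit:
            tiled.add_node(name, node_kind)
    for copy in copies:
        tiled.add_node(copy, kind)
    for src, dst in graph.edges:
        sources = copies if src == unit else [src]
        dests = copies if dst == unit else [dst]
        for s in sources:
            for d in dests:
                tiled.connect(s, d)
    return tiled


def run_passes(graph, passes):
    """Apply each pass in order; every pass maps a graph to a refined graph."""
    for apply_pass in passes:
        graph = apply_pass(graph)
    return graph


def build_example():
    """Assumed example design: memory bank -> function unit -> memory bank."""
    g = Graph()
    g.add_node("bank", "memory_bank")
    g.add_node("mac", "function_unit")
    g.add_node("out", "memory_bank")
    g.connect("bank", "mac")
    g.connect("mac", "out")
    return g


refined = run_passes(build_example(), [lambda g: tile_compute(g, "mac", 2)])
print(sorted(refined.nodes))
print(sorted(refined.edges))
```

Because each pass takes a graph and returns a new one, passes can be chained in any order without touching the description of the algorithm itself, which is the kind of decoupling the abstract describes. A real framework would also carry timing and interface information on nodes and edges, which this sketch omits.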

References

[1]
Catapult High-Level Synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/.
[2]
Enabling rapid design space exploration and prototyping of dnn accelerators. http://pwp.gatech.edu/ece-synergy/wp-content/uploads/sites/332/2019/02/2_NNDataflowAnalysis.pdf.
[3]
MLIR primer: A compiler infrastructure for the end of Moore's law. https://github.com/tensorflow/mlir.
[4]
Specification for the firrtl language. https://github.com/freechipsproject/firrtl/blob/master/spec/spec.pdf.
[5]
Vivado Design Suite. https://www.xilinx.com/products/design-tools/vivado.html.
[6]
Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Computers, 1990.
[7]
Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a scala embedded language. https://github.com/freechipsproject/chisel3.
[8]
David F Bacon, Rodric Rabbah, and Sunil Shukla. Fpga programming for the masses. Communications of the ACM, 56(4):56--63, 2013.
[9]
Mihai Budiu and Seth Copen Goldstein. Pegasus: An efficient intermediate representation. Technical Report CMU-CS-02-107, Carnegie Mellon University, May 2002.
[10]
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. From software to accelerators with LegUp high-level synthesis. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pages 1--9. IEEE, 2013.
[11]
Christopher Celio, Palmer Dabbelt, David A Patterson, and Krste Asanović. The renewed case for the reduced instruction set computer: Avoiding isa bloat with macro-op fusion for risc-v. arXiv preprint arXiv:1607.02318, 2016.
[12]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.
[13]
Jongsok Choi, Stephen Dean Brown, and Jason Helge Anderson. From pthreads to multicore hardware systems in legup high-level synthesis for fpgas. IEEE Trans. VLSI Syst., 25(10), 2017.
[14]
J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-level synthesis for fpgas: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.
[15]
Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. Automated accelerator generation and optimization with composable, parallel and pipeline architecture. 2018.
[16]
David E Culler, Anurag Sah, Klaus E Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine. In Proc. of the 4th ASPLOS, 1991.
[17]
A DeHon, J Adams, M deLorimier, N Kapre, Y Matsuda, H Naeimi, M Vanier, and M Wrighton. Design patterns for reconfigurable computing. In Proc. of the 12th FCCM, 2004.
[18]
Stephen A Edwards. The Challenges of Synthesizing Hardware from C-Like Languages. IEEE Design & Test of Computers, 23(5):375--386, 2006.
[19]
Vladimir Gajinov, Srdjan Stipic, Osman S Unsal, Tim Harris, Eduard Ayguadé, and Adrián Cristal. Supporting stateful tasks in a dataflow graph. In Proc. of PACT, 2012.
[20]
Nithin George, HyoukJoong Lee, David Novo, Tiark Rompf, Kevin J Brown, Arvind K Sujeeth, Martin Odersky, Kunle Olukotun, and Paolo Ienne. Hardware system synthesis from Domain-Specific Languages. In Proc. of FPL, pages 1--8. IEEE, 2014.
[21]
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. Bundled execution of recurring traces for energy-efficient general purpose processing. In Proc. of the 44th MICRO, 2011.
[22]
James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. Darkroom - compiling high-level image processing code into hardware pipelines. ACM Trans. Graph., 33(4):1--11, 2014.
[23]
S Hu, I Kim, M H Lipasti, and J E Smith. An approach for implementing efficient superscalar CISC processors. In Proc. of the 12th HPCA, 2006.
[24]
Adam Izraelevitz, Jack Koenig, Patrick Li, Richard Lin, Angie Wang, Albert Magyar, Donggyu Kim, Colin Schmidt, Chick Markley, Jim Lawson, et al. Reusability is firrtl ground: Hardware construction languages, compiler frameworks, and transformations. In Proceedings of the 36th International Conference on Computer-Aided Design, pages 209--216. IEEE Press, 2017.
[25]
Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically scheduled high-level synthesis. In Proc. of the FPGA, 2018.
[26]
N Kapre and H Patel. Applying Models of Computation to OpenCL Pipes for FPGA Computing. Proceedings of the 5th International Workshop on OpenCL, 2017.
[27]
John Kessenich, Graham Sellers, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 4.5 with SPIR-V. Addison-Wesley Professional, 2016.
[28]
David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A language and compiler for application accelerators. In Proceedings of the PLDI, 2018.
[29]
David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. In Proc. of the 43rd ISCA, pages 115--127, 2016.
[30]
Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. Hpvm: Heterogeneous parallel virtual machine. In Proc. of the 23rd PPOPP, 2018.
[31]
Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. Heterocl: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proc. of FPGA, 2019.
[32]
Maysam Lavasani. Generating irregular data-stream accelerators: methodology and applications. PhD thesis, 2015.
[33]
Chris Leary and Todd Wang. Xla: Tensorflow, compiled! TensorFlow Dev Summit, Feb 2017.
[34]
Charles E Leiserson. The Cilk++ concurrency platform. The Journal of Supercomputing, 51(3):244--257, 2010.
[35]
Derek Lockhart, Gary Zibrat, and Christopher Batten. PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. In Proc. of the 47th MICRO, pages 280--292, 2014.
[36]
Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proc. of the 22nd HPCA, 2016.
[37]
Razvan Nane, Vlad Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu-Ting Chen, Hsuan Hsiao, Stephen Dean Brown, Fabrizio Ferrandi, Jason Helge Anderson, and Koen Bertels. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Trans. on CAD of Integrated Circuits and Systems, 35(10):1591--1604, 2016.
[38]
D. H. Noronha, B. Salehpour, and S. J. E. Wilton. LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks. ArXiv e-prints, July 2018.
[39]
Louis-Noel Pouchet and Uday Bondugula. Polybench 3.2. 2013. http://www.cse.ohio-state.edu/~pouchet/software/polybench.
[40]
Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating Configurable Hardware from Parallel Patterns. In Proc. of the 21st ASPLOS, 2016.
[41]
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proc. of the 44th ISCA, 2017.
[42]
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. Programming Heterogeneous Systems from an Image Processing DSL. TACO, 14(3):1--25, 2017.
[43]
Andrew Putnam. FPGAs in the Datacenter - Combining the Worlds of Hardware and Software Development. ACM Great Lakes Symposium on VLSI, 2017.
[44]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P Amarasinghe. Halide - a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proc. of PLDI, 2013.
[45]
Brandon Reagen, Robert Adolf, Sophia Yakun Shao, Gu-Yeon Wei, and David Brooks. Machsuite: Benchmarks for accelerator design and customized architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.
[46]
Oliver Reiche, Moritz Schmid, Frank Hannig, Richard Membarth, and Jürgen Teich. Code generation from a domain-specific language for C-based HLS of hardware accelerators. In Proc. of CODES+ISSS, pages 1--10, New York, New York, USA, 2014. ACM Press.
[47]
Hongbo Rong. Programmatic Control of a Compiler for Generating High-performance Spatial Hardware. In arXiv.org, November 2017.
[48]
Sameer D. Sahasrabuddhe, Sreenivas Subramanian, Kunal P. Ghosh, Kavi Arya, and Madhav P. Desai. A c-to-rtl flow as an energy efficient alternative to embedded processors in digital systems. In 13th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, DSD 2010, 1-3 September 2010, Lille, France, 2010.
[49]
Tao B Schardl, William S Moses, and Charles E Leiserson. Tapir - Embedding Fork-Join Parallelism into LLVM's Intermediate Representation. In Proc. of PPOPP, 2017.
[50]
Prakalp Srivastava, Rakesh Komuravelli, Sarita Adve, Maria Kotsifakou, Matthew D Sinclair, and Vikram Adve. HPVM: heterogeneous parallel virtual machine. In Proc. of ACM SIGPLAN Notices, pages 68--80. ACM, March 2018.
[51]
James Stanier and Des Watson. Intermediate representations in imperative compilers: A survey. ACM Comput. Surv.
[52]
Arvind K Sujeeth, Kevin J Brown, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. Delite - A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embedded Comput. Syst., 2014.
[53]
Richard Townsend, Martha A Kim, and Stephen A Edwards. From functional programs to pipelined dataflow circuits. In Proceedings of the 26th International Conference on Compiler Construction, pages 76--86. ACM, 2017.
[54]
Ken Traub, James Hicks, and Shail Aditya. A dataflow compiler substrate. 1991.
[55]
Hasitha Muthumala Waidyasooriya, Masanori Hariyama, and Kunio Uchiyama. FPGA-Oriented Parallel Programming. In Design of FPGA-Based Computing Systems with OpenCL. October 2017.
[56]
Ali Mustafa Zaidi and David Greaves. A new dataflow compiler ir for accelerating control-intensive code in spatial hardware. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014.
[57]
Sizhuo Zhang, Andrew Wright, Thomas Bourgeat, and Arvind. Composable building blocks to open up processor design. In Proc. of the 51st MICRO, 2018.

Cited By

  • (2024) Unifying Static and Dynamic Intermediate Languages for Accelerator Generators. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (2242-2267). DOI:10.1145/3689790. Online publication date: 8-Oct-2024
  • (2024) Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4 (1177-1190). DOI:10.1109/TCAD.2023.3337208. Online publication date: Apr-2024
  • (2024) Hestia: An Efficient Cross-Level Debugger for High-Level Synthesis. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (765-779). DOI:10.1109/MICRO61859.2024.00062. Online publication date: 2-Nov-2024

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
October 2019
1104 pages
ISBN:9781450369381
DOI:10.1145/3352460

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '52

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 90
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures. IEEE Computer Architecture Letters 23:1 (99-103). DOI:10.1109/LCA.2023.3342130. Online publication date: Jan-2024
  • (2023) HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (189-201). DOI:10.1145/3623278.3624767. Online publication date: 25-Mar-2023
  • (2023) Simulator Independent Coverage for RTL Hardware Languages. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (606-615). DOI:10.1145/3582016.3582019. Online publication date: 25-Mar-2023
  • (2023) A Multi-threaded Fast Hardware Compiler for HDLs. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (25-36). DOI:10.1145/3578360.3580254. Online publication date: 17-Feb-2023
  • (2023) Challenges and Opportunities of Security-Aware EDA. ACM Transactions on Embedded Computing Systems 22:3 (1-34). DOI:10.1145/3576199. Online publication date: 19-Apr-2023
  • (2023) ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (702-717). DOI:10.1145/3575693.3575701. Online publication date: 27-Jan-2023
  • (2022) mu-grind. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (346-358). DOI:10.1145/3559009.3569671. Online publication date: 8-Oct-2022
