μIR - An intermediate representation for transforming and optimizing the microarchitecture of application accelerators

Authors:

Amirali Sharifian,

Arrvindh Shriraman

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 940 - 953

https://doi.org/10.1145/3352460.3358292

Published: 12 October 2019

Abstract

Creating high-quality application-specific accelerators requires iterative changes to both algorithm behavior and microarchitecture, a tedious and error-prone process. High-Level Synthesis (HLS) tools [5, 10] generate RTL for application accelerators from annotated software. Unfortunately, the generated RTL is challenging to change and optimize. The primary limitation of HLS is that functionality and microarchitecture are conflated in a single language (such as C++). Changing the accelerator design may require restructuring the code, and microarchitecture optimizations are tied to program correctness.

We propose a generalized intermediate representation for describing accelerator microarchitecture, μIR, and an associated pass framework, μopt. μIR represents the accelerator as a concurrent structural graph in which the components roughly correspond to microarchitecture-level hardware blocks (e.g., function units, network, memory banks). This has two important benefits: (i) it decouples microarchitecture optimizations from algorithm/program optimizations, and (ii) it decouples microarchitecture optimizations from RTL generation. Computer architects express their ideas as a set of iterative transformations of the μIR graph that successively refine the accelerator architecture. The μIR graph is then translated to Chisel while maintaining the execution model and cycle-level performance characteristics. In this paper, we study three broad classes of optimizations: Timing (e.g., pipeline re-timing), Spatial (e.g., compute tiling), and Higher-order Ops (e.g., tensor function units), which deliver between 1.5× and 8× improvement in performance, for an overall 5× to 20× speedup compared to an ARM A9 at 1 GHz. We evaluate the quality of the autogenerated accelerators on an Arria 10 FPGA and under ASIC UMC 28nm technology.
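To make the idea of a structural graph refined by passes concrete, here is a minimal Python sketch. It is not the paper's μIR or μopt API: the `Node` and `Graph` classes, the `tile_compute` pass, and the `run_passes` driver are hypothetical names invented for illustration, and the "compute tiling" shown is only a naive replication of one function unit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # hypothetical labels, e.g. "function_unit", "memory_bank"


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # (src_name, dst_name) pairs

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def connect(self, src, dst):
        self.edges.add((src, dst))


def tile_compute(graph, target, copies):
    """Hypothetical spatial pass: replicate node `target` `copies` times.

    Returns a new graph; the input graph is left untouched.
    """
    def names(n):
        # The tiled node expands into several names; all others keep theirs.
        return [f"{n}_{i}" for i in range(copies)] if n == target else [n]

    out = Graph()
    for name, node in graph.nodes.items():
        for new_name in names(name):
            out.add_node(new_name, node.kind)
    for src, dst in graph.edges:
        for s in names(src):
            for d in names(dst):
                out.connect(s, d)
    return out


def run_passes(graph, passes):
    """Apply graph-to-graph transformations one after another."""
    for p in passes:
        graph = p(graph)
    return graph


# A three-block accelerator: memory -> multiply-accumulate unit -> memory.
g = Graph()
g.add_node("mem", "memory_bank")
g.add_node("mac", "function_unit")
g.add_node("out", "memory_bank")
g.connect("mem", "mac")
g.connect("mac", "out")

tiled = run_passes(g, [lambda gr: tile_compute(gr, "mac", 2)])
print(sorted(tiled.nodes))
print(sorted(tiled.edges))
```

Because each pass takes a graph and returns a new one, passes can be reordered or dropped without touching the description of the algorithm, which is the kind of decoupling the abstract argues for.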

References

[1]

Catapult High-Level Synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/.

[2]

Enabling rapid design space exploration and prototyping of dnn accelerators. http://pwp.gatech.edu/ece-synergy/wp-content/uploads/sites/332/2019/02/2_NNDataflowAnalysis.pdf.

[3]

MLIR primer: A compiler infrastructure for the end of Moore's law. https://github.com/tensorflow/mlir.

[4]

Specification for the firrtl language. https://github.com/freechipsproject/firrtl/blob/master/spec/spec.pdf.

[5]

Vivado Design Suite. https://www.xilinx.com/products/design-tools/vivado.html.

[6]

Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Computers, 1990.

[7]

Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a scala embedded language. https://github.com/freechipsproject/chisel3.

[8]

David F Bacon, Rodric Rabbah, and Sunil Shukla. FPGA programming for the masses. Communications of the ACM, 56(4):56--63, 2013.

[9]

Mihai Budiu and Seth Copen Goldstein. Pegasus: An efficient intermediate representation. Technical Report CMU-CS-02-107, Carnegie Mellon University, May 2002.

[10]

Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. From software to accelerators with LegUp high-level synthesis. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), pages 1--9. IEEE, 2013.

[11]

Christopher Celio, Palmer Dabbelt, David A Patterson, and Krste Asanović. The renewed case for the reduced instruction set computer: Avoiding isa bloat with macro-op fusion for risc-v. arXiv preprint arXiv:1607.02318, 2016.

[12]

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.

[13]

Jongsok Choi, Stephen Dean Brown, and Jason Helge Anderson. From pthreads to multicore hardware systems in legup high-level synthesis for fpgas. IEEE Trans. VLSI Syst., 25(10), 2017.

[14]

J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-level synthesis for fpgas: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.

[15]

Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. Automated accelerator generation and optimization with composable, parallel and pipeline architecture. 2018.

[16]

David E Culler, Anurag Sah, Klaus E Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine. In Proc. of the 4th ASPLOS, 1991.

[17]

A DeHon, J Adams, M deLorimier, N Kapre, Y Matsuda, H Naeimi, M Vanier, and M Wrighton. Design patterns for reconfigurable computing. In Proc. of the 12th FCCM, 2004.

[18]

Stephen A Edwards. The Challenges of Synthesizing Hardware from C-Like Languages. IEEE Design & Test of Computers, 23(5):375--386, 2006.

[19]

Vladimir Gajinov, Srdjan Stipic, Osman S Unsal, Tim Harris, Eduard Ayguadé, and Adrián Cristal. Supporting stateful tasks in a dataflow graph. In Proc. of PACT, 2012.

[20]

Nithin George, HyoukJoong Lee, David Novo, Tiark Rompf, Kevin J Brown, Arvind K Sujeeth, Martin Odersky, Kunle Olukotun, and Paolo Ienne. Hardware system synthesis from Domain-Specific Languages. In Proc. of FPL, pages 1--8. IEEE, 2014.

[21]

Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. Bundled execution of recurring traces for energy-efficient general purpose processing. In Proc. of the 44th MICRO, 2011.

[22]

James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Hanrahan. Darkroom - compiling high-level image processing code into hardware pipelines. ACM Trans. Graph., 33(4):1--11, 2014.

[23]

S Hu, I Kim, M H Lipasti, and J E Smith. An approach for implementing efficient superscalar CISC processors. In Proc. of the 12th HPCA, 2006.

[24]

Adam Izraelevitz, Jack Koenig, Patrick Li, Richard Lin, Angie Wang, Albert Magyar, Donggyu Kim, Colin Schmidt, Chick Markley, Jim Lawson, et al. Reusability is firrtl ground: Hardware construction languages, compiler frameworks, and transformations. In Proceedings of the 36th International Conference on Computer-Aided Design, pages 209--216. IEEE Press, 2017.

[25]

Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically scheduled high-level synthesis. In Proc. of the FPGA, 2018.

[26]

N Kapre and H Patel. Applying Models of Computation to OpenCL Pipes for FPGA Computing. Proceedings of the 5th International Workshop on OpenCL, 2017.

[27]

John Kessenich, Graham Sellers, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 4.5 with SPIR-V. Addison-Wesley Professional, 2016.

[28]

David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Spatial: A language and compiler for application accelerators. In Proceedings of the PLDI, 2018.

[29]

David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. In Proc. of the 43rd ISCA, pages 115--127, 2016.

[30]

Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. Hpvm: Heterogeneous parallel virtual machine. In Proc. of the 23rd PPOPP, 2018.

[31]

Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. Heterocl: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proc. of FPGA, 2019.

[32]

Maysam Lavasani. Generating irregular data-stream accelerators: methodology and applications. PhD thesis, 2015.

[33]

Chris Leary and Todd Wang. Xla: Tensorflow, compiled! TensorFlow Dev Summit, Feb 2017.

[34]

Charles E Leiserson. The Cilk++ concurrency platform. The Journal of Supercomputing, 51(3):244--257, 2010.

[35]

Derek Lockhart, Gary Zibrat, and Christopher Batten. PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. In Proc. of the 47th MICRO, pages 280--292, 2014.

[36]

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proc. of the 22nd HPCA, 2016.

[37]

Razvan Nane, Vlad Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu-Ting Chen, Hsuan Hsiao, Stephen Dean Brown, Fabrizio Ferrandi, Jason Helge Anderson, and Koen Bertels. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Trans. on CAD of Integrated Circuits and Systems, 35(10):1591--1604, 2016.

[38]

D. H. Noronha, B. Salehpour, and S. J. E. Wilton. LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks. ArXiv e-prints, July 2018.

[39]

Louis-Noel Pouchet and Uday Bondugula. Polybench 3.2. 2013. http://www.cse.ohio-state.edu/~pouchet/software/polybench.

[40]

Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating Configurable Hardware from Parallel Patterns. In Proc. of the 21st ASPLOS, 2016.

[41]

Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proc. of the 44th ISCA, 2017.

[42]

Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson, Jonathan Ragan-Kelley, and Mark Horowitz. Programming Heterogeneous Systems from an Image Processing DSL. TACO, 14(3):1--25, 2017.

[43]

Andrew Putnam. FPGAs in the Datacenter - Combining the Worlds of Hardware and Software Development. ACM Great Lakes Symposium on VLSI, 2017.

[44]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P Amarasinghe. Halide - a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proc. of PLDI, 2013.

[45]

Brandon Reagen, Robert Adolf, Sophia Yakun Shao, Gu-Yeon Wei, and David Brooks. MachSuite: Benchmarks for accelerator design and customized architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.

[46]

Oliver Reiche, Moritz Schmid, Frank Hannig, Richard Membarth, and Jürgen Teich. Code generation from a domain-specific language for C-based HLS of hardware accelerators. In Proc. of CODES+ISSS, pages 1--10, New York, New York, USA, 2014. ACM Press.

[47]

Hongbo Rong. Programmatic Control of a Compiler for Generating High-performance Spatial Hardware. In arXiv.org, November 2017.

[48]

Sameer D. Sahasrabuddhe, Sreenivas Subramanian, Kunal P. Ghosh, Kavi Arya, and Madhav P. Desai. A c-to-rtl flow as an energy efficient alternative to embedded processors in digital systems. In 13th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, DSD 2010, 1-3 September 2010, Lille, France, 2010.

[49]

Tao B Schardl, William S Moses, and Charles E Leiserson. Tapir - Embedding Fork-Join Parallelism into LLVM's Intermediate Representation. In Proc. of PPOPP, 2017.

[50]

Prakalp Srivastava, Rakesh Komuravelli, Sarita Adve, Maria Kotsifakou, Matthew D Sinclair, and Vikram Adve. HPVM: heterogeneous parallel virtual machine. In Proc. of ACM SIGPLAN Notices, pages 68--80. ACM, March 2018.

[51]

James Stanier and Des Watson. Intermediate representations in imperative compilers: A survey. ACM Comput. Surv.

[52]

Arvind K Sujeeth, Kevin J Brown, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. Delite - A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embedded Comput. Syst., 2014.

[53]

Richard Townsend, Martha A Kim, and Stephen A Edwards. From functional programs to pipelined dataflow circuits. In Proceedings of the 26th International Conference on Compiler Construction, pages 76--86. ACM, 2017.

[54]

Ken Traub, James Hicks, and Shail Aditya. A dataflow compiler substrate. 1991.

[55]

Hasitha Muthumala Waidyasooriya, Masanori Hariyama, and Kunio Uchiyama. FPGA-Oriented Parallel Programming. In Design of FPGA-Based Computing Systems with OpenCL. October 2017.

[56]

Ali Mustafa Zaidi and David Greaves. A new dataflow compiler ir for accelerating control-intensive code in spatial hardware. In 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014.

[57]

Sizhuo Zhang, Andrew Wright, Thomas Bourgeat, and Arvind. Composable building blocks to open up processor design. In Proc. of the 51st MICRO, 2018.

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 2019

1104 pages

ISBN:9781450369381

DOI:10.1145/3352460

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '52

Sponsor:

SIGMICRO

MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 12 - 16, 2019

Columbus, OH, USA

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

Total Citations: 23
Total Downloads: 1,204
Downloads (Last 12 months): 90
Downloads (Last 6 weeks): 11

Reflects downloads up to 05 Mar 2025

Cited By

Kim C, Li P, Mohan A, Butt A, Sampson A, Nigam R (2024). Unifying Static and Dynamic Intermediate Languages for Accelerator Generators. Proceedings of the ACM on Programming Languages 8:OOPSLA2, 2242-2267. Online publication date: 8-Oct-2024.
https://dl.acm.org/doi/10.1145/3689790
Lu L, Luo Z, Zheng S, Yin J, Cong J, Liang Y, Yin J (2024). Rubick: A Unified Infrastructure for Analyzing, Exploring, and Implementing Spatial Architectures via Dataflow Decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4, 1177-1190. Online publication date: Apr-2024.
https://doi.org/10.1109/TCAD.2023.3337208
Xu R, Luo J, Zhang Y, Lin Y, Wang R, Huang R, Liang Y (2024). Hestia: An Efficient Cross-Level Debugger for High-Level Synthesis. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 765-779. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00062
Agarwal N, Fream M, Ghosh S, Schwedock B, Beckmann N (2024). UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures. IEEE Computer Architecture Letters 23:1, 99-103. Online publication date: Jan-2024.
https://doi.org/10.1109/LCA.2023.3342130
Majumder K, Bondhugula U, Aamodt T, Swift M, Jerger N (2023). HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, 189-201. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3623278.3624767
Laeufer K, Iyer V, Biancolin D, Bachrach J, Nikolić B, Sen K, Aamodt T, Jerger N, Swift M (2023). Simulator Independent Coverage for RTL Hardware Languages. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 606-615. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3582016.3582019
Wang S, Coffman H, Mayer K, Garg S, Renau J, Verbrugge C, Lhoták O, Shen X (2023). A Multi-threaded Fast Hardware Compiler for HDLs. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 25-36. Online publication date: 17-Feb-2023.
https://dl.acm.org/doi/10.1145/3578360.3580254
Feldtkeller J, Sasdrich P, Güneysu T (2023). Challenges and Opportunities of Security-Aware EDA. ACM Transactions on Embedded Computing Systems 22:3, 1-34. Online publication date: 19-Apr-2023.
https://dl.acm.org/doi/10.1145/3576199
Han S, Jang M, Kang J, Aamodt T, Jerger N, Swift M (2023). ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 702-717. Online publication date: 27-Jan-2023.
https://dl.acm.org/doi/10.1145/3575693.3575701
Vahdatniya P, Sharifian A, Hojabr R, Shriraman A, Kloeckner A, Moreira J (2022). mu-grind. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 346-358. Online publication date: 8-Oct-2022.
https://dl.acm.org/doi/10.1145/3559009.3569671
