Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Authors:

Michael Pellauer,

Angshuman Parashar,

Kermin Fleming,

Tushar Krishna,

Stephen Maresh,

Vladimir Pavlov,

Joel Emer

ACM Transactions on Computer Systems (TOCS), Volume 33, Issue 3

Article No.: 10, Pages 1 - 32

https://doi.org/10.1145/2754930

Published: 11 September 2015

Abstract

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional coherent shared memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable-latency events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading.
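To make the two ideas concrete, the following is a minimal Python sketch of a single PE that merges two sorted streams. It is an illustrative toy model, not the instruction set or microarchitecture evaluated in the article. The names `Channel`, `run_merge`, `END`, `stall_a`, and `stall_b` are hypothetical, as are the channel capacity of 2, the end-of-stream sentinel, and the fixed-priority choice among ready instructions. Each instruction is a (trigger, action) pair: the trigger is a guard over predicate state and channel status, and there is no program counter and no branch.

```python
from collections import deque

END = float("inf")  # assumed end-of-stream sentinel; data must be smaller


class Channel:
    """Bounded FIFO standing in for a latency-insensitive channel."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = deque()

    def can_deq(self):
        return bool(self.items)

    def can_enq(self):
        return len(self.items) < self.capacity

    def head(self):
        return self.items[0]

    def deq(self):
        return self.items.popleft()

    def enq(self, value):
        self.items.append(value)


def run_merge(stream_a, stream_b, stall_a=1, stall_b=1):
    """Merge two sorted streams; stall_x = cycles between producer tokens."""
    a, b, out = Channel(), Channel(), Channel()
    pending_a = deque(list(stream_a) + [END])
    pending_b = deque(list(stream_b) + [END])
    pred = {"valid": False, "a_le_b": False}  # predicate registers
    result = []

    def compare():
        pred["a_le_b"] = a.head() <= b.head()
        pred["valid"] = True

    def send_a():
        out.enq(a.deq())
        pred["valid"] = False

    def send_b():
        out.enq(b.deq())
        pred["valid"] = False

    # Triggered instructions: (trigger, action). State transitions happen
    # by updating predicates, not by branching.
    program = [
        (lambda: not pred["valid"] and a.can_deq() and b.can_deq(), compare),
        (lambda: pred["valid"] and pred["a_le_b"] and out.can_enq(), send_a),
        (lambda: pred["valid"] and not pred["a_le_b"] and out.can_enq(), send_b),
    ]

    cycle = 0
    while True:
        # Producers deliver tokens at their own pace, if there is room.
        if pending_a and cycle % stall_a == 0 and a.can_enq():
            a.enq(pending_a.popleft())
        if pending_b and cycle % stall_b == 0 and b.can_enq():
            b.enq(pending_b.popleft())
        # The downstream consumer drains at most one token per cycle.
        if out.can_deq():
            result.append(out.deq())
        # Fire the first instruction whose trigger holds (assumed priority).
        fired = False
        for trigger, action in program:
            if trigger():
                action()
                fired = True
                break
        cycle += 1
        if not fired and not pending_a and not pending_b and not out.can_deq():
            break
    return [v for v in result if v != END]
```

Calling `run_merge([1, 4, 9], [2, 3, 10])` and `run_merge([1, 4, 9], [2, 3, 10], stall_a=3, stall_b=5)` yields the same merged list in both cases. Only the cycle count differs, which is the sense in which the channels are latency insensitive in this toy model.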

Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.

Index Terms

Computer systems organization → Architectures → Other architectures

Published In

ACM Transactions on Computer Systems Volume 33, Issue 3

September 2015

140 pages

ISSN: 0734-2071

EISSN: 1557-7333

DOI: 10.1145/2818727

Editor:
Todd C. Mowry
Carnegie Mellon University, Pittsburgh, PA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2015

Accepted: 01 March 2015

Received: 01 December 2014

Published in TOCS Volume 33, Issue 3

Qualifiers

Research-article
Research
Refereed
