Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Published: 11 September 2015

Abstract

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional coherent shared memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading.
Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
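
To make the two mechanisms in the abstract concrete, the following is a minimal Python sketch, not the article's hardware or ISA. It models one PE that merges two sorted streams with no program counter: each cycle it fires the first instruction whose trigger holds. The names `Channel`, `MergePE`, `run_merge`, the `EOS` marker, the priority-ordered instruction list, and the `b_period` delay knob are all assumptions made for illustration. Triggers here inspect channel heads directly, which simplifies away the predicate registers a real design would likely use.

```python
from collections import deque

EOS = float("inf")  # assumed end-of-stream marker; inputs must be finite numbers


class Channel:
    """Bounded FIFO standing in for a latency-insensitive channel (assumed model)."""

    def __init__(self, capacity=2):
        self.fifo = deque()
        self.capacity = capacity

    def can_deq(self):
        return len(self.fifo) > 0

    def can_enq(self):
        return len(self.fifo) < self.capacity

    def head(self):
        return self.fifo[0]

    def deq(self):
        return self.fifo.popleft()

    def enq(self, value):
        self.fifo.append(value)


class MergePE:
    """PE without a program counter: a list of (trigger, action) pairs."""

    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out
        self.done = False  # a single predicate bit
        # Listed in priority order (an assumption of this sketch).
        self.instructions = [
            (lambda: self.a.head() == EOS and self.b.head() == EOS, self._finish),
            (lambda: self.a.head() <= self.b.head(),
             lambda: self.out.enq(self.a.deq())),
            (lambda: self.b.head() < self.a.head(),
             lambda: self.out.enq(self.b.deq())),
        ]

    def _finish(self):
        self.a.deq()
        self.b.deq()
        self.out.enq(EOS)
        self.done = True

    def step(self):
        """Fire at most one instruction; return True if one fired."""
        # Shared part of every trigger: inputs have data, output has space.
        ready = self.a.can_deq() and self.b.can_deq() and self.out.can_enq()
        if self.done or not ready:
            return False
        for trigger, action in self.instructions:
            if trigger():
                action()
                return True
        return False


def run_merge(xs, ys, b_period=1):
    """Merge sorted xs and ys; stream b delivers one token every b_period cycles."""
    a, b, out = Channel(), Channel(), Channel()
    pe = MergePE(a, b, out)
    src_a = deque(list(xs) + [EOS])
    src_b = deque(list(ys) + [EOS])
    result, cycle = [], 0
    while True:
        cycle += 1
        if out.can_deq():  # consumer drains one token per cycle
            value = out.deq()
            if value == EOS:
                return result, cycle
            result.append(value)
        pe.step()
        if src_a and a.can_enq():
            a.enq(src_a.popleft())
        if src_b and b.can_enq() and cycle % b_period == 0:
            b.enq(src_b.popleft())
```

Calling `run_merge([1, 4, 7], [2, 3, 9], b_period=1)` and again with `b_period=5` yields the same merged list in a different number of cycles. In this toy model that is what latency insensitivity amounts to: a slow producer only stalls the PE and never changes the result.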

    Published In

    ACM Transactions on Computer Systems, Volume 33, Issue 3
    September 2015
    140 pages
    ISSN: 0734-2071
    EISSN: 1557-7333
    DOI: 10.1145/2818727
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 September 2015
    Accepted: 01 March 2015
    Received: 01 December 2014
    Published in TOCS Volume 33, Issue 3

    Author Tags

    1. Spatial programming
    2. reconfigurable accelerators

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 05 Mar 2025

    Citations

    Cited By

    • (2023) A High-Frequency Load-Store Queue with Speculative Allocations for High-Level Synthesis. 2023 International Conference on Field Programmable Technology (ICFPT), 115-124. DOI: 10.1109/ICFPT59805.2023.00018. Online publication date: 12-Dec-2023
    • (2023) Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 1-9. DOI: 10.1109/FPL60245.2023.00009. Online publication date: 4-Sep-2023
    • (2022) LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 444-459. DOI: 10.1109/HPCA53966.2022.00040. Online publication date: Apr-2022
    • (2022) Position Synchronization Control Algorithm of Legged Robot Based on DSP Centralized Control. Mobile Networks and Applications 27:3, 955-964. DOI: 10.1007/s11036-022-01914-w. Online publication date: 1-Jun-2022
    • (2020) A Reconfigurable Branch Predictor for Spatial Computing Architectures. Proceedings of the 2020 4th International Conference on Digital Signal Processing, 295-299. DOI: 10.1145/3408127.3408168. Online publication date: 10-Sep-2020
    • (2019) Statistical Verification of Hyperproperties for Cyber-Physical Systems. ACM Transactions on Embedded Computing Systems 18:5s, 1-23. DOI: 10.1145/3358232. Online publication date: 8-Oct-2019
    • (2019) Robust Reachable Set. ACM Transactions on Embedded Computing Systems 18:5s, 1-22. DOI: 10.1145/3358229. Online publication date: 8-Oct-2019
    • (2019) ReachNN. ACM Transactions on Embedded Computing Systems 18:5s, 1-22. DOI: 10.1145/3358228. Online publication date: 8-Oct-2019
    • (2019) Polar. ACM Transactions on Embedded Computing Systems 18:5s, 1-22. DOI: 10.1145/3358227. Online publication date: 8-Oct-2019
    • (2019) MxU. ACM Transactions on Embedded Computing Systems 18:5s, 1-20. DOI: 10.1145/3358224. Online publication date: 8-Oct-2019
