Stream-Dataflow Acceleration

Authors:

Vinay Gangadhar,

Newsha Ardalani,

Karthikeyan Sankaralingam

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

Pages 416 - 429

https://doi.org/10.1145/3079856.3080255

Published: 24 June 2017

Abstract

Demand for low-power data processing hardware continues to rise inexorably. Existing programmable and "general purpose" solutions (e.g., SIMD, GPGPUs) are insufficient, as evidenced by the order-of-magnitude improvements and industry adoption of application- and domain-specific accelerators in important areas like machine learning, computer vision, and big data. The stark tradeoffs between efficiency and generality at these two extremes pose a difficult question: how could domain-specific hardware efficiency be achieved without domain-specific hardware solutions?

In this work, we rely on the insight that "acceleratable" algorithms have broad common properties: high computational intensity with long phases, simple control patterns and dependences, and simple streaming memory access and reuse patterns. We define a general architecture (a hardware-software interface), called stream-dataflow, which can more efficiently express programs with these properties. The dataflow component of this architecture enables high concurrency, and the stream component enables communication and coordination at very low power and area overhead. This paper explores the hardware and software implications, describes its detailed microarchitecture, and evaluates an implementation. Compared to a state-of-the-art domain-specific accelerator (DianNao) and fixed-function accelerators for MachSuite, Softbrain can match their performance with only 2x power overhead on average.
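To make the split between the two components concrete, the following is a minimal illustrative sketch in Python, not the paper's actual interface. The names `affine_stream` and `multiply_accumulate`, the flat `memory` list, and the dot-product kernel are all assumptions chosen for illustration. The stream function stands in for a simple strided memory access pattern, and the compute function stands in for a small dataflow graph that consumes operands as they arrive.

```python
from typing import Iterator, List


def affine_stream(memory: List[float], start: int, stride: int, length: int) -> Iterator[float]:
    """Hypothetical stream: yield `length` elements from `memory`,
    beginning at index `start` and stepping by `stride`."""
    for i in range(length):
        yield memory[start + i * stride]


def multiply_accumulate(a_stream: Iterator[float], b_stream: Iterator[float]) -> float:
    """Hypothetical dataflow graph: a multiply node feeding an accumulate node.

    The graph does no address arithmetic; it processes one pair of
    operands each time the two streams deliver them.
    """
    acc = 0.0
    for a, b in zip(a_stream, b_stream):
        acc += a * b
    return acc


# Two vectors stored back to back in one flat memory (an assumption of this sketch).
memory = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
a = affine_stream(memory, start=0, stride=1, length=4)
b = affine_stream(memory, start=4, stride=1, length=4)
result = multiply_accumulate(a, b)  # dot product of the two vectors
```

In this sketch the access pattern is described once, as a (start, stride, length) triple, rather than recomputed per element inside the compute loop. That separation is the property the abstract attributes to the stream component.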

Index Terms

Stream-Dataflow Acceleration
1. Computer systems organization
  1. Architectures
    1. Other architectures
    2. Parallel architectures
      1. Single instruction, multiple data


Published In

ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture

June 2017

736 pages

ISBN:9781450348928

DOI:10.1145/3079856

ACM SIGARCH Computer Architecture News Volume 45, Issue 2
ISCA'17
May 2017
715 pages
ISSN:0163-5964
DOI:10.1145/3140659
Editor:
Babak Falsafi
Interim

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE: IEEE Computer Society Technical Committee on Design Automation
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF CCF

Conference

ISCA '17

Sponsor:

IEEE
SIGARCH

ISCA '17: The 44th Annual International Symposium on Computer Architecture

June 24 - 28, 2017

Toronto, ON, Canada
