research-article

Analyzing Behavior Specialized Acceleration

Authors:

Karthikeyan Sankaralingam

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 697 - 711

https://doi.org/10.1145/2872362.2872412

Published: 25 March 2016

Abstract

Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with only certain properties, but remain largely configurable or programmable. The most important strength of BSAs -- their ability to target a wide variety of codes -- also makes their interactions and analysis complex, raising the following questions: can multiple BSAs be composed synergistically, what are their interactions with the general purpose core, and what combinations favor which workloads? From a methodological standpoint, BSAs are also challenging, as they each require ISA development, compiler and assembler extensions, and either simulator or RTL models.

To study the potential of BSAs, we propose a novel modeling technique called the Transformable Dependence Graph (TDG), a higher-level alternative to the time-consuming traditional compiler+simulator approach that still enables detailed microarchitectural models for both general-purpose cores and accelerators. We then propose a multi-BSA organization, called ExoCore, which we model and study using the TDG. A design space exploration reveals that an ExoCore organization can push designs beyond the established energy-performance frontiers for general-purpose cores. For example, a 2-wide out-of-order (OOO) processor with three BSAs matches the performance of a conventional 6-wide OOO core, has 40% lower area, and is 2.6x more energy efficient.
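
To make the dependence-graph idea concrete, the following is a minimal sketch and not the paper's TDG implementation. It assumes a program trace can be represented as a directed acyclic graph whose edges carry latencies, estimates execution time as the longest path through that graph, and models an accelerator by rewriting the latencies of edges that enter an offloaded region. The names `critical_path` and `offload`, the node labels, and all latency values are hypothetical and chosen only for illustration.

```python
def critical_path(nodes, edges):
    """Return the longest-path length through a dependence DAG.

    nodes: node names in topological (e.g. program) order.
    edges: (src, dst, latency) tuples with src appearing before dst.
    """
    incoming = {n: [] for n in nodes}
    for src, dst, latency in edges:
        incoming[dst].append((src, latency))

    ready = {}  # earliest time at which each node can occur
    for n in nodes:
        ready[n] = max((ready[s] + lat for s, lat in incoming[n]), default=0)
    return max(ready.values(), default=0)


def offload(edges, region, latency):
    """Hypothetical transform: edges entering `region` get a new latency."""
    return [(s, d, latency if d in region else lat) for s, d, lat in edges]


# Illustrative four-instruction trace; the latencies are made up.
nodes = ["i0", "i1", "i2", "i3"]
edges = [("i0", "i1", 3), ("i1", "i2", 3), ("i2", "i3", 1), ("i0", "i3", 2)]

baseline = critical_path(nodes, edges)
accelerated = critical_path(nodes, offload(edges, {"i1", "i2"}, 1))
print(baseline, accelerated)  # 7 3
```

Comparing the two path lengths gives a rough speedup estimate for the offloaded region. A detailed model would also need nodes for pipeline stages, resource limits, and energy events, all of which this sketch omits.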

Index Terms

Analyzing Behavior Specialized Acceleration
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published In

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

March 2016

824 pages

ISBN:9781450340915

DOI:10.1145/2872362

General Chair: Tom Conte, Georgia Tech, USA

Program Chair: Yuanyuan Zhou, University of California, San Diego, USA

ACM SIGPLAN Notices Volume 51, Issue 4
ASPLOS '16
April 2016
774 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2954679
Editor: Andy Gill, University of Kansas, Lawrence, KS

ACM SIGARCH Computer Architecture News Volume 44, Issue 2
ASPLOS '16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
Editor: Doug DeGroot

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

ASPLOS '16

ASPLOS '16: Architectural Support for Programming Languages and Operating Systems

April 2 - 6, 2016

Atlanta, Georgia, USA

Acceptance Rates

ASPLOS '16 Paper Acceptance Rate 53 of 232 submissions, 23%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%
