ABSTRACT
Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with only certain properties, but remain largely configurable or programmable. The most important strength of BSAs -- their ability to target a wide variety of codes -- also makes their interactions and analysis complex, raising the following questions: can multiple BSAs be composed synergistically, what are their interactions with the general purpose core, and what combinations favor which workloads? From a methodological standpoint, BSAs are also challenging, as they each require ISA development, compiler and assembler extensions, and either simulator or RTL models.
To study the potential of BSAs, we propose a novel modeling technique called the Transformable Dependence Graph (TDG) - a higher level alternative to the time-consuming traditional compiler+simulator approach, while still enabling detailed microarchitectural models for both general cores and accelerators. We then propose a multi-BSA organization, called ExoCore, which we model and study using the TDG. A design space exploration reveals that an ExoCore organization can push designs beyond the established energy-performance frontiers for general purpose cores. For example, a 2-wide OOO processor with three BSAs matches the performance of a conventional 6-wide OOO core, has 40% lower area, and is 2.6x more energy efficient.
- Parboil Benchmark Suite. impact.crhc.illinois.edu/parboil/parboil.aspx.Google Scholar
- Vertical Microbenchmarks. http://cs.wisc.edu/vertical/microbench.Google Scholar
- Renee St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In ISCA, 2014.Google Scholar
- Thomas Ball and James R. Larus. Efficient path profiling. In MICRO, 1996.Google ScholarDigital Library
- Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 2011.Google ScholarDigital Library
- Shekhar Borkar and Andrew A. Chien. The future of micro- processors. Commun. ACM, 54(5):67--77, 2011.Google ScholarDigital Library
- Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine- learning. In ASPLOS, 2014.Google ScholarDigital Library
- Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, and Krisztian Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In MICRO, 2004.Google ScholarDigital Library
- Rajagopalan Desikan, Doug Burger, and Stephen W. Keckler. Measuring experimental error in microprocessor simulation. In ISCA, 2001.Google Scholar
- Lieven Eeckhout. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture, 2010.Google ScholarDigital Library
- Peter A. Milder Eric S. Chung, James C. Hoe, and Ken Mai. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPUs? In MICRO '10.Google Scholar
- Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. SIGARCH Comput. Archit. News, 2011.Google ScholarDigital Library
- Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. In MICRO, 2012.Google ScholarDigital Library
- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst., 2009.Google ScholarDigital Library
- Brian Fields, Shai Rubin, and Rastislav Bodik. Focusing processor policies via critical-path prediction. In ISCA, 2001.Google Scholar
- Saturnino Garcia, Donghwan Jeon, Chris Louie, and Michael Bedford Taylor. Kremlin: Rethinking and rebooting gprof for the multicore age. In PLDI, 2011.Google ScholarDigital Library
- Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. Dyser: Unifying functionality and parallelism specialization for energy efficient computing. IEEE Micro, 2012.Google ScholarDigital Library
- Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, and David August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO, 2011.Google ScholarDigital Library
- Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ISCA '10.Google Scholar
- Mark Hempstead, Gu-Yeon Wei, and David Brooks. Navigo: An early-stage model to study power-contrained architectures and specialization. In Proceedings of Workshop on Modeling, Benchmarking, and Simulations (MoBS), 2009.Google Scholar
- Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In ISCA '10.Google Scholar
- R. Iyer. Accelerator-rich architectures: Implications, opportunities and challenges. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, 2012.Google Scholar
- Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor. Kismet: Parallel Speedup Estimates for Serial Programs. In OOPSLA, 2011.Google Scholar
- Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. Meet the walkers: Accelerating index traversals for in-memory databases. In MICRO, 2013.Google Scholar
- Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, and Dean M. Tullsen. Single-isa heterogeneous multi-core architectures: The potential for processor power reduction. In MICRO, 2003.Google ScholarDigital Library
- Rakesh Kumar, Dean M. Tullsen, and Norman P. Jouppi. Core architecture optimization for heterogeneous chip multiprocessors. In PACT, 2006.Google ScholarDigital Library
- Chunho Lee, M. Potkonjak, and W.H. Mangione-Smith. Me- diaBench: a tool for evaluating and synthesizing multimedia and communications systems. In MICRO, 1997.Google Scholar
- Jaewon Lee, Hanhwi Jang, and Jangwoo Kim. Rpstacks: Fast and accurate processor design space exploration using representative stall-event stacks. In MICRO, 2014.Google ScholarDigital Library
- Sheng Li, Jung-Ho Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, and N.P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO, 2009.Google ScholarDigital Library
- Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: a polyvalent machine learning accelerator. In ASPLOS, 2015.Google ScholarDigital Library
- Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. Composite Cores: Pushing heterogeneity into a core. In MICRO, 2012.Google Scholar
- J. Meng, V.A. Morozov, K. Kumaran, V. Vishwanath, and T.D. Uram. GROPHECY: GPU performance projection from CPU code skeletons. In SC'11. ACM, 2011.Google ScholarDigital Library
- Tipp Moseley, Dirk Grunwald, Daniel A Connors, Ram Ra- manujam, Vasanth Tovinkere, and Ramesh Peri. Loopprof: Dynamic techniques for loop detection and profiling. In Proceedings of the 2006 Workshop on Binary Instrumentation and Applications (WBIA), 2006.Google Scholar
- Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P Jouppi. Cacti 6.0: A tool to model large caches. HP Laboratories, 2009.Google Scholar
- Sandeep Navada, Niket K. Choudhary, Salil V. Wadhavkar, and Eric Rotenberg. A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In PACT, 2013.Google Scholar
- Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. Exploring the potential of heterogeneous Von Neumann/Dataflow execution models. In ISCA, 2015.Google ScholarDigital Library
- Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. Pushing the limits of accelerator efficiency while retaining programmability. In HPCA, 2016.Google ScholarCross Ref
- Tony Nowatzki, Venkatraman. Govindaraju, and Karthikeyan Sankaralingam. A graph-based program representation for analyzing hardware specialization approaches. Computer Architecture Letters, 2015.Google ScholarDigital Library
- Cedric Nugteren and Henk Corporaal. A modular and parameterisable classification of algorithms. Technical Report ESR-2011-02, Eindhoven University of Technology, 2011.Google Scholar
- Cedric Nugteren and Henk Corporaal. The boat hull model: adapting the roofline model to enable performance prediction for parallel computing. In PPOPP, 2012.Google Scholar
- Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. Trace based phase prediction for tightly-coupled heterogeneous cores. In MICRO, 2013.Google ScholarDigital Library
- Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A. Horowitz. Convolution engine: Balancing efficiency and flexibility in specialized computing. In ISCA, 2013.Google Scholar
- Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim Smith. Trace processors. In MICRO, 1997.Google ScholarDigital Library
- Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures. In ISCA, 2014.Google Scholar
- M. Shoaib Bin Altaf and D.A. Wood. LogCA: a performance model for hardware accelerators. Computer Architecture Letters, 2015.Google ScholarDigital Library
- Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. In PPoPP, 2012.Google ScholarDigital Library
- H. Singh, Ming-Hau Lee, Guangming Lu, F.J. Kurdahi, N. Bagherzadeh, and E.M. Chaves Filho. Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications. Computers, IEEE Trans- actions on, 2000.Google Scholar
- Tyler Sondag and Hridesh Rajan. Phase-based tuning for better utilization of performance-asymmetric multicore processors. In CGO, 2011.Google ScholarDigital Library
- Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten. Architectural specialization for inter-iteration loop dependence patterns. In MICRO, 2014.Google ScholarDigital Library
- Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. Wavescalar. In MICRO, pages 291--, 2003.Google Scholar
- Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling heterogeneous multicores through performance impact estimation (pie). In ISCA, 2012.Google Scholar
- Ashish Venkat and Dean M. Tullsen. Harnessing isa diversity: Design of a heterogeneous-isa chip multiprocessor. In ISCA, 2014.Google ScholarDigital Library
- Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation Cores: Reducing the Energy of Mature Computations. In ASPLOS '10.Google Scholar
- Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multi-core architectures. Commun. ACM, 2009.Google Scholar
- Jonathan A. Winter, David H. Albonesi, and Christine A. Shoemaker. Scalable thread scheduling and global power management for heterogeneous many-core architectures. In PACT, 2010.Google ScholarDigital Library
- Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. Q100: The architecture and design of a database processing unit. In ASPLOS, 2014.Google ScholarDigital Library
- T. Zidenberg, I. Keslassy, and U. Weiser. Optimal resource allocation with multiamdahl. Computer, 2013.Google Scholar
Index Terms
- Analyzing Behavior Specialized Acceleration
Recommendations
Analyzing Behavior Specialized Acceleration
ASPLOS '16Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with ...
Analyzing Behavior Specialized Acceleration
ASPLOS'16Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with ...
GPU Acceleration for Simulating Massively Parallel Many-Core Platforms
Emerging massively parallel architectures such as a general-purpose processor plus many-core programmable accelerators are creating an increasing demand for novel methods to perform their architectural simulation. Most state-of-the-art simulation ...
Comments