
Analyzing Behavior Specialized Acceleration

Published: 25 March 2016

ABSTRACT

Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general-purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute code with only certain properties but remain largely configurable or programmable. The most important strength of BSAs -- their ability to target a wide variety of codes -- also makes their interactions and analysis complex, raising the following questions: can multiple BSAs be composed synergistically, what are their interactions with the general-purpose core, and which combinations favor which workloads? From a methodological standpoint, BSAs are also challenging to study, as each requires ISA development, compiler and assembler extensions, and either simulator or RTL models.

To study the potential of BSAs, we propose a novel modeling technique called the Transformable Dependence Graph (TDG) -- a higher-level alternative to the time-consuming traditional compiler+simulator approach that still enables detailed microarchitectural models for both general-purpose cores and accelerators. We then propose a multi-BSA organization, called ExoCore, which we model and study using the TDG. A design space exploration reveals that an ExoCore organization can push designs beyond the established energy-performance frontiers for general-purpose cores. For example, a 2-wide OOO processor with three BSAs matches the performance of a conventional 6-wide OOO core, has 40% lower area, and is 2.6x more energy efficient.
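The TDG belongs to a family of graph-based performance models in which a dynamic instruction trace becomes a dependence graph: nodes represent execution events, weighted edges encode data dependences and microarchitectural constraints, and modeled execution time is the longest (critical) path through the graph. As a minimal illustrative sketch -- not the paper's actual implementation, with all node numbering and latencies invented here -- such a critical-path computation might look like:

```python
# Hypothetical sketch of a dependence-graph timing model: instructions are
# nodes, constraints are weighted edges, and the modeled execution time is
# the longest path through the resulting DAG.
from collections import defaultdict

def critical_path_length(edges, num_nodes):
    """Longest-path length through a DAG given as (src, dst, latency) edges.

    Nodes are integers 0..num_nodes-1 in topological order, which holds
    naturally for a dynamic trace (edges only point forward in time).
    """
    finish = [0] * num_nodes          # earliest finish time of each node
    succs = defaultdict(list)
    for src, dst, lat in edges:
        succs[src].append((dst, lat))
    # One forward pass suffices because nodes are topologically ordered.
    for node in range(num_nodes):
        for dst, lat in succs[node]:
            finish[dst] = max(finish[dst], finish[node] + lat)
    return max(finish)

# Tiny made-up trace: 0 -> 1 -> 3 via 1-cycle ALU dependences, and
# 0 -> 2 -> 3 where node 2 feeds node 3 through a 4-cycle load.
edges = [(0, 1, 1), (1, 3, 1), (0, 2, 1), (2, 3, 4)]
print(critical_path_length(edges, 4))  # -> 5 (the load path dominates)
```

The "transformable" aspect of the TDG would then correspond to rewriting edge weights or subgraphs to reflect how a given BSA executes a code region, and re-evaluating the critical path.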


Published in

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte; Program Chair: Yuanyuan Zhou

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Qualifiers

        • research-article

        Acceptance Rates

ASPLOS '16 paper acceptance rate: 53 of 232 submissions, 23%. Overall acceptance rate: 535 of 2,713 submissions, 20%.
