DOI: 10.1145/2749469.2750390
research-article

Efficient execution of memory access phases using dataflow specialization

Published: 13 June 2015

Abstract

This paper identifies a new opportunity for improving the efficiency of a processor core: the memory access phases of programs. These are dynamic regions of a program in which most instructions are devoted to memory access or address computation. Such phases occur naturally because of workload properties; they are also induced when an in-core accelerator is employed, since the code left to execute on the core is then access code. We observe that such code requires an out-of-order (OOO) core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes considerable power, which increases energy consumption and reduces the energy efficiency of in-core accelerators.
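To make the definition of a memory access phase concrete, here is a minimal sketch of how a window of a dynamic instruction trace might be labeled. The instruction categories, the helper names `access_fraction` and `is_access_phase`, and the 0.5 threshold (a literal reading of "most") are illustrative assumptions, not the paper's methodology.

```python
from collections import Counter

# Hypothetical instruction categories; the paper's own classification may differ.
# "addr" stands for an address-computation instruction.
ACCESS_KINDS = {"load", "store", "addr"}


def access_fraction(window):
    """Return the fraction of instructions in `window` that are access-related."""
    if not window:
        return 0.0
    counts = Counter(window)
    return sum(counts[kind] for kind in ACCESS_KINDS) / len(window)


def is_access_phase(window, threshold=0.5):
    """Label a window as an access phase if access-related work exceeds `threshold`.

    The 0.5 default is an assumption standing in for "most of the instructions".
    """
    return access_fraction(window) > threshold


# A made-up trace window: 6 of 8 instructions are access-related.
trace = ["addr", "load", "addr", "load", "alu", "store", "addr", "branch"]
print(access_fraction(trace))  # 0.75
print(is_access_phase(trace))  # True
```

A real detector would operate on decoded instructions and would also need a rule for choosing window boundaries; both are left out of this sketch.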
We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it, we build a specialized engine that provides an OOO core's performance at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its induced phase, providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integrating it with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show that, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4×, and equivalent performance, respectively. It provides 0.8×, 0.6×, and 0.4× lower energy.
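The idea of event-condition-action rules driving an access phase can be illustrated with a small software analogy. The sketch below is not the MAD hardware design or its encoding. The `Rule` class, the event names `addr_ready` and `load_done`, and the strided-load example are all hypothetical, chosen only to show an event triggering a rule, a condition guarding it, and an action that performs a dataflow-style address computation and raises further events.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    event: str                           # name of the triggering event
    condition: Callable[[Dict], bool]    # predicate over the engine state
    action: Callable[[Dict], List[str]]  # updates state, returns newly raised events


def run(rules, state, initial_events):
    """Process events in FIFO order until none remain; return the event log."""
    queue = list(initial_events)
    log = []
    while queue:
        event = queue.pop(0)
        log.append(event)
        for rule in rules:
            if rule.event == event and rule.condition(state):
                queue.extend(rule.action(state))
    return log


def issue_load(s):
    # Dataflow-style address computation: base + i * stride.
    s["value"] = s["mem"][s["base"] + s["i"] * s["stride"]]
    return ["load_done"]


def forward_to_accelerator(s):
    # Explicit action: hand the loaded value to a (hypothetical) accelerator input.
    s["out"].append(s["value"])
    s["i"] += 1
    return ["addr_ready"]


rules = [
    Rule("addr_ready", lambda s: s["i"] < s["n"], issue_load),
    Rule("load_done", lambda s: True, forward_to_accelerator),
]

state = {"mem": {100: 7, 104: 8, 108: 9}, "base": 100, "stride": 4,
         "i": 0, "n": 3, "out": []}
log = run(rules, state, ["addr_ready"])
print(state["out"])  # [7, 8, 9]
```

The loop ends on its own: once `i` reaches `n`, the condition on the `addr_ready` rule fails, no new event is raised, and the queue drains.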

      Published In

      ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
      June 2015
      768 pages
      ISBN:9781450334020
      DOI:10.1145/2749469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Conference

      ISCA '15

      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%

      Article Metrics

      • Downloads (Last 12 months): 42
      • Downloads (Last 6 weeks): 2
      Reflects downloads up to 05 Mar 2025

      Cited By

      • (2024) METAL: Caching Multi-level Indexes in Domain-Specific Architectures. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 715-729. DOI: 10.1145/3620665.3640402. Online publication date: 27-Apr-2024
      • (2024) Terminus: A Programmable Accelerator for Read and Update Operations on Sparse Data Structures. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1233-1246. DOI: 10.1109/MICRO61859.2024.00092. Online publication date: 2-Nov-2024
      • (2023) Decoupled Vector Runahead. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 17-31. DOI: 10.1145/3613424.3614255. Online publication date: 28-Oct-2023
      • (2023) Cohort: Software-Oriented Acceleration for Heterogeneous SoCs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 105-117. DOI: 10.1145/3582016.3582059. Online publication date: 25-Mar-2023
      • (2022) Tiny but mighty. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 817-830. DOI: 10.1145/3470496.3527400. Online publication date: 18-Jun-2022
      • (2022) X-cache. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 396-409. DOI: 10.1145/3470496.3527380. Online publication date: 18-Jun-2022
      • (2022) Technical Difficulties and Development Trend. Software Defined Chips, pp. 135-166. DOI: 10.1007/978-981-19-7636-0_3. Online publication date: 15-Nov-2022
      • (2021) Fast Key-Value Lookups with Node Tracker. ACM Transactions on Architecture and Code Optimization, 18:3, pp. 1-26. DOI: 10.1145/3452099. Online publication date: 8-Jun-2021
      • (2021) Vector Runahead. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 195-208. DOI: 10.1109/ISCA52012.2021.00024. Online publication date: Jun-2021
      • (2020) I Think Therefore You Are. ACM Transactions on Cyber-Physical Systems, 4:4, pp. 1-25. DOI: 10.1145/3375403. Online publication date: 18-Jun-2020
