research-article

Efficient execution of memory access phases using dataflow specialization

Authors:

Karthikeyan Sankaralingam

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 118 - 130

https://doi.org/10.1145/2749469.2750390

Published: 13 June 2015

Abstract

This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. Such phases occur naturally because of workload properties, or they are induced when an in-core accelerator is employed, leaving the core to execute only access code. We observe that such code requires an out-of-order (OOO) core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes considerable power, which increases energy consumption and reduces the energy efficiency of in-core accelerators.

We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it, we build a specialized engine that provides an OOO core's performance at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integrating it with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show that, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4×, and equivalent performance, respectively. It provides 0.8×, 0.6×, and 0.4× lower energy, respectively.
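As a rough software analogy for the event-condition-action idea mentioned above, the following minimal sketch models a tiny rule table that keeps issuing loads while an index stays in bounds. It is not the paper's design: the names `Rule` and `run_rules`, the event strings, and the `state` dictionary are all hypothetical and chosen only for illustration, and the abstract does not specify how MAD actually encodes its rules.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List

State = Dict[str, int]


@dataclass
class Rule:
    """Hypothetical event-condition-action rule (illustrative only)."""
    event: str                            # event name that triggers the rule
    condition: Callable[[State], bool]    # predicate evaluated on the state
    action: Callable[[State], List[str]]  # updates state, returns new events


def run_rules(rules: List[Rule], state: State, initial: List[str]) -> List[str]:
    """Handle events in FIFO order and return the log of handled events."""
    queue: Deque[str] = deque(initial)
    log: List[str] = []
    while queue:
        event = queue.popleft()
        log.append(event)
        for rule in rules:
            if rule.event == event and rule.condition(state):
                queue.extend(rule.action(state))
    return log


def issue_load(state: State) -> List[str]:
    # Action: count one load, advance the index, signal completion.
    state["loads_issued"] += 1
    state["index"] += 1
    return ["load_done"]


def on_load_done(state: State) -> List[str]:
    # Action: a finished load makes the next address available.
    return ["addr_ready"]


rules = [
    # Event: address ready. Condition: index in bounds. Action: issue a load.
    Rule("addr_ready", lambda s: s["index"] < s["length"], issue_load),
    # Event: load done. Condition: always. Action: produce the next address.
    Rule("load_done", lambda s: True, on_load_done),
]

state = {"index": 0, "length": 3, "loads_issued": 0}
log = run_rules(rules, state, ["addr_ready"])
print(state["loads_issued"], len(log))
```

In this toy model the loop terminates once the bounds condition fails, so no explicit branch or program counter is needed; the control flow emerges from the rules themselves.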

Index Terms

Efficient execution of memory access phases using dataflow specialization
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 44
Total Downloads: 865
Downloads (Last 12 months): 42
Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Cited By

Kumar A, Prasanna A, Balkind J, Shriraman A, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). METAL: Caching Multi-level Indexes in Domain-Specific Architectures. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 715-729. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640402
Lee H, Sanchez D (2024). Terminus: A Programmable Accelerator for Read and Update Operations on Sparse Data Structures. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1233-1246. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00092
Naithani A, Roelandts J, Ainsworth S, Jones T, Eeckhout L (2023). Decoupled Vector Runahead. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 17-31. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614255
Wei T, Turtayeva N, Orenes-Vera M, Lonkar O, Balkind J, Aamodt T, Jerger N, Swift M (2023). Cohort: Software-Oriented Acceleration for Heterogeneous SoCs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 105-117. Online publication date: 25-Mar-2023.
https://dl.acm.org/doi/10.1145/3582016.3582059
Orenes-Vera M, Manocha A, Balkind J, Gao F, Aragón J, Wentzlaff D, Martonosi M, Salapura V, Zahran M, Chong F, Tang L (2022). Tiny but mighty. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 817-830. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527400
Sedaghati A, Hakimi M, Hojabr R, Shriraman A, Salapura V, Zahran M, Chong F, Tang L (2022). X-cache. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 396-409. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527380
Liu L, Wei S, Zhu J, Deng C (2022). Technical Difficulties and Development Trend. Software Defined Chips, pp. 135-166. Online publication date: 15-Nov-2022.
https://doi.org/10.1007/978-981-19-7636-0_3
Cavus M, Shatnawi M, Sendag R, Uht A (2021). Fast Key-Value Lookups with Node Tracker. ACM Transactions on Architecture and Code Optimization 18:3, pp. 1-26. Online publication date: 8-Jun-2021.
https://dl.acm.org/doi/10.1145/3452099
Naithani A, Ainsworth S, Jones T, Eeckhout L (2021). Vector Runahead. 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 195-208. Online publication date: Jun-2021.
https://doi.org/10.1109/ISCA52012.2021.00024
Esterle L, Brown J (2020). I Think Therefore You Are. ACM Transactions on Cyber-Physical Systems 4:4, pp. 1-25. Online publication date: 18-Jun-2020.
https://dl.acm.org/doi/10.1145/3375403
