DOI: 10.1145/2749469.2750380
research-article

Exploring the potential of heterogeneous von Neumann/dataflow execution models

Published: 13 June 2015

Abstract

General purpose processors (GPPs), from small in-order designs to many-issue out-of-order designs, incur large power overheads that must be addressed for future technology generations. Major sources of overhead include structures that dynamically extract the data-dependence graph or maintain precise state. For irregular workloads, current specialization approaches either heavily curtail performance or provide too little benefit. Interestingly, well-known explicit-dataflow architectures eliminate these overheads by directly executing the data-dependence graph and eschewing instruction-precise recoverability. However, even after decades of research, dataflow architectures have yet to come into prominence as a solution. We attribute this to a lack of effective control speculation and the latency overhead of explicit communication, which is crippling for certain codes.
This paper makes the observation that if both out-of-order and explicit-dataflow execution were available in one processor, many types of GPP cores could benefit from dynamically switching between them during certain phases of an application's lifetime. Analysis reveals that an ideal explicit-dataflow engine could be profitable for more than half of instructions, providing significant performance and energy improvements. The challenge is to achieve these benefits without introducing excess hardware complexity. To this end, we propose the Specialization Engine for Explicit-Dataflow (SEED). Integrated with an in-order core, SEED provides 1.67× performance and 1.65× energy benefits; with a dual-issue out-of-order (OOO) core, 1.33× and 1.70×; and with a quad-issue OOO core, 1.14× and 1.54×.

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate (ISCA): 543 of 3,203 submissions, 17%
