research-article

Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?

Authors:

Daniel S. McFarlin,

Charles Tucker,

Craig Zilles

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

Pages 241 - 252

https://doi.org/10.1145/2451116.2451143

Published: 16 March 2013

Abstract

In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: speculation, meaning the improved schedules enabled by OOO hardware speculation support, and dynamism, meaning the ability of OOO hardware to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time, and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
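The headline numbers in the abstract can be unpacked with a little arithmetic. The sketch below is illustrative only and rests on one assumption that the abstract does not spell out: that "88% of that speedup" means 88% of the 53-percentage-point gain over the in-order baseline. The function and variable names are hypothetical and do not come from the paper.

```python
# Illustrative arithmetic only. Assumption (not stated in the abstract):
# "88% of that speedup" is read as 88% of the 53-point gain over in-order.

def static_schedule_speedup(ooo_speedup: float, fraction_captured: float) -> float:
    """Speedup over in-order attributed to a single 'best' static schedule."""
    return fraction_captured * ooo_speedup

ooo_speedup = 0.53        # OOO vs. similarly provisioned in-order machine
fraction_captured = 0.88  # share of that speedup reached by one static schedule

static_speedup = static_schedule_speedup(ooo_speedup, fraction_captured)
dynamism_gap = ooo_speedup - static_speedup

# Relative advantage OOO retains over a machine running the static schedule.
ooo_over_static = (1 + ooo_speedup) / (1 + static_speedup) - 1

print(f"static schedule vs. in-order: {static_speedup:.1%}")
print(f"gap left for dynamism:        {dynamism_gap:.1%}")
print(f"OOO vs. static schedule:      {ooo_over_static:.1%}")
```

Under this reading, the static schedule accounts for roughly 47 of the 53 points, leaving about 6 points attributable to dynamism.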

Index Terms

Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

March 2013

574 pages

ISBN:9781450318709

DOI:10.1145/2451116

General Chair: Vivek Sarkar, Rice University, USA
Program Chair: Rastislav Bodik, University of California, Berkeley, USA

ACM SIGARCH Computer Architecture News Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN:0163-5964
DOI:10.1145/2490301
ACM SIGPLAN Notices Volume 48, Issue 4
ASPLOS '13
April 2013
540 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2499368

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '13

ASPLOS '13: Architectural Support for Programming Languages and Operating Systems

March 16 - 20, 2013

Houston, Texas, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
