DOI: 10.1145/2451116.2451143

Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?

Published: 16 March 2013

Abstract

In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
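
As a rough reading of the headline numbers, the split between static scheduling and dynamism can be worked out directly. The sketch below assumes that "88% of that speedup" refers to 88% of the 53% gain over the in-order baseline, not to 88% of the absolute performance ratio. The function name `split_speedup` and its result fields are illustrative and do not come from the paper.

```python
def split_speedup(ooo_speedup: float, static_fraction: float) -> dict:
    """Split an OOO speedup into a static-schedule part and a remainder.

    Assumption: static_fraction applies to the speedup increment
    (e.g. 0.53), not to the absolute performance ratio (e.g. 1.53).
    """
    static_part = ooo_speedup * static_fraction
    dynamic_part = ooo_speedup - static_part
    return {
        # Speedup over in-order reached by the single best static schedule.
        "static_speedup": static_part,
        # Speedup left over, attributed here to dynamism.
        "dynamic_remainder": dynamic_part,
        # Performance of the static schedule relative to full OOO.
        "static_vs_ooo_perf": (1.0 + static_part) / (1.0 + ooo_speedup),
    }


result = split_speedup(ooo_speedup=0.53, static_fraction=0.88)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, the single best static schedule recovers a speedup of about 46.6% over the in-order machine, which leaves roughly 6.4 percentage points attributable to dynamism.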

Published In

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
March 2013
574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116
  • ACM SIGARCH Computer Architecture News, Volume 41, Issue 1
    ASPLOS '13
    March 2013
    540 pages
    ISSN: 0163-5964
    DOI: 10.1145/2490301
  • ACM SIGPLAN Notices, Volume 48, Issue 4
    ASPLOS '13
    April 2013
    540 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/2499368
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic scheduling
  2. hw/sw co-design

Qualifiers

  • Research-article

Conference

ASPLOS '13

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
