research-article

Branch vanguard: decomposing branch functionality into prediction and resolution instructions

Authors:
Daniel S. McFarlin, Carnegie Mellon University
Craig Zilles, University of Illinois at Urbana-Champaign

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, Pages 323–335
https://doi.org/10.1145/2749469.2750400

Published: 13 June 2015

ABSTRACT

While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. As architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.
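As a rough illustration of the decomposition described above, the following toy Python model splits one branch into a "predict" step that chooses a path and a later "resolve" step that checks the real condition and falls back to correction code on a misprediction. This is only a sketch under assumptions: the function names, the `predicted_taken` parameter, and the recovery scheme are hypothetical and are not taken from the paper's instruction definitions.

```python
def slow_condition(v):
    """Stand-in for a branch condition whose inputs arrive late (e.g. after a load)."""
    return v % 3 == 0


def original(x, v):
    """Conventional branch: neither path can start until the condition is known."""
    if slow_condition(v):
        return x * 2 + 1  # taken path
    return x - 3          # not-taken path


def decomposed(x, v, predicted_taken):
    """Hypothetical predict/resolve split of the same branch (illustrative only)."""
    if predicted_taken:              # "predict": steer control flow using the guess
        speculative = x * 2 + 1      # taken-path work placed ahead of the resolution
        if not slow_condition(v):    # "resolve": check the real condition late
            return x - 3             # correction code after a misprediction
        return speculative
    speculative = x - 3              # not-taken-path work placed ahead of the resolution
    if slow_condition(v):            # "resolve" for the not-taken prediction
        return x * 2 + 1             # correction code after a misprediction
    return speculative
```

Whatever the prediction, `decomposed` returns the same value as `original`; the prediction only affects which path's work is done before the condition is checked. In Python the reordering brings no speedup, so the model shows the dependence structure a scheduler could exploit rather than the performance effect.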

We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a geometric-mean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. As floating point benchmarks contain fewer unbiased but predictable branches, our geometric-mean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
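For readers unfamiliar with the metric, the sketch below shows how a geometric mean of per-benchmark speedup ratios is typically computed. The `geomean` helper and the input values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
import math


def geomean(ratios):
    """Geometric mean of positive speedup ratios (1.10 means 10% faster)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # Average in log space, then map back; equivalent to the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-benchmark speedups, NOT measurements from the paper.
example_ratios = [1.35, 1.02, 1.10, 1.00]
print(f"geomean speedup: {(geomean(example_ratios) - 1) * 100:.1f}%")
```

The geometric mean is the usual choice for averaging ratios across a benchmark suite because a single large outlier pulls it up less than it would an arithmetic mean.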


Index Terms

Branch vanguard: decomposing branch functionality into prediction and resolution instructions
1. Hardware
  1. Hardware validation
  2. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific instruction set processors
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Control structures


Published in
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN:9781450334020
DOI:10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)
ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%
