ABSTRACT
While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems, and we describe a simple code transformation for exploiting this architectural support. Because architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.
We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a geometric-mean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. Because floating-point benchmarks contain fewer unbiased but predictable branches, our geometric-mean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
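The branch decomposition described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions: the function names (`original`, `decomposed`), the arithmetic on each path, and the use of a plain boolean `predicted_taken` in place of a hardware branch predictor are all hypothetical and are not taken from the paper. The sketch shows only the semantic idea that a prediction step chooses a path early, work is scheduled on that path before the condition is known, and a later resolution step repairs the result when the prediction was wrong.

```python
def original(cond: bool, x: int) -> int:
    """Conventional branch: the path is chosen only once cond is known."""
    if cond:
        return x + 1
    return x * 2


def decomposed(cond: bool, x: int, predicted_taken: bool) -> int:
    """Toy model of a branch split into a prediction and a resolution.

    Hypothetical illustration: 'predicted_taken' stands in for the
    hardware predictor's output, and the late test of 'cond' stands in
    for the resolution step.
    """
    if predicted_taken:
        # Prediction step steered us to the taken path.
        result = x + 1      # work scheduled ahead of the resolution
        if not cond:        # resolution: the prediction was wrong
            result = x * 2  # correction code recomputes the other path
    else:
        # Prediction step steered us to the not-taken path.
        result = x * 2      # work scheduled ahead of the resolution
        if cond:            # resolution: the prediction was wrong
            result = x + 1  # correction code recomputes the other path
    return result


def check_equivalence() -> bool:
    """The decomposed form must agree with the original for any prediction."""
    return all(
        decomposed(cond, x, pred) == original(cond, x)
        for cond in (False, True)
        for pred in (False, True)
        for x in range(-3, 4)
    )
```

In this model the prediction affects only which path runs first, never the final result, which is the property that makes scheduling across the branch safe. The model does not capture timing, so it says nothing about the speedups reported above.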
Branch vanguard: decomposing branch functionality into prediction and resolution instructions. ISCA '15.