ABSTRACT
As power efficiency becomes a paramount concern in processor design, architectures are emerging that do away with hardware branch prediction entirely and rely solely on software branch hinting. A popular example is the Synergistic Processing Unit (SPU) in the IBM Cell processor. Minimizing the branch penalty with branch hint instructions requires not only estimating branch probabilities (which has been studied before [6, 25, 24]) but also inserting the branch hints carefully. To this end, in this paper we i) construct a branch penalty model for the compiler, ii) formulate the problem of minimizing branch penalty using branch hinting, and iii) propose a heuristic to solve this problem. The heuristic is based on three basic techniques that we introduce in this paper: NOP padding, hint pipelining, and nested loop restructuring. Experimental results on several benchmarks show that our solution can reduce the branch penalty by as much as 35.4% over the previous approach.
REFERENCES
- GNU Toolchain 4.1.1 and GDB for the Cell BE's PPU/SPU. http://www.bsc.es/plantillaH.php?cat_id=304.
- IBM Full-System Simulator for Cell BE. http://www.alphaworks.ibm.com/tech/cellsystemsim.
- A. Agarwal and M. Levy. The kill rule for multicore. In Proceedings of the 44th annual Design Automation Conference, DAC '07, pages 750--753, New York, NY, USA, 2007. ACM.
- K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In CODES+ISSS '10: Proceedings of the 23th international symposium on System Synthesis, New York, NY, USA, 2010. ACM Press.
- K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proceedings of the International Conference on Application Specific Systems, Architectures and Processors (ASAP), 2011.
- T. Ball and J. R. Larus. Branch prediction for free. In Proceedings of PLDI, pages 300--313, New York, NY, USA, 1993. ACM.
- M. Briejer, C. Meenderinck, and B. Juurlink. Extending the Cell SPE with energy efficient branch prediction. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, EuroPar'10, pages 304--315, Berlin, Heidelberg, 2010. Springer-Verlag.
- A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing Compiler for the CELL Processor. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT '05, pages 161--172, Washington, DC, USA, 2005. IEEE Computer Society.
- M. Gschwind, H. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro, 26(2):10--24, 2006.
- J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET benchmarks - past, present and future. pages 137--147, Brussels, Belgium, July 2010. OCG.
- H. Hofstee. Power efficient processor architecture and the Cell processor. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pages 258--262, 2005.
- IBM. Cell Broadband Engine Programming Handbook including PowerXCell 8i. https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE7%0D5852575CA0074E8ED.
- IBM. IBM Cell SDK 3.1. http://www.ibm.com/developerworks/power/cell.
- Dual-Core Intel Itanium Processor 9000 and 9100 Series. http://download.intel.com/design/itanium/downloads/314054.pdf, 2007.
- D. A. Jiménez and C. Lin. Dynamic branch prediction with perceptrons. In HPCA '01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, page 197, Washington, DC, USA, 2001. IEEE Computer Society.
- S. C. Jung, A. Shrivastava, and K. Bai. Dynamic code mapping for limited local memory systems. In Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on, pages 13--20, 2010.
- J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49:589--604, July 2005.
- J. Kalamatianos and D. R. Kaeli. Improving the accuracy of indirect branch prediction via branch classification. SIGARCH Comput. Archit. News, 27(1):23--26, 1999.
- D. Kolson, A. Nicolau, and N. Dutt. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. on Comp-aided Design, 15:1354--1363, 1996.
- P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, 25(2):21--29, 2005.
- A. Pabalkar, A. Shrivastava, A. Kannan, and J. Lee. SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories. In HIPC 2008: International Conference on High Performance Computing, pages 569--582, 2008.
- B. Sinharoy and S. W. White. Use of software hint for branch prediction in the absence of hint bit in the branch instruction. http://www.freepatentsonline.com/6971000.html.
- A. S. Stephen, S. Felix, V. Krishnan, and Y. Sazeides. Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In 29th Annual International Symposium on Computer Architecture, pages 295--306, 2002.
- T. A. Wagner, V. Maverick, S. L. Graham, and M. A. Harrison. Accurate static estimators for program optimization. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI '94, pages 85--96, New York, NY, USA, 1994. ACM.
- Y. Wu and J. R. Larus. Static branch frequency and program profile analysis. In Proceedings of the 27th annual international symposium on Microarchitecture, pages 1--11, New York, NY, USA, 1994. ACM.
Index Terms
- Branch penalty reduction on IBM cell SPUs via software branch hinting