DOI:10.1145/2039370.2039425
research-article

Branch penalty reduction on IBM cell SPUs via software branch hinting

Published: 09 October 2011

ABSTRACT

As power efficiency becomes a paramount concern in processor design, architectures are emerging that do away with hardware branch prediction entirely and rely solely on software branch hinting. A popular example is the Synergistic Processing Unit (SPU) in the IBM Cell processor. Minimizing the branch penalty with branch hint instructions requires not only estimating branch probabilities, which has been studied before [6, 24, 25], but also placing the hints carefully. To this end, in this paper we i) construct a branch penalty model for the compiler, ii) formulate the problem of minimizing branch penalty using branch hinting, and iii) propose a heuristic to solve this problem. The heuristic is based on three basic techniques that we introduce in this paper: NOP padding, hint pipelining, and nested loop restructuring. Experimental results on several benchmarks show that our solution can reduce the branch penalty by as much as 35.4% compared with the previous approach.
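The trade-off behind hint placement can be illustrated with a toy penalty model. The sketch below is not the model constructed in the paper: the function names `branch_penalty` and `best_nop_padding` and all three constants are hypothetical. It assumes that NOP padding means inserting NOPs between a hint and its branch to lengthen their separation, that each NOP costs one cycle, and that a hint becomes more effective the farther it sits ahead of the branch.

```python
# Toy model of hint placement. All constants are illustrative assumptions,
# not values taken from the paper or from the SPU documentation.
MISS_PENALTY = 18      # cycles lost when a taken branch is not hinted in time
MIN_SEPARATION = 8     # below this hint-to-branch distance the hint has no effect
FULL_SEPARATION = 16   # at or above this distance the penalty is fully hidden


def branch_penalty(separation: int) -> int:
    """Cycles lost on a taken branch whose hint sits `separation` instructions earlier."""
    if separation < MIN_SEPARATION:
        return MISS_PENALTY
    if separation >= FULL_SEPARATION:
        return 0
    # Partial benefit: every missing instruction of separation costs one cycle.
    return FULL_SEPARATION - separation


def best_nop_padding(separation: int, taken_prob: float) -> int:
    """Number of NOPs to insert between hint and branch to minimise expected cost.

    Each NOP is assumed to cost one cycle on every execution, while the
    branch penalty is only paid when the branch is taken.
    """
    best_n = 0
    best_cost = taken_prob * branch_penalty(separation)
    for n in range(1, FULL_SEPARATION + 1):
        cost = n + taken_prob * branch_penalty(separation + n)
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n
```

Under these assumed numbers, padding only pays off for branches that are likely to be taken: a hint six instructions ahead of a branch taken 90% of the time gains from two NOPs, while the same branch taken 10% of the time is better left unpadded. This is why hint placement and branch probability estimates have to be considered together.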

References

  1. GNU Toolchain 4.1.1 and GDB for the Cell BE's PPU/SPU. http://www.bsc.es/plantillaH.php?cat_id=304.
  2. IBM Full-System Simulator for Cell BE. http://www.alphaworks.ibm.com/tech/cellsystemsim.
  3. A. Agarwal and M. Levy. The kill rule for multicore. In Proceedings of the 44th annual Design Automation Conference, DAC '07, pages 750--753, New York, NY, USA, 2007. ACM.
  4. K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In CODES+ISSS '10: Proceedings of the 23rd international symposium on System Synthesis, New York, NY, USA, 2010. ACM Press.
  5. K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proceedings of the International Conference on Application Specific Systems, Architectures and Processors (ASAP), 2011.
  6. T. Ball and J. R. Larus. Branch prediction for free. In Proceedings of PLDI, pages 300--313, New York, NY, USA, 1993. ACM.
  7. M. Briejer, C. Meenderinck, and B. Juurlink. Extending the Cell SPE with energy efficient branch prediction. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, EuroPar'10, pages 304--315, Berlin, Heidelberg, 2010. Springer-Verlag.
  8. A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind. Optimizing Compiler for the CELL Processor. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, PACT '05, pages 161--172, Washington, DC, USA, 2005. IEEE Computer Society.
  9. M. Gschwind, H. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro, 26(2):10--24, 2006.
  10. J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET benchmarks - past, present and future. pages 137--147, Brussels, Belgium, July 2010. OCG.
  11. H. Hofstee. Power efficient processor architecture and the Cell processor. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pages 258--262, 2005.
  12. IBM. Cell Broadband Engine Programming Handbook including PowerXCell 8i. https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE7%0D5852575CA0074E8ED.
  13. IBM. IBM Cell SDK 3.1. http://www.ibm.com/developerworks/power/cell.
  14. Dual-Core Intel Itanium Processor 9000 and 9100 Series. http://download.intel.com/design/itanium/downloads/314054.pdf, 2007.
  15. D. A. Jiménez and C. Lin. Dynamic branch prediction with perceptrons. In HPCA '01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, page 197, Washington, DC, USA, 2001. IEEE Computer Society.
  16. S. C. Jung, A. Shrivastava, and K. Bai. Dynamic code mapping for limited local memory systems. In Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on, pages 13--20, 2010.
  17. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49:589--604, July 2005.
  18. J. Kalamatianos and D. R. Kaeli. Improving the accuracy of indirect branch prediction via branch classification. SIGARCH Comput. Archit. News, 27(1):23--26, 1999.
  19. D. Kolson, A. Nicolau, and N. Dutt. Elimination of redundant memory traffic in high-level synthesis. IEEE Trans. on Comp-aided Design, 15:1354--1363, 1996.
  20. P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, 25(2):21--29, 2005.
  21. A. Pabalkar, A. Shrivastava, A. Kannan, and J. Lee. SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories. In HIPC 2008: International Conference on High Performance Computing, pages 569--582, 2008.
  22. B. Sinharoy and S. W. White. Use of software hint for branch prediction in the absence of hint bit in the branch instruction. http://www.freepatentsonline.com/6971000.html.
  23. A. S. Stephen, S. Felix, V. Krishnan, and Y. Sazeides. Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In 29th Annual International Symposium on Computer Architecture, pages 295--306, 2002.
  24. T. A. Wagner, V. Maverick, S. L. Graham, and M. A. Harrison. Accurate static estimators for program optimization. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI '94, pages 85--96, New York, NY, USA, 1994. ACM.
  25. Y. Wu and J. R. Larus. Static branch frequency and program profile analysis. In Proceedings of the 27th annual international symposium on Microarchitecture, pages 1--11, New York, NY, USA, 1994. ACM.

    • Published in

      CODES+ISSS '11: Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
      October 2011
      402 pages
      ISBN:9781450307154
      DOI:10.1145/2039370

      Copyright © 2011 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 280 of 864 submissions, 32%
