Exploring branch target buffer access filtering for low-energy and high-performance microarchitectures

IET Computers & Digital Techniques

Superscalar and simultaneous multi-threading (SMT) processors employ powerful branch predictors together with a large branch target buffer (BTB) to exploit instruction-level and thread-level parallelism. However, the large BTB not only dominates the predictor's energy consumption, but also becomes a major roadblock to achieving faster clock frequencies at deep sub-micron technologies. The authors propose a filtering scheme that dramatically reduces the number of BTB accesses, significantly lowering BTB energy consumption while maintaining performance. For a simulated superscalar microprocessor, the experimental evaluation shows that the BTB access filtering (BAF) design achieves an 88.5% dynamic energy reduction with negligible performance loss. The authors also study leakage behaviour and its control in the BAF design; the results show that applying a drowsy strategy achieves very effective leakage control. For high-performance designs, the BAF can also improve the BTB's performance scalability at new technologies. For the SMT environment, the authors evaluate the effectiveness of the BAF design and propose a banked BAF (BK-BAF) scheme to further reduce energy consumption and performance overhead. The experimental results confirm that the BK-BAF scheme can be an energy/performance-effective design for next-generation SMT processors.
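The abstract does not describe how the filter works internally, so the following is only a toy illustration of the general idea of access filtering, not the authors' BAF design. It assumes a hypothetical one-bit-per-entry filter indexed by the fetch address: a clear bit means no branch has been recorded for that address, so the BTB lookup (and its dynamic energy) can be skipped. All names (`FilteredBTB`, `maybe_branch`, `run_loop`) and the sizes used are assumptions made for this sketch.

```python
class FilteredBTB:
    """Toy BTB guarded by a small filter (hypothetical, for illustration only)."""

    def __init__(self, filter_bits=1024):
        self.filter_bits = filter_bits            # assumed filter size
        self.maybe_branch = [False] * filter_bits  # one bit per filter entry
        self.btb = {}                              # pc -> predicted target
        self.fetches = 0                           # total fetch addresses seen
        self.btb_accesses = 0                      # lookups that reached the BTB

    def _index(self, pc):
        # Assumes 4-byte instructions and a direct-mapped filter.
        return (pc >> 2) % self.filter_bits

    def lookup(self, pc):
        """Return the predicted target, or None if filtered out or not present."""
        self.fetches += 1
        if not self.maybe_branch[self._index(pc)]:
            return None                            # filtered: BTB is not accessed
        self.btb_accesses += 1
        return self.btb.get(pc)

    def update(self, pc, target):
        """Record a resolved taken branch in both the filter and the BTB."""
        self.maybe_branch[self._index(pc)] = True
        self.btb[pc] = target


def run_loop(iterations=100):
    """Fetch an 8-instruction loop body whose last instruction branches back."""
    unit = FilteredBTB()
    branch_pc, target = 0x101C, 0x1000
    for _ in range(iterations):
        for pc in range(0x1000, 0x1020, 4):
            predicted = unit.lookup(pc)
            if pc == branch_pc and predicted is None:
                unit.update(pc, target)            # branch resolves, train the unit
    return unit


unit = run_loop()
print(unit.fetches, unit.btb_accesses)
```

In this toy loop only the single branch address reaches the BTB after the first iteration, so most fetches never touch it. The filter is conservative: aliasing can cause unnecessary BTB accesses, but a clear bit never hides an entry that the BTB actually holds. The 88.5% figure reported in the abstract comes from the authors' own evaluation and is unrelated to the counts produced here.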

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cdt.2010.0102