ABSTRACT
Energy, power, and area efficiency are critical design concerns for embedded processors. Much of the energy of a typical embedded processor is consumed in the front-end, since instruction fetching happens on nearly every cycle and involves accesses to large memory arrays such as instruction and branch target caches. The use of small front-end arrays leads to significant power and area savings, but typically results in significant performance degradation. This paper evaluates and compares optimizations that improve the performance of embedded processors with small front-end caches. We examine both software techniques, such as instruction re-ordering and selective caching, and hardware techniques, such as instruction prefetching, tagless instruction caches, and unified caches for instructions and branch targets. We demonstrate that, when built on top of a block-aware instruction set, these optimizations can eliminate the performance degradation due to small front-end caches. Moreover, selective combinations of these optimizations lead to an embedded processor that performs significantly better than the large cache design while maintaining the area and energy efficiency of the small cache design.
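The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's simulator or methodology: it assumes a direct-mapped instruction cache, a hypothetical loop trace, and an idealized next-line prefetcher whose prefetches always arrive in time. The function names, cache sizes, and trace are all made up for illustration.

```python
def loop_trace(start, n_instr, iterations, instr_bytes=4):
    """Hypothetical fetch trace: a straight-line loop body executed repeatedly."""
    return [start + i * instr_bytes
            for _ in range(iterations)
            for i in range(n_instr)]


def simulate(trace, num_lines, line_bytes=32, next_line_prefetch=False):
    """Count misses in a direct-mapped instruction cache.

    Assumption: a prefetch of the next sequential line completes instantly,
    which is optimistic compared with real hardware.
    """
    tags = [None] * num_lines          # one tag per cache line
    misses = 0
    for addr in trace:
        block = addr // line_bytes     # cache-line-sized block number
        idx = block % num_lines        # direct-mapped index
        if tags[idx] != block:
            misses += 1
            tags[idx] = block          # fill the line on a miss
        if next_line_prefetch:
            nxt = block + 1
            tags[nxt % num_lines] = nxt  # bring in the next sequential line
    return misses


# A 256-instruction loop (1 KB of code) executed 10 times.
trace = loop_trace(start=0, n_instr=256, iterations=10)

large = simulate(trace, num_lines=64)                                    # 2 KB cache
small = simulate(trace, num_lines=16)                                    # 512 B cache
small_prefetch = simulate(trace, num_lines=16, next_line_prefetch=True)  # 512 B + prefetch

print(large, small, small_prefetch)
```

In this toy setting the loop fits in the large cache but not in the small one, so the small cache misses on every line in every iteration, while sequential prefetching hides most of those misses. Real workloads, branch behavior, and prefetch latency would change the picture considerably.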
Index Terms
- A low power front-end for embedded processors using a block-aware instruction set