research-article

B2P2: bounds based procedure placement for instruction TLB power reduction in embedded systems

Authors:

Reiley Jeyapaul,

Aviral Shrivastava

SCOPES '10: Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems

Article No.: 2, Pages 1 - 10

https://doi.org/10.1145/1811212.1811215

Published: 28 June 2010

Abstract

High-performance embedded processors are equipped with a Translation Look-aside Buffer (TLB), which is a key ingredient of efficient and speedy virtual memory management. The TLB, though small, is frequently accessed; it therefore not only consumes significant energy but is also one of the important thermal hot-spots in the processor. Among the many circuit and microarchitectural techniques proposed to reduce TLB power consumption, the Use-Last TLB is a very efficient technique in which power is consumed only when different pages are accessed in succession, i.e., when there is a page-switch [26]. Though the Use-Last technique is effective in reducing i-TLB power, there is scope to further improve its effectiveness by changing the relative code placement of the program. In this work, we formulate the code placement problem of minimizing the page-switches in a program. We prove that this problem is NP-complete and propose a Bounds Based Procedure Placement (B2P2) heuristic to efficiently reduce the program's page-switches. Our procedure placement technique delivers an average 76% reduction in instruction-TLB power with negligible (< 2%) impact on performance, over and above the reduction achieved by the Use-Last TLB architecture alone.
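To make the page-switch metric concrete, the following is a minimal Python sketch of how the number of page-switches in an instruction-fetch trace depends on procedure order. It is not the B2P2 heuristic from the paper. The 4 KB page size, the helper names `layout` and `count_page_switches`, and the toy procedures `a`, `b`, and `c` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative, not from the paper)


def layout(order: Sequence[str], sizes: Dict[str, int]) -> Dict[str, int]:
    """Pack procedures back to back in the given order; return start addresses."""
    start: Dict[str, int] = {}
    addr = 0
    for name in order:
        start[name] = addr
        addr += sizes[name]
    return start


def count_page_switches(
    trace: List[Tuple[str, int]],
    start: Dict[str, int],
    page_size: int = PAGE_SIZE,
) -> int:
    """Count consecutive fetches that land on different pages.

    Each trace entry is a hypothetical (procedure, byte offset) fetch.
    """
    switches = 0
    prev_page = None
    for proc, offset in trace:
        page = (start[proc] + offset) // page_size
        if prev_page is not None and page != prev_page:
            switches += 1
        prev_page = page
    return switches


# Toy program: procedure a repeatedly calls procedure c; b is never executed.
sizes = {"a": 3000, "b": 3000, "c": 2000}
trace = [("a", 0), ("c", 0)] * 3

far = count_page_switches(trace, layout(["a", "b", "c"], sizes))   # c starts on page 1
near = count_page_switches(trace, layout(["a", "c", "b"], sizes))  # c starts on page 0
print(far, near)  # prints "5 0"
```

In this toy case, placing `c` directly after `a` keeps both entry points on the same page, so a Use-Last style TLB would see no page-switches for this trace. Choosing such an order for a whole program is the placement problem the abstract describes as NP-complete.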

References

[1] ARM. ARMv5 Architecture Reference Manual, 2007.

[2] D. F. Bacon, J.-H. Chow, D.-c. R. Ju, K. Muthukumar, and V. Sarkar. A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness. In CASCON '94: Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research, page 3. IBM Press, 1994.

[3] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Comput. Archit. News, 25(3):13–25, 1997.

[4] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler optimizations for improving data locality. SIGOPS Oper. Syst. Rev., 28(5):252–262, 1994.

[5] M. Cekleov and M. Dubois. Virtual-address caches part 1: Problems and solutions in uniprocessors. IEEE Micro, 17(5):64–71, 1997.

[6] Y.-J. Chang. An ultra low-power TLB design. In DATE '06: Proceedings of the conference on Design, automation and test in Europe, pages 1122–1127, Leuven, Belgium, 2006. European Design and Automation Association.

[7] A. Chiyonobu and T. Sato. Energy-efficient instruction scheduling utilizing cache miss information. In MEDEA '05: Proceedings of the 2005 workshop on MEmory performance, pages 65–70, Washington, DC, USA, 2005. IEEE Computer Society.

[8] J.-H. Choi, J.-H. Lee, S.-W. Jeong, S.-D. Kim, and C. Weems. A low power TLB structure for embedded systems. Computer Architecture Letters, 1(1):3, January-December 2002.

[9] L. T. Clark, B. Choi, and M. Wilkerson. Reducing translation lookaside buffer active power. In ISLPED '03, pages 10–13, New York, NY, USA, 2003. ACM Press.

[10] V. Delaluz, M. Kandemir, N. Vijaykrishnan, M. Irwin, A. Sivasubramaniam, and I. Kolcu. Compiler-directed array interleaving for reducing energy in multi-bank memories. In ASP-DAC '02, pages 288–293, 2002.

[11] B. Egger, J. Lee, and H. Shin. Dynamic scratchpad memory management for code in portable systems with an MMU. ACM Trans. Embed. Comput. Syst., 7(2):1–38, 2008.

[12] M. Ekman, P. Stenström, and F. Dahlgren. TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors. In ISLPED '02: Proceedings of the 2002 international symposium on Low power electronics and design, pages 243–246, New York, NY, USA, 2002. ACM.

[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

[14] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC '01: Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on, pages 3–14, Washington, DC, USA, 2001. IEEE Computer Society.

[15] C. X. Huang, B. Zhang, A.-C. Deng, and B. Swirski. The design and implementation of PowerMill. In ISLPED '95: Proceedings of the 1995 international symposium on Low power design, pages 105–110, New York, NY, USA, 1995. ACM.

[16] X. Huang, B. T. Lewis, and K. S. McKinley. Dynamic code management: Improving whole program code locality in managed runtimes. In VEE '06: Proceedings of the 2nd international conference on Virtual execution environments, pages 133–143, New York, NY, USA, 2006. ACM.

[17] Intel Corporation. Intel XScale® Technology Overview, 2000.

[18] R. Jeyapaul, S. Marathe, and A. Shrivastava. Code transformations for TLB power reduction. In VLSID '09: Proceedings of the 2009 22nd International Conference on VLSI Design, pages 413–418, Washington, DC, USA, 2009. IEEE Computer Society.

[19] I. Kadayif, P. Nath, M. Kandemir, and A. Sivasubramaniam. Compiler-directed physical address generation for reducing DTLB power. In ISPASS '04, pages 161–168, Washington, DC, USA, 2004. IEEE Computer Society.

[20] I. Kadayif, A. Sivasubramaniam, M. Kandemir, G. Kandiraju, and G. Chen. Optimizing instruction TLB energy using software and hardware techniques. ACM Trans. Des. Autom. Electron. Syst., 10(2):229–257, 2005.

[21] M. Kandemir, I. Kadayif, and G. Chen. Compiler-directed code restructuring for reducing data TLB energy. In CODES+ISSS '04, pages 98–103, Washington, DC, USA, 2004. IEEE Computer Society.

[22] J.-H. Lee, G.-H. Park, S.-B. Park, and S.-D. Kim. A selective filter-bank TLB system. In ISLPED '03: Proceedings of the 2003 international symposium on Low power electronics and design, pages 312–317, New York, NY, USA, 2003. ACM.

[23] S. Manne, A. Klauser, D. Grunwald, and F. Somenzi. Low power TLB design for high performance microprocessors, 1997.

[24] MIPS Technologies. MIPS32® Architecture, 2003.

[25] A. Parikh, S. Kim, M. Kandemir, N. Vijaykrishnan, and M. Irwin. Instruction scheduling for low power. In VLSI-SP '04, pages 129–149, 2004.

[26] J. R. Haigh, M. Wilkerson, J. Miller, T. Beatty, S. Strazdus, and L. Clark. A low-power 2.5 GHz 90 nm level 1 cache and memory management unit. In IEEE Journal of Solid State Circuits, pages 1190–1199. IEEE Press, 2004.

[27] H. Tomiyama and H. Yasuura. Code placement techniques for cache miss rate reduction. ACM Trans. Des. Autom. Electron. Syst., 2(4):410–429, 1997.

Cited By

Mittal S. (2016). A survey of techniques for architecting TLBs. Concurrency and Computation: Practice and Experience, 29:10. Online publication date: 22-Dec-2016.
https://doi.org/10.1002/cpe.4061

Index Terms

B2P2: bounds based procedure placement for instruction TLB power reduction in embedded systems
1. Computer systems organization
  1. Architectures


Published In


SCOPES '10: Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems

June 2010

91 pages

ISBN:9781450300841

DOI:10.1145/1811212

General Chair: Ed Deprettere, Leiden University, NL

Program Chair: Todor Stefanov, Leiden University, NL

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

EDAA: European Design Automation Association

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Funding Sources

Division of Computing and Communication Foundations

Conference

SCOPES '10

Sponsor:

EDAA

SCOPES '10: 13th International Workshop on Software and Compilers for Embedded Systems

June 28 - 29, 2010

St. Goar, Germany

Acceptance Rates

Overall Acceptance Rate 38 of 79 submissions, 48%


Bibliometrics

Total Citations: 1
Total Downloads: 120
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 08 Mar 2025
