Stack filter: Reducing L1 data cache power consumption

https://doi.org/10.1016/j.sysarc.2010.10.002

Abstract

The L1 data cache is one of the most frequently accessed structures in the processor. Because of this, and given its moderate size, it is a major consumer of power. To reduce its power consumption, this paper proposes a small filter structure that exploits the special features of references to the stack region. This filter, which acts as a non-inclusive top level of the data memory hierarchy, consists of a register set that keeps the data stored in the neighborhood of the top of the stack. Our simulation results show that with a small Stack Filter (SF) of just a few registers, data cache power savings of 10–25% can be achieved on average, with a negligible performance penalty.

Introduction

Continuous technological improvements in the microprocessor field drive the trend towards more sophisticated chips. Nevertheless, this comes at the expense of a significant increase in power consumption, and it is well known to all architects that the main goal in current designs is to deliver high performance and low power consumption simultaneously. This is why many researchers have focused their efforts on reducing overall power dissipation. Power dissipation is spread across different structures including caches, register files, the branch predictor, etc. However, on-chip caches by themselves can consume over 40% of a chip’s overall power [1], [2].

One alternative to mitigate this effect is to partition caches into several smaller caches [3], [4], [5], with the implied reduction in both access time and power cost per access. Another design, known as the filter cache [6], trades performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, similar in size and structure to a typical L1 cache, is placed behind the filter cache to minimize the performance loss. A different alternative, named selective cache ways [7], provides the ability to disable a subset of the ways in a set-associative cache during periods of modest cache activity, whereas the full cache remains operational for more cache-intensive periods. Loop caches [8] are another proposal to save power, consisting of a direct-mapped data array and a loop cache controller. The loop cache controller knows precisely, well ahead of time, whether the next data-requesting instruction will hit in the loop cache; as a result, there is no performance degradation. Yet another approach takes advantage of the special behavior of memory references: the conventional unified data cache is replaced with multiple specialized caches, each handling a different kind of memory reference according to its particular locality characteristics. Examples of this approach are [9], [10], both exploiting the locality exhibited by stack accesses. These alternatives mainly target performance improvements. It is important to highlight that these approaches are just some of the existing proposals in the field of cache design.

In this paper we propose a different approach that also exploits the special features of stack references. The novelty resides in the fact that we do not employ a specialized cache for handling stack accesses; instead, we use a straightforward, small structure that records a few words in the neighborhood of the stack pointer and acts as a filter: if the referenced data falls in the range stored in this filter, we avoid an unnecessary access to the L1 data cache. Otherwise, we perform the access as in conventional designs. This way, although the IPC remains largely unchanged, we are able to significantly reduce the power consumption of the critical data cache structure with negligible extra hardware. We target a high-performance embedded processor [11] as the platform to evaluate our proposal, but the technique is likewise applicable to CMPs.
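To make the mechanism concrete, the following minimal C sketch models the per-reference check. It is our own illustration under assumed parameters (SF_WORDS, 32-bit word granularity, a contiguous window), not the authors' implementation: if the effective address falls inside the window of words the filter keeps around the stack pointer, the access to the L1 data cache is skipped entirely.

    #include <stdbool.h>
    #include <stdint.h>

    #define SF_WORDS 16  /* assumed filter capacity in 32-bit words (8-32 in the paper) */

    typedef struct {
        uint32_t base;             /* address of the lowest word currently held   */
        uint32_t data[SF_WORDS];   /* register set mirroring the top of the stack */
        bool     valid[SF_WORDS];  /* per-word valid bits                         */
    } stack_filter_t;

    /* Returns true (filter hit) when the word at addr is held by the filter;
     * on a hit the L1 data cache is not accessed at all. */
    static bool sf_lookup(const stack_filter_t *sf, uint32_t addr, uint32_t *out)
    {
        if (addr < sf->base)
            return false;                      /* below the filtered window  */
        uint32_t idx = (addr - sf->base) >> 2; /* word offset from base      */
        if (idx < SF_WORDS && sf->valid[idx]) {
            *out = sf->data[idx];              /* serviced by the filter     */
            return true;
        }
        return false;                          /* fall through to L1         */
    }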

This work is organized as follows: Section 2 describes the special characteristics of stack references; Section 3 explains our proposed filter and its implementation details; Section 4 describes the setup we have used to evaluate our proposal; Section 5 presents and analyzes the experimental results; Section 6 discusses related work and, finally, Section 7 concludes.

Section snippets

Stack access behavior

Typical programs employ special private memory regions, known as stacks, to store data temporarily. Our target architecture has some call-preserved registers: after a function call ends, these registers must hold the same value as when the call began. To achieve this, a function call will store (or push) on top of the stack those call-preserved registers that will be written during the routine. Once the function ends, the registers will be reloaded (or popped) from the stack, the routine will return and
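As a hedged illustration of this pattern, the toy C model below (names and values are ours, not taken from the paper) mimics the prologue/epilogue traffic: the same few words just below the top of the stack are written on function entry and read back shortly afterwards on exit, which is exactly the locality a stack filter captures.

    #include <stdint.h>

    /* Toy model of the traffic around a call: the callee saves the
     * call-preserved registers it will overwrite and restores them on
     * return, so the same few words next to the stack pointer are
     * written and then read back shortly afterwards. */
    static uint32_t stack_mem[1024];
    static uint32_t sp = 1024;          /* word-granularity stack pointer */

    static void     push(uint32_t reg) { stack_mem[--sp] = reg; }  /* store on entry */
    static uint32_t pop(void)          { return stack_mem[sp++]; } /* reload on exit */

    void callee(void)
    {
        uint32_t r4 = 0x1234, r5 = 0x5678;  /* call-preserved registers       */
        push(r4); push(r5);                 /* prologue: spill to stack top   */
        r4 = 0; r5 = 0;                     /* routine body overwrites them   */
        r5 = pop(); r4 = pop();             /* epilogue: reload in LIFO order */
        (void)r4; (void)r5;
    }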

Stack filter design

Many authors have realized that the strong locality of stack references makes them benefit greatly from caching. Our proposal [11] differs from those discussed in Section 6 in three main respects:

  1. We implement a very small filter, of only 8–32 words.

  2. Our proposal places the filter before the $dl1 cache in terms of hierarchy. This has to be done under a major constraint: the access to $dl1 cannot be delayed. Therefore, we must avoid any extra penalty on misses (one way to honor this constraint is sketched below). To deal with this constraint, we include
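Building on the sf_lookup sketch above, the following fragment illustrates one way the no-added-latency constraint could be honored, under our own assumption that the filter and the L1 lookup are started in parallel and the L1 data-array read is simply squashed on a filter hit. Here l1_dcache_read is a hypothetical helper, and this is a sketch rather than the paper's actual mechanism.

    /* Assumed access path: the filter probe and the L1 lookup start in
     * the same cycle; on a filter hit the L1 data-array read is gated
     * off, so a filter miss costs exactly the usual L1 latency. */
    extern uint32_t l1_dcache_read(uint32_t addr);  /* hypothetical L1 model */

    uint32_t load_word(stack_filter_t *sf, uint32_t addr)
    {
        uint32_t value;
        if (sf_lookup(sf, addr, &value))  /* probed in parallel with L1   */
            return value;                 /* L1 access squashed: power saved */
        return l1_dcache_read(addr);      /* miss path unchanged, no penalty */
    }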

Experimental environment

We have simulated our proposed SF on the target platform using Sim-Panalyzer [14], a simulation tool built on top of SimpleScalar/ARM [15]. By default, it is configured to faithfully model the SA-110 StrongARM processor. To evaluate the efficiency of our data cache filtering, we use Sim-Panalyzer with the parameters listed in Table 1.

For the simulations of our proposed filter in this ARM system, we have used several applications taken from MiBench [16]: adpcm, bitcount, CRC32, cjpeg,

Evaluation

The purpose of our evaluation is to show that, using a stack filter, a reasonable amount of power can be saved with a negligible impact on performance. We start with a comparison among three different filter sizes for an ARM architecture. Afterwards, we compare the filter usage for two different target architectures, namely ARM and x86. Finally, we compare the stack filter with the filter cache [6] and region-based caching [21] to show that for small sizes it achieves greater power savings
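As a rough back-of-the-envelope model of where the savings come from (our own sketch with assumed, normalized energy numbers, not the paper's Sim-Panalyzer measurements), every reference pays the small filter-probe cost, and only filter misses pay the full L1 access:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, normalized per-access energies; real figures would
         * come from the power simulator, not from this sketch. */
        const double E_L1     = 1.00;  /* one L1 D-cache access              */
        const double E_FILTER = 0.05;  /* one probe of the tiny register set */
        const double hit_rate = 0.30;  /* fraction of references hitting SF  */

        /* The filter is probed on every reference; only misses pay for L1. */
        double baseline = E_L1;
        double with_sf  = E_FILTER + (1.0 - hit_rate) * E_L1;

        printf("per-access energy savings: %.1f%%\n",
               100.0 * (baseline - with_sf) / baseline);  /* 25.0% here */
        return 0;
    }

With these illustrative numbers, a 30% filter hit rate yields 25% per-access savings, which happens to land near the top of the 10–25% range reported in the abstract; the point of the sketch is only that savings grow with the fraction of references the filter captures.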

Related work

Many researchers have realized that the memory stack region exhibits a special locality clearly distinct from that of other data regions, and they have therefore decided to take advantage of this behavior. In the early 1980s, when the trend in microprocessor design was to increase system complexity in pursuit of high-level language support, Ditzel and McLellan [22] proposed removing the register file and including a stack cache, to favor memory–memory architectures and to avoid the complex

Conclusions

Every program has a stack to hold the status of the machine across function calls. Among the special features of this stack is locality: the nearer a word is to the TOS, the higher its locality. Caches naturally exploit this locality. While there are several proposals that use a stack cache to increase performance, we believe that a stack filter aimed at reducing power consumption is a simple way to deal with the power hunger of cache memories.

In this paper we have shown that a

Acknowledgements

This work was supported in part by the Spanish government through Research Contract CICYT-TIN 2008/508, Consolider Ingenio2010 2007/2011, and by the HiPEAC2 European Network of Excellence. It was also supported by an FPU grant from the Spanish Ministry of Education.
References (27)

  • J.E. Fritts et al., MediaBench II video: expediting the next generation of video systems research, Microprocessors and Microsystems (2009)
  • M.A. Viredaz et al., Power evaluation of a handheld computer, IEEE Micro (2003)
  • J. Montanaro, A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor, Digital Technical Journal (1997)
  • K. Ghose, M.B. Kamble, Reducing power in superscalar processor caches using sub-banking, multiple line buffers and...
  • C.L. Su, A.M. Despain, Cache designs for energy-efficiency, in: Hawaii International Conference on Systems Sciences,...
  • P. Racunas, Y.N. Patt, Partitioned first-level cache design for clustered microarchitectures, in: International...
  • J. Kin et al., The filter cache: an energy efficient memory structure, in: International Symposium on Microarchitecture (1997)
  • D. Albonesi, Selective cache ways: on-demand cache resource allocation, Journal of Instruction-Level Parallelism...
  • L.H. Lee, B. Moyer, J. Arends, Instruction fetch energy reduction using loop caches for embedded applications with...
  • H. Lee, M. Smelyanskiy, C. Newburn, G. Tyson, Stack value file: custom microarchitecture for the stack, in:...
  • S. Cho, P. Yew, G. Lee, Decoupling local variable accesses in a wide-issue superscalar processor, in: International...
  • R. Gonzalez-Alberquilla, F. Castro, L. Pinuel, F. Tirado, Stack oriented data cache filtering, in: International...
  • Ward et al., Computation Structures (2002)
R. Gonzalez-Alberquilla obtained the MS degree in computer engineering from the Complutense University of Madrid in 2007. He is currently a Ph.D. student in the Department of Computer Architecture at the Complutense University. His research interests include energy-aware processor design and efficient memory management. His recent activities have focused on filtering accesses to the stack to reduce power consumption, and on using coherence information for debuggability purposes.

F. Castro obtained the MS degree in physics from the University of Santiago de Compostela in 2000, and the MS degree in electrical and computer engineering and the Ph.D. degree in computer science from the Complutense University of Madrid in 2004 and 2008, respectively. He is now a teaching assistant of physics, electrical and computer engineering, and computer science at the Complutense University. His research interests include energy-aware processor design and efficient memory management. His recent activities have focused on the LSQ structure, exploring new techniques to reduce its energy consumption without affecting performance. He is a member of the IEEE.

L. Pinuel received the Ph.D. degree in computer science from the Universidad Complutense de Madrid (UCM). He is an associate professor in the Department of Computer Architecture, UCM. His research interests include computer architecture, high-performance computing, compiling for novel architectures, low-power microarchitectures, embedded systems, and resource management for emerging computing systems. He is a member of the IEEE and the IEEE Computer Society.

F. Tirado received the BS degree in applied physics and the Ph.D. degree in physics from the Universidad Complutense de Madrid (UCM) in 1973 and 1977, respectively. He is a professor of computer architecture and technology at UCM. He has worked on computer architecture, parallel processing, and design automation. His current research areas are parallel algorithms and architectures, and processor design. He is a coauthor of more than 200 publications. He has served in the organization of more than 60 international conferences and has held various positions such as dean of the Physics Science and Electronic Engineering Faculty, general manager of the Spanish National Program for Robotics and Advanced Automation, and member and chair of the research evaluation committee of Spain. He is now the director of the Center for SuperComputation (CSC) and Madrid Science Park, a member of the Informatics Advisory Board of the UCM, and an adviser of the National Agency for Research and Development (CICYT). He is a senior member of the IEEE, the IEEE Computer Society, and of several European institutions and committees.