Dynamic and adaptive SPM management for a multi-task environment

https://doi.org/10.1016/j.sysarc.2010.11.002

Abstract

In this paper, we present a dynamic and adaptive scratchpad memory (SPM) management strategy targeting a multi-task environment. It can be applied to a contemporary embedded processor that maps the physically addressed SPM into a virtual space with the help of an integrated memory management unit (MMU). Based on mass-count disparity, we introduce a hardware memory reference sampling unit (MRSU) that samples the memory reference stream with very low probability. A captured address is assumed to belong to a frequently referenced memory block. The MRSU raises a hardware interrupt, and software places the identified block into the SPM. The software also modifies the page table so that subsequent accesses to the block are redirected to the SPM. Because it does not depend on compiler support or profiling information, our strategy is particularly well suited to SPM management in a multi-task environment, where a real-time operating system (RTOS) is usually hosted and memory access behavior cannot be predicted by static analysis or profiling. We evaluate our SPM allocation strategy by running several tasks on a tiny RTOS with preemptive scheduling. Experimental results show that our approach achieves a 10% reduction in energy consumption on average, with 1% performance degradation at runtime, compared with a cache-only reference system.
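
The C sketch below models the sampling idea in software. It is a minimal illustration only: the 1/4096 sampling rate, the LFSR-based pseudo-random source, and all function names are assumptions made for this sketch, not details of the actual MRSU hardware.

    /* Illustrative software model of low-probability reference sampling.
     * The real MRSU is a hardware unit; the names and the 1/4096 rate
     * below are assumptions for this sketch. */
    #include <stdint.h>
    #include <stdbool.h>

    static uint32_t lfsr = 0xACE1u;            /* simple pseudo-random source */

    static uint32_t lfsr_next(void)
    {
        /* one step of a 32-bit Galois LFSR */
        uint32_t mask = (lfsr & 1u) ? 0x80200003u : 0u;
        lfsr = (lfsr >> 1) ^ mask;
        return lfsr;
    }

    /* Return true for roughly 1 out of every 4096 references. */
    static bool sample_reference(uint32_t addr)
    {
        (void)addr;                            /* the address is not used to decide */
        return (lfsr_next() & 0x0FFFu) == 0;
    }

    /* Called (conceptually) on every memory reference: because frequently
     * accessed blocks contribute most of the references (mass-count
     * disparity), a sampled address most likely falls inside a hot block. */
    void on_memory_reference(uint32_t addr)
    {
        if (sample_reference(addr)) {
            /* raise_mrsu_interrupt(addr); -- hypothetical hook into the OS handler */
        }
    }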

Introduction

Today’s embedded devices are becoming increasingly powerful as processor performance continues to improve. An embedded system usually hosts an RTOS and runs a number of sophisticated applications, such as handwriting recognition, network communication, and general media processing. Inevitably, these devices also consume more power than devices with fewer features. Therefore, both manufacturers and users are trying to reduce power consumption so as to extend battery life.

Many embedded processors employ small, software-controlled on-chip memories to reduce the energy consumption of the memory system. These software-controlled on-chip memories, frequently referred to as scratchpad memories (SPMs), have been shown to be both performance and power efficient compared with their hardware-managed cache counterparts [31]. As shown in Fig. 1, the main differences between a conventional cache and an SPM are the following: (1) An SPM is a plain memory, consisting of a simple array of SRAM cells and an address decoder. In contrast, a cache includes complex comparator and management logic, as well as tag arrays in addition to data arrays. (2) An SPM is mapped into a region of the memory address space disjoint from the off-chip memory, whereas a cache is transparent to software. (3) Deciding which parts of code/data should be placed in the SPM is the responsibility of users or compilers, whereas a cache is managed by hardware. (4) An SPM guarantees single-cycle access latency, whereas a cache access may incur longer latency due to cache misses and off-chip accesses. In short, placing the most frequently referenced code/data blocks of an application in the SPM can reduce both energy consumption and execution time. Furthermore, an SPM offers better timing predictability and is therefore widely accepted as an alternative to caches in real-time embedded systems.
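
To make points (2) and (3) concrete, the following C sketch shows the kind of explicit placement an SPM requires. The base address, size, and the idea of copying a hot lookup table on-chip are illustrative assumptions rather than the configuration of any specific processor.

    /* Minimal sketch of manual SPM placement, assuming the SPM is mapped at a
     * fixed address (SPM_BASE) disjoint from off-chip DRAM. The address and
     * size below are made up for illustration. */
    #include <stdint.h>
    #include <string.h>

    #define SPM_BASE  0x40000000u     /* assumed on-chip SPM location */
    #define SPM_SIZE  (8u * 1024u)    /* assumed 8 KB scratchpad */

    static const uint8_t hot_table[1024] = { 0 };   /* frequently read data */

    void place_hot_table_in_spm(void)
    {
        uint8_t *spm = (uint8_t *)(uintptr_t)SPM_BASE;

        /* The programmer (or compiler) decides what lives in the SPM and
         * copies it there; a cache would do this transparently in hardware. */
        memcpy(spm, hot_table, sizeof hot_table);

        /* Subsequent accesses must go through the SPM address range to obtain
         * the guaranteed single-cycle latency. */
    }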

Consequently, researchers on memory management have considered the SPM as an alternative to cache-based architectures. Most previous SPM research has focused on designing good SPM allocation strategies that keep the most frequently accessed data/instructions on chip, leading to optimized execution time and energy consumption [3], [4], [20], [25], [26]. However, most of the previous work assumes that the target system consists of only one task. Under this assumption, memory access behavior is predictable and can be determined before the program runs; the hot spots of the memory space can be identified and located through profiling or with the assistance of compilers/programmers. Although this single-task assumption makes sense in certain application domains, there are also many cases where multiple tasks need to share the same SPM space [19]. On one hand, today’s portable devices perform an ever increasing number of functions undertaken by several tasks; tasks are created and destroyed on the user’s demand at arbitrary times and may run in any order. On the other hand, owing to the Internet and portable storage media, more and more embedded applications can be downloaded and updated after deployment, and this trend is likely to continue in the future. When these programs are compiled, the compiler knows nothing about the SPM configuration of the eventual running environment. Thus, a dynamic and adaptive SPM management strategy is required to take full advantage of the on-chip memory.

In this paper, we focus mainly on the data side of SPM utilization and propose a dynamic and adaptive SPM management strategy for multi-task systems. The strategy is an SPM allocation scheme that requires no compiler support and is implemented through the cooperation of hardware and software. Our approach identifies the core working set of each task at runtime and consequently decides which data should be moved to the SPM. It is an adaptive strategy that adjusts the SPM contents in a timely manner to the current context in a multi-task environment. Our approach has the following advantages over static SPM allocation techniques. (1) An embedded processor with our proposed sampling hardware does not require a specific compiler that analyzes the source code and inserts management code. The proposed management approach is transparent to programmers and compilers. This feature maximizes software portability when the user migrates software from one revision to another, from one processor to another in the same family, or between two different processor families. (2) Applications downloaded from the Internet are compiled beforehand, and the compiler knows nothing about the SPM configuration of the eventual running environment. Our approach remains effective and efficient in this case, whereas static approaches are not. (3) Static methods partition and allocate SPM memory for a fixed set of tasks; this assumption usually does not hold in complex embedded systems. Our approach has no such limitation and remains efficient when new tasks are created and started on-the-fly at runtime.

The rest of this paper is organized as follows. Section 2 presents the motivation of this study and our basic idea for solving the problem. Section 3 describes our hardware architecture and software algorithm. Section 4 explains the simulation environment, experimental methodology, and benchmarks. Our experimental results are presented in Section 5. Section 6 discusses related work, and Section 7 concludes the paper.

Problem statement and basic idea

In modern embedded systems running an RTOS, many tasks typically share a single processor. Scheduling refers to the way tasks are assigned to run on this processor, and it is carried out by software known as the scheduler in the RTOS. With priority preemptive scheduling, the scheduler ensures that, at any given time, the processor executes the highest-priority task among all currently ready tasks. There are two situations when scheduling tasks in the

Dynamic and adaptive SPM management with random sampling

The runtime adaptive SPM management is supported by both hardware and software. A hardware component is responsible for random sampling, which predicts the most frequently accessed regions based on locality theory. Software is responsible for the data movement between off-chip DRAM and on-chip SPM, as well as for remapping the virtual address (VA) of a moved block to its newly allocated physical address (PA).
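
A minimal sketch of the software side of this cooperation is given below, assuming a page-granularity block and a hypothetical page-table interface. Function names such as spm_alloc(), pt_lookup_pa(), pt_remap(), and tlb_invalidate_page() are placeholders, not the interface actually used in our implementation.

    /* Sketch of the interrupt handler: on an MRSU interrupt, the OS copies the
     * hot block into the SPM and redirects its virtual page to the SPM.
     * All function names and the page-sized granularity are assumptions. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    extern void    *spm_alloc(uint32_t size);                /* hypothetical SPM allocator    */
    extern void    *pt_lookup_pa(uint32_t va);               /* VA -> current physical page   */
    extern void     pt_remap(uint32_t va, uint32_t new_pa);  /* rewrite the page-table entry  */
    extern void     tlb_invalidate_page(uint32_t va);        /* drop the stale translation    */

    void mrsu_irq_handler(uint32_t sampled_va)
    {
        uint32_t va   = sampled_va & ~(PAGE_SIZE - 1);       /* page containing the sample    */
        void    *dram = pt_lookup_pa(va);
        void    *spm  = spm_alloc(PAGE_SIZE);

        if (spm == NULL)
            return;          /* SPM full: a real system would evict a victim page here */

        memcpy(spm, dram, PAGE_SIZE);                        /* move the hot block on-chip    */
        pt_remap(va, (uint32_t)(uintptr_t)spm);              /* follow-up accesses hit SPM    */
        tlb_invalidate_page(va);                             /* force the MMU to reload the
                                                                new mapping                   */
    }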

Evaluation methodology

In this section, we present the evaluation methodology for the proposed dynamic and adaptive SPM management strategy. We use an ARM926EJ-S embedded processor and typical benchmarks to evaluate the performance of our approach. We first describe the simulation environment and benchmarks, and then explain the metrics used to measure execution time and energy.

Results

This section presents the experimental results, comparing a hybrid SPM-cache architecture and an SPM-only architecture, both managed by our proposed approach, with a cache-only reference system. The comparison focuses mainly on execution time and energy consumption. We further analyze the access distribution and the effect of varying SPM sizes under our method. We emphasize the reduction in energy consumption while giving adequate consideration to the effect on execution time.

Related work

In this section, we briefly review previous research on SPM data allocation. SPM allocation techniques in single-task scenarios have been studied extensively in the past [1], [7], [9], [18], [21], [23], [29], [30], [32]. These allocation techniques can be classified into two categories according to whether the contents of the SPM can change at run-time: static methods and dynamic methods. Static methods are those in which the contents of the SPM do not change at run-time [3], [4],

Conclusion

This paper presents a new dynamic and adaptive SPM management strategy for multi-task environments. It is a fully automatic scheme with no dependence on compiler or profiling information. With our approach, the data of different tasks can share the same SPM space at runtime. Without knowing either the size of the variables allocated by each task or the total amount of SPM available, our approach dynamically detects the core working sets of tasks and adjusts the contents of the SPM

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant No. 60973010 and the National Research Foundation for the Doctoral Program of Higher Education of China.

References (35)

  • F. Angiolini et al., Polynomial-time algorithm for on-chip scratchpad memory partitioning.
  • ARM Limited, ARM926EJ-S Technical Reference Manual, <http://infocenter.arm.com>, April...
  • O. Avissar et al., Heterogeneous memory management for embedded systems.
  • O. Avissar et al., An optimal memory allocation scheme for scratch-pad-based embedded systems, Trans. Embed. Comput. Syst. (2002).
  • A. Chatzigeorgiou et al., Evaluating performance and power of object-oriented vs. procedural programming in embedded processors.
  • N. Deng, W. Ji, F. Shi, A novel adaptive scratchpad memory management strategy, in: The 15th IEEE International...
  • A. Dominguez et al., Heap data allocation to scratch-pad memory in embedded systems, J. Embed. Comput. (2005).
  • B. Egger et al., Scratchpad memory management for portable systems with a memory management unit.
  • B. Egger et al., Dynamic scratchpad memory management for code in portable systems with an MMU, Trans. Embed. Comput. Syst. (2008).
  • B. Egger, J. Lee, H. Shin, Scratchpad memory management in a multitasking environment, in: EMSOFT, 2008, pp....
  • Y. Etsion et al., L1 cache filtering through random selection of memory references.
  • Y. Etsion et al., Probabilistic prediction of temporal locality, IEEE Comput. Archit. Lett. (2007).
  • M.R. Guthaus, J.S. Ringenberg, D. Ernst, T.M. Austin, T. Mudge, R.B. Brown, MiBench: A free, commercially...
  • I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, Multiprocessor system-on-chip data reuse analysis for exploring...
  • M. Kandemir et al., Dynamic on-chip memory management for chip multiprocessors.
  • M.T. Kandemir, Data locality enhancement for CMPs, in: ICCAD, 2007, pp....
  • J. Lee et al., FaCSim: a fast and cycle-accurate architecture simulator for embedded systems.
    Weixing Ji received his Ph.D. degree from Beijing Institute of Technology, Beijing, China, in 2008. He is currently an Assistant Professor with the School of Computer Science and Technology, Beijing Institute of Technology. His research interests are in embedded systems, object-oriented programming, and computer architecture.

    Ning Deng received the B.E. degree in School of Computer from National University of Defense Technology (China) in 2007. He is currently a Ph.D. Student at Beijing Institute of Technology. His primary research interest is on-chip memory management of microprocessors, with particular emphasis on scratchpad memory management for embedded systems currently. He is a student member of ACM.

    Feng Shi received his B.E. degree in physics in 1983 from Peking University and his Ph.D. degree from Beijing Institute of Technology, Beijing, China, in 1999. He is currently a Professor with the School of Computer Science and Technology, Beijing Institute of Technology. His research focuses on parallel computing and computer architecture.

    Qi Zuo received the B.E. degree in computer science and engineering in 2005 from Beijing Institute of Technology. She is currently pursuing her Ph.D. degree in computer science at Beijing Institute of Technology. Her research focuses on parallel and distributed computing, computer architecture and memory management.

    Jiaxin Li received the B.E. degree in computer science from Beijing Institute of Technology, Beijing, China, in 2004. She is currently working toward the Ph.D. degree in computer science at Beijing Institute of Technology, Beijing, China. She is the author of some papers in her areas of research interest, which include computing models of chip multi-processors and routing algorithms of network on chip.
