Feasibility of decoupling memory management from the execution pipeline

https://doi.org/10.1016/j.sysarc.2007.03.003

Abstract

In conventional architectures, the central processing unit (CPU) spends a significant amount of execution time allocating and de-allocating memory. Efforts to improve memory management functions using custom allocators have led to only small improvements in performance. In this work, we test the feasibility of decoupling memory management functions from the main processing element to a dedicated memory management hardware unit. Such hardware can reside on the same die as the CPU, in a memory controller, or embedded within a DRAM chip. Using Simplescalar, we simulated our architecture and investigated the execution performance of various benchmarks selected from SPECInt2000, Olden and other memory intensive application suites.

The hardware allocator reduced the execution time of applications by as much as 50%. In fact, the decoupled hardware results in a performance improvement even when we assume that both the hardware and software memory allocators require the same number of cycles. We attribute much of this improved performance to improved cache behavior, since decoupling memory management functions reduces cache pollution caused by dynamic memory management software. We anticipate that even higher levels of performance can be achieved by using innovative hardware and software optimizations. We do not show any specific implementation for the memory management hardware. This paper only investigates the potential performance gains that can result from a hardware allocator.
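The paper deliberately leaves the hardware unspecified, but the decoupled model it evaluates can be sketched in software. The following is a minimal, hypothetical illustration (the mailbox interface and all names are our own assumptions, not from the paper): the CPU posts an allocation request and a separate manager services it, so allocator code and metadata never pass through the CPU's own pipeline and caches.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mailbox the CPU would use to talk to a decoupled
 * memory manager. In hardware this might be a pair of memory-mapped
 * registers; here it is simulated entirely in software. */
typedef struct {
    size_t req_size;   /* allocation request posted by the CPU    */
    void  *response;   /* pointer filled in by the memory manager */
} alloc_mailbox;

/* Simulated decoupled manager: services one pending request.
 * A real hardware unit would run concurrently with the pipeline,
 * so the CPU never executes allocator code and never drags the
 * allocator's metadata through its own caches. */
static void manager_service(alloc_mailbox *mb) {
    mb->response = malloc(mb->req_size);  /* stand-in for the hardware allocator */
}

/* What the application-side stub would look like. */
static void *decoupled_alloc(alloc_mailbox *mb, size_t size) {
    mb->req_size = size;   /* post the request */
    manager_service(mb);   /* in hardware: happens off the CPU pipeline */
    return mb->response;   /* CPU picks up the result */
}
```

In the simulated architecture the service step would overlap with useful CPU work; this sketch only shows the interface shape, not the concurrency or the cache effects the paper measures.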

Section snippets

Introduction and motivation

Modern programming languages often permit complex dynamic memory allocation and garbage collection. Such features present computer systems architects with the challenge of reducing the overheads due to memory management functions. The challenge is further exacerbated by the ever-increasing gap between memory and processor speeds. Some researchers chose to employ custom memory allocation methods in their systems; however, it has been shown that such custom allocators generally do not improve

Related research

Several research threads, including custom allocators and hardware implementations of memory management functions, have influenced our research.

Dynamic memory management is an important problem studied by researchers for the past several decades. Modern programming languages and applications are driving the need for more efficient implementations of memory management functions, in terms of both memory usage and execution performance. Several researchers have proposed and implemented custom

Experimental framework

To evaluate the potential for decoupling memory management, we constructed experiments to reflect conditions as close to real execution environments as possible. We have identified and controlled experimental parameters such as machine model(s), appropriate benchmarks, and statistical attributes of interest. In this section we describe our methodology and the selection of benchmarks.

Experiment results

In this section, we report the results of our experiments. We discuss both the execution performance and cache behavior resulting from decoupling of memory management functions.

Simple optimization of the decoupled memory manager

In general, a hardware implementation of any function should require fewer cycles than a corresponding software implementation. The performance of a hardware implementation of Lea’s allocator can also be improved for applications such as voronoi and treeadd, which make bursts of malloc calls. In such cases the (hardware) allocator could predict that the next malloc request will be for the same sized object as the previous request. Thus the allocator could pre-allocate similar-sized objects. If
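The same-size prediction described above can be sketched in a few lines of C. This is our own illustrative software analogue of the idea, not the paper's hardware design; all names are hypothetical. The allocator speculatively pre-allocates a block of the last requested size, so a repeated request in a burst is served without running the full allocation path.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch of same-size prediction: after each request,
 * speculatively pre-allocate another block of the same size, betting
 * that bursts of malloc calls repeat the previous request size. */
typedef struct {
    size_t last_size;   /* size of the previous request      */
    void  *prefetched;  /* speculatively pre-allocated block */
} predictive_allocator;

static void *predictive_alloc(predictive_allocator *pa, size_t size) {
    void *p;
    if (pa->prefetched != NULL && size == pa->last_size) {
        p = pa->prefetched;       /* hit: hand out the pre-allocated block */
        pa->prefetched = NULL;
    } else {
        free(pa->prefetched);     /* misprediction: discard the speculation */
        pa->prefetched = NULL;
        p = malloc(size);         /* fall back to the normal allocation path */
    }
    /* Predict the next request repeats this size and pre-allocate for it. */
    pa->prefetched = malloc(size);
    pa->last_size  = size;
    return p;
}
```

In a decoupled design the speculative pre-allocation would overlap with CPU execution, hiding allocation latency entirely on a correct prediction; in this software sketch it merely shows the control flow.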

Conclusions and future research

In this study we have shown that decoupling memory management functions from the processing pipeline can lead to improved performance. Several features impact performance of modern architectures. Among these are out-of-order execution, speculative execution, and cache hierarchies. Application characteristics in terms of memory usage, distribution of allocation requests over the lifetime of the application, and the sizes of objects requested also impact performance. Decoupling eliminates a

References (16)

  • M. Rezaei et al., Intelligent memory management eliminates cache pollution due to memory management functions, Journal of Systems Architecture (2006)
  • E.D. Berger, B.G. Zorn, K.S. McKinley, Reconsidering custom memory allocation, in: Proceedings of the Conference on...
  • D. Patterson, The case for intelligent RAM: IRAM, IEEE Micro (1997)
  • K.M. Kavi, M. Rezaei, R. Cytron, An efficient memory management technique that improves localities, in: Proceedings of...
  • Y. Feng, E.D. Berger, A locality-improving dynamic memory allocator, in: Proceedings of the 2005 workshop on Memory...
  • V.H. Lai, S.M. Donahue, R.K. Cytron, Hardware optimizations for storage allocation in real-time systems, Tech Rept,...
  • S. Donahue, M. Hampton, R. Cytron, M. Franklin, K. Kavi, Hardware support for fast and bounded-time storage allocation,...
  • J.M. Chang et al., A high-performance memory allocator for object-oriented systems, IEEE Transactions on Computers (1996)
There are more references available in the full text version of this article.

Wentong Li received his MS degree in Computer Science from the University of North Texas. He is currently completing his PhD in Computer Science at the University of North Texas. He is employed by Turn, Inc. as staff software engineer. His research interests cover the areas of computer architecture, machine learning and information retrieval.

Mehran Rezaei received BS and MS degrees in Electrical Engineering from the University of Alabama in Huntsville and a PhD in Computer Science from the University of North Texas. He worked as a visiting faculty member at the University of Texas at Arlington. He is currently working as a software consultant in the Washington, DC area.

Krishna M. Kavi is currently a professor and the Chair of the Computer Science and Engineering Department at the University of North Texas. Previously, he was the Eminent Scholar Chair Professor of Computer Engineering at the University of Alabama in Huntsville from 1997 to 2001. He was on the faculty at the University of Texas at Arlington from 1982 to 1997. He was a program manager at the US National Science Foundation from 1993 to 1995. He has an extensive research record covering intelligent memory systems, multithreaded and decoupled architectures, the dataflow model of computation, scheduling, and load balancing.

Afrin Naz is currently completing her PhD in Computer Science at the University of North Texas. She received her MS in Computer Science from Midwestern State University, Wichita Falls, Texas. She is a member of the Upsilon Pi Epsilon chapter at Midwestern State University. She is also the recipient of the multi-cultural scholastic award of the University of North Texas. She will start her academic career as an Assistant Professor at Drake University in Iowa, in Fall 2007. Her research interests include computer architecture, compilers, and embedded system design.

Philip Sweany, associate professor in UNT's Computer Science and Engineering Department, maintains a research focus on both compiler optimization for architectures exhibiting fine-grained parallelism and the application of compiler optimization algorithms to automated synthesis of net-centric systems.
