On the performance of trace locality of reference
Introduction
Hierarchical systems are used extensively in computing. Most systems use a cache to decrease access time; examples include file systems, network servers, web proxies, network clients, and memory management systems.
In a hierarchical model, objects are fetched and placed into a layer when a miss occurs, or they are prefetched. If there is no room for the new object, a replacement algorithm selects a victim. Both prefetching and replacement algorithms rely on predicting system behavior, and the success of a hierarchical system depends on the accuracy of that prediction.
Although the concepts developed in this paper are general and can be applied to any hierarchical system, this paper focuses on memory management systems and studies their data access patterns. The results presented here can be used for caching (cache miss handling), address translation (TLB management), and virtual memory management (page fault handling).
Various methods have been studied to model system behavior and predict the objects required in the near future. Locality of reference (LoR) [11] is among them. Two LoR types are defined in the literature: spatial and temporal. Most replacement algorithms, including LRU, and the effectiveness of hierarchical systems in general, are based on temporal LoR [9], [43]; some prefetching algorithms, such as sequential prefetching [42] and stream buffers [31], are based on spatial LoR. Similarly, fetching a whole cache block or page frame instead of just the requested word (itself a form of prefetching) is based on spatial LoR [44].
However, a variety of access patterns are not covered by the traditional LoR types [18], [29]. The fact that traditional LoR types do not capture all access patterns has motivated many prediction algorithms. Examples are prefetch algorithms for array-based programs [3], [5] and algorithms for pointer-intensive programs [4], [10], [21], [22], [24], [25], [26], [27], [36], [37].
The main problem with these approaches is that they are tied to special cases: the program segment exhibiting the expected behavior must first be identified, and only then can the algorithm be applied to it. Most of them are offline algorithms that require compiler support [3], [4].
A group of prediction methods uses the system trace, defined as the sequence of accessed objects. If an object was accessed previously and is accessed again, such methods predict that the objects near it in the trace will be accessed as well. Branch prediction algorithms use the past outcomes of a branch instruction to predict whether it will be taken the next time it executes. Trace caches and trace processors [19], [20], [32], [34], [35] extend the idea of branch prediction to predict the next basic blocks of code to be executed.
For data access patterns, recency-based prediction [38], based on the LRU stack, and a number of frequency-based graph algorithms [13], [17], [22] use the system trace to predict future behavior. The graph algorithms are the Markov predictor [22], [30], the access graph [13], and the probability graph [17]. Markov prediction has been used in the context of cache prefetching [22] and I/O prediction [30]; the access graph has been used in virtual memory management, and the probability graph in file systems.
This paper formalizes the concept of trace LoR in general and shows how it can be used to predict future data accesses in memory management systems. Section 2 reviews related work. Section 3 defines trace LoR as a general property of most systems. Section 4 introduces the trace graph for capturing trace LoR. Section 5 presents the benchmark results, including the effects of system configuration. Section 6 introduces extensions to the trace graph that predict the correct behavior when more than one trace is associated with an object. Section 7 defines n-stride prediction and evaluates its usefulness. Finally, Section 8 concludes the paper.
Section snippets
Related work
In the context of memory management, various models have been suggested to predict a program's behavior and prefetch objects. There are three types of prefetching techniques: offline, online, and hybrid. Offline techniques rely on a compiler to analyze the program and insert prefetch instructions into the code [6], [15], [46]. Online algorithms detect access patterns at runtime, and hybrid methods [7], [8], [16], [24], [40], [45], [47] use both compiler analysis and runtime behavior to predict the
Trace locality of reference
Consider the algorithm used to find an item in a linked list. It first examines the first element; if it matches the requested item, the search completes and the element is returned.
Otherwise, the second element is examined, and the process continues until the item is found or the list is exhausted. This behavior is repeated whenever an item is searched for in the linked list; in fact, most linked-list processing functions behave the same way.
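As a concrete illustration of this repeated behavior, the following sketch (illustrative names, not the paper's code) records the keys visited by a linked-list search; repeated searches touch the nodes in the same order, producing overlapping traces:

```python
# Searching a linked list touches nodes in the same order on every
# search, so the access trace repeats (and repeated prefixes appear).

class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt

def find(head, key, trace):
    """Linear search; records each visited node's key in `trace`."""
    node = head
    while node is not None:
        trace.append(node.key)   # the "object access" a predictor would see
        if node.key == key:
            return node
        node = node.next
    return None

# Build the list 10 -> 20 -> 30 -> 40.
head = Node(10, Node(20, Node(30, Node(40))))

t1, t2 = [], []
find(head, 30, t1)
find(head, 40, t2)
print(t1)  # [10, 20, 30]
print(t2)  # [10, 20, 30, 40] -- the earlier trace repeats as a prefix
```

A trace-based predictor can exploit exactly this repetition: having seen 10 followed by 20 once, it predicts 20 the next time 10 is accessed.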
As another example,
Definition
To make predictions using trace LoR, one must store the trace of the system in a form that can be used for prediction. In the simplest case, one wishes to predict only the next object and to use only the last occurrence of the current object. For this purpose, the trace graph is introduced.
Trace graph. For each object in the object space, one node is created. Each node has at most one outgoing edge. If the trace contains the sequence ⟨a, b⟩, an edge is created from the node of a to the node of b.
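A minimal sketch of such a trace graph, assuming a hash map as the node/edge store (the paper's concrete data structure may differ): each access overwrites the previous object's single outgoing edge, and the prediction for the current object is whatever followed it at its last occurrence.

```python
# Hypothetical trace-graph sketch: one node per object, at most one
# outgoing edge per node, updated as the trace is observed.

class TraceGraph:
    def __init__(self):
        self.edge = {}        # object -> successor seen at its last occurrence
        self.prev = None      # previously accessed object

    def access(self, obj):
        """Record an access; return the predicted next object (or None)."""
        if self.prev is not None:
            self.edge[self.prev] = obj   # overwrite: at most one outgoing edge
        self.prev = obj
        return self.edge.get(obj)        # predict from the last occurrence

g = TraceGraph()
trace = ["a", "b", "c", "a"]
preds = [g.access(x) for x in trace]
print(preds)  # [None, None, None, 'b'] -- 'a' was last followed by 'b'
```

On the second access to "a", the graph predicts "b", the object that followed "a" in the earlier part of the trace.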
Experiments
To evaluate the effect of prediction using trace LoR on system performance, SimpleScalar [1] was used, extended to extract the required information. The benchmarks were taken from the SPEC CPU 2000 suite, and each was run for 20,000,000 addresses. Table 1 lists the benchmarks used.
The prediction accuracy (the ratio of correct predictions to the total number of predictions) was measured for each program and each of its input sets. Then, for each
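The accuracy metric defined above can be sketched as follows, using a simple last-successor table as a stand-in for the full trace graph (names are illustrative, not the paper's measurement harness):

```python
# Prediction accuracy = correct predictions / total predictions issued.

def prediction_accuracy(trace):
    last_succ = {}            # object -> successor at its last occurrence
    prev = pred = None
    correct = total = 0
    for obj in trace:
        if pred is not None:  # a prediction was issued on the previous access
            total += 1
            correct += (pred == obj)
        if prev is not None:
            last_succ[prev] = obj
        pred = last_succ.get(obj)   # prediction for the next access
        prev = obj
    return correct / total if total else 0.0

# A perfectly repeating access pattern is predicted with accuracy 1.0:
print(prediction_accuracy(list("abcabcabc")))
```

For the repeating trace above, every prediction issued after the first cycle is correct, so the accuracy is 1.0; irregular traces lower the ratio.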
Enhancing the model
An object can have more than one associated trace in two ways. To illustrate the first, note that a program may not follow its previous behavior exactly: depending on the input parameters, a code fragment accesses different objects and follows different execution paths. In this way, the same code fragment produces different traces.
As an example, consider the linked-list search again, and suppose that each item's key is an ordered pair. The code to find an item in this list is shown in Fig.
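One plausible way to handle multiple successors per object (an assumption for illustration, not necessarily the paper's exact extension) is to keep a count for each observed successor and predict the most frequent one rather than only the most recent:

```python
# Multi-edge trace graph sketch: each node keeps a counter per successor
# and predicts the successor seen most often.

from collections import Counter, defaultdict

class MultiEdgeTraceGraph:
    def __init__(self):
        self.succ = defaultdict(Counter)  # object -> Counter of successors
        self.prev = None

    def access(self, obj):
        """Record an access; return the most frequent successor (or None)."""
        if self.prev is not None:
            self.succ[self.prev][obj] += 1
        self.prev = obj
        counts = self.succ[obj]
        return counts.most_common(1)[0][0] if counts else None

g = MultiEdgeTraceGraph()
for x in ["a", "b", "a", "b", "a", "c", "a"]:
    pred = g.access(x)
print(pred)  # 'b' -- 'a' was followed by 'b' twice but by 'c' only once
```

A recency-based tie-break or a small bounded edge set would be alternative designs with lower storage cost; the frequency counter above simply makes the "more than one trace" case concrete.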
n-Stride prediction
The trace graph is the simplest use of trace LoR for predicting system behavior. This section explores a more sophisticated use of trace LoR, which we call n-stride prediction. It should be noted that it differs from the stride prediction discussed in [14].
n-Stride prediction shows that an access can be predicted well before it happens. Note that for a prediction to be useful, not only must the prediction be correct, but there must also be sufficient time to fetch the
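A hedged sketch of the lookahead idea: predict the object that appeared n positions later in the trace the last time the current object was seen, so the fetch can begin n accesses in advance. The exact mechanism in the paper may differ; this illustrates only the lookahead.

```python
# n-stride lookahead sketch: remember, for each object, what was accessed
# n positions after its previous occurrence, and predict that on re-access.

from collections import deque

class NStridePredictor:
    def __init__(self, n):
        self.n = n
        self.window = deque(maxlen=n)  # last n accessed objects
        self.ahead = {}                # object -> object seen n accesses later

    def access(self, obj):
        """Record an access; return a prediction for n accesses ahead."""
        if len(self.window) == self.n:
            self.ahead[self.window[0]] = obj   # obj came n steps after window[0]
        self.window.append(obj)
        return self.ahead.get(obj)

p = NStridePredictor(2)
preds = [p.access(x) for x in "abcdabcd"]
print(preds)  # [None, None, None, None, 'c', 'd', 'a', 'b']
```

With n = 2, the second pass over the repeating trace predicts each object two accesses early, leaving time to fetch it before it is needed.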
Conclusion
In this paper, the concept of trace LoR was developed. If a system's behavior has the trace LoR property, the trace of its accesses can be stored to predict its future behavior. A simple model, called the trace graph, was developed for this purpose. In the next phase, the model was enhanced to improve its effectiveness. In addition, the effects of system parameters on the trace graph were measured.
If an object occurs more than once in the trace, it is not clear which occurrence should be
Ali Mahjur received his B.S. and M.S. degrees in computer engineering from Sharif University of Technology (SUT), Iran, in 1996 and 1998, respectively. He has been a Ph.D. student in computer engineering at SUT since then. His research interests include Computer Architecture, Operating Systems, Memory Management Systems, and Programming Languages.
References (47)
- et al. SimpleScalar: an infrastructure for computer system modeling. IEEE Comput. (2002)
- et al. An effective on-chip preloading scheme to reduce data access penalty
- et al. Tolerating latency by prefetching Java objects
- et al. Data flow analysis for software prefetching linked data structures in Java controller
- et al. Simple and effective array prefetching in Java
- et al. Software prefetching
- et al. Effective hardware-based data prefetching for high performance processors. IEEE Trans. Comput. (1995)
- An effective programmable prefetch engine for high performance processors
- LRU is better than FIFO
- et al. A stateless, content-directed data prefetching mechanism
- The working set model for program behavior. Commun. ACM
- Memory-system design considerations for dynamically scheduled processors
- Experimental studies of access graph-based heuristics: beating the LRU standard
- Stride directed prefetching in scalar processors
- Precise miss analysis for program transformations with caches of arbitrary associativity
- An integrated hardware/software data prefetching scheme for shared-memory multiprocessors
- Reducing file system latency using a predictive approach
- A comparison of locality transformations for irregular codes
- Path-based next trace prediction
- Trace preconstruction
- Run-time cache bypassing. IEEE Trans. Comput.
- Prefetching using Markov predictors. IEEE Trans. Comput.
- Improving direct-mapped cache performance by the addition of a small fully associative cache and prefetch buffers
Cited by (2)
- BitTorrent traffic from a caching perspective. Journal of the Brazilian Computer Society (2013)
- Two-phase prediction of L1 data cache misses. IEE Proceedings: Computers and Digital Techniques (2006)
Amir Hossein Jahangir received his Ph.D. degree in industrial informatics from the Department of Electrical Engineering, Institut National des Sciences Appliquées, Toulouse, France, in 1989. Since then, he has been with the Department of Computer Engineering, Sharif University of Technology, Iran, where he has taught several hardware architecture courses and supervised related research projects. From 1990 to 1994 he was the head of the department and has held several other responsibilities thereafter. His research interests include High-Performance Computer Architectures, Analysis of Network Devices, and the Design of Real-Time and Fault-Tolerant Systems.
Amir Hossein Gholamipour will receive his B.Sc. from the Department of Computer Engineering, Sharif University of Technology, Iran by June 2005. His research interests include Computer Architecture, Real-Time systems, and Embedded Systems.
1. It should be noted that the n-stride prediction introduced in this paper differs from the stride prefetching introduced in other research.