Load-balancing data prefetching techniques

https://doi.org/10.1016/S0167-739X(00)00056-X

Abstract

Despite the success of hybrid data address and value prediction in increasing the accuracy and coverage of data prefetching, memory access latency is still found to be an important bottleneck to system performance. Careful study shows that about half of the cache misses are actually due to data references whose access pattern can be predicted accurately. Furthermore, the overall cache effectiveness is bounded by the behavior of unpredictable data references in cache. In this paper, we propose a set of four load-balancing techniques to address this memory latency problem. The first two mechanisms, sequential unification and aggressive lookahead, are mainly used to reduce the chance of partial hits and the abortion of accurate prefetch requests. The latter two, default prefetching and cache partitioning, are used to optimize the cache performance of unpredictable references. The resulting cache, called the LBD (load-balancing data) cache, is found to have superior performance over a wide range of applications. Simulation of the LBD cache with RPT prefetching (reference prediction table, one of the most cited selective data prefetch schemes, proposed by Chen and Baer) on SPEC95 showed that significant reductions in data reference latency, ranging from about 20% to over 90% with an average of 55.89%, can be obtained. This is compared against the performance of prefetch-on-miss and RPT, with average latency reductions of only 17.37% and 26.05%, respectively.

Introduction

Research on cache prefetching often focuses on accuracy and coverage [1]. With a higher prefetch accuracy, the chance of cache pollution is reduced. A larger coverage also allows more data/instruction references to be handled by a cache prefetch unit. To achieve these goals, recent work in data prefetching emphasizes the exploration of hybrid data address and value prediction [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Data accesses in a program are partitioned into distinct, mutually exclusive reference classes. Members of each class are usually generated by a single type of data structure in the program, and each class is handled exclusively by one predictor/prefetch unit. Good examples of these predictors include linear memory reference predictors [15], [16], [17], [18], [19] and pointer predictors [1], [9], [20], [21], [22], [23], [24].

With accurate predictors and their supporting hardware/software, very good cache performance is expected. However, the situation is not as simple as it appears. It is found [25] that the percentage of cache misses due to predictable references ranges from a few percent to over 95%, with an average of 54%. In other words, about half of the cache misses are actually due to data references that can be predicted accurately. The overall cache effectiveness is further bounded by the access behavior of unpredictable references, which contribute to the other half of the cache misses.

Investigation reveals that this performance loss is mainly due to four reasons, which are shared among different predictors and prefetch mechanisms. The first reason concerns bus usage priority. In a traditional memory system, a demand fetch always has higher priority than a prefetch request for the bus. If a demand fetch occurs while a prefetch request is being served, the prefetch request will likely be aborted and the demand fetch will then be served as soon as possible. The argument behind this decision is that any delay to a demand fetch is supposed to be visible to the program execution. However, with out-of-order processor execution and non-blocking caches, delay due to demand memory accesses might not necessarily be seen by the program execution [26]. At the same time, the high accuracy of recent data prefetching results in a higher chance for the aborted prefetch requests to be referenced (or re-issued) later. This not only wastes the bandwidth consumed by the partially completed, aborted prefetch requests, but also increases the total number of cache misses caused by data references associated with the aborted, accurate prefetch requests.
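As a rough illustration of this baseline policy, the following sketch (in C++) models a bus on which a demand fetch pre-empts any prefetch currently in flight, so the cycles already spent on the aborted prefetch are wasted and the block must later be fetched again on a demand miss. The names and structure are ours, given for illustration only; they are not taken from the paper.

#include <cstdint>
#include <optional>
#include <queue>

struct BusRequest {
    uint64_t block_addr;
    bool     is_demand;     // true: demand fetch, false: prefetch
    int      cycles_done;   // bus cycles already spent on this request
};

class SimpleBus {
public:
    // Issue a new request; a demand fetch aborts any prefetch in flight.
    void issue(const BusRequest& req) {
        if (req.is_demand && in_flight_ && !in_flight_->is_demand) {
            wasted_cycles_ += in_flight_->cycles_done;  // partially transferred prefetch is thrown away
            ++aborted_prefetches_;
            in_flight_.reset();
        }
        if (in_flight_) pending_.push(req);
        else            in_flight_ = req;
    }

    // Advance one bus cycle; a completed request frees the bus.
    void tick(int cycles_per_block) {
        if (!in_flight_) {
            if (!pending_.empty()) { in_flight_ = pending_.front(); pending_.pop(); }
            return;
        }
        if (++in_flight_->cycles_done >= cycles_per_block) in_flight_.reset();
    }

    long wasted_cycles()      const { return wasted_cycles_; }
    long aborted_prefetches() const { return aborted_prefetches_; }

private:
    std::optional<BusRequest> in_flight_;
    std::queue<BusRequest>    pending_;
    long wasted_cycles_      = 0;
    long aborted_prefetches_ = 0;
};

Under this policy, every aborted prefetch both wastes the bandwidth it has already consumed and leaves the predicted block unfetched, which is exactly the loss described above.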

The second reason concerns the collaboration between demand fetch and prefetch requests. Once a prefetch request is triggered by a memory access instruction (MAI), there is very little collaboration between the demand fetch of the MAI and its corresponding prefetch request. Each of them is handled separately as an independent request. This approach to handling memory requests is simple, but might not give very good performance. Normally, the startup overhead of a prefetch or demand fetch is fully visible on the bus. As the bus width between the first and second level caches increases, the cache miss penalty will mainly be determined by the transfer time of the first byte of data. The time required to transfer subsequent bytes of memory data in a cache block becomes relatively much smaller. A preliminary study [25] found that it might be possible to eliminate the startup overhead of a prefetch request by serving it collaboratively with the demand fetch of the same MAI.
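To see why unifying the two requests can help, consider a hypothetical cost model (the constants below are placeholders, not values from the paper): when the demand fetch and its associated prefetch are issued independently, each pays the full startup overhead, whereas a unified, sequential transfer pays it only once.

// Hypothetical cost model for one cache-block demand fetch plus one
// prefetch of the following block. All constants are illustrative.
constexpr int STARTUP         = 20;  // fixed overhead to start a bus transaction (cycles)
constexpr int BEAT            = 2;   // cycles per data beat on the bus
constexpr int BEATS_PER_BLOCK = 8;   // beats needed to transfer one cache block

// Demand fetch and prefetch issued as two independent transactions.
int separate_cost() {
    return 2 * (STARTUP + BEAT * BEATS_PER_BLOCK);
}

// Demand fetch and prefetch unified into one sequential transfer: the
// prefetched block rides on the same transaction and pays no extra startup.
int unified_cost() {
    return STARTUP + 2 * BEAT * BEATS_PER_BLOCK;
}

With these illustrative numbers, the unified transfer takes 52 cycles instead of 72, and the saving grows as the startup overhead dominates the per-byte transfer time.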

The third reason concerns prefetch coverage. To minimize cache pollution, current data predictors and prefetch schemes often trade off coverage for accuracy. An MAI needs to pass through one or more levels of confirmation tests before it is considered for prefetching. This is certainly an effective way to keep the prefetch accuracy at a very high level. The sacrifice, however, is the number of data references that are potentially covered by the scheme. References that do not fall into the coverage set impose an upper bound on the performance gain of selective data prefetch schemes. Furthermore, it is found that while it is good for a prefetch scheme to achieve a very high prediction accuracy (e.g. over 90%), this should not be an absolute requirement to trigger prefetching.
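The sketch below condenses the kind of per-instruction confirmation test an RPT-style predictor performs (after Chen and Baer [15], [16]); the state names and transitions are simplified here and omit details of the original tables. Only references that have repeatedly confirmed a constant stride are prefetched, which is how accuracy is bought at the cost of coverage.

#include <cstdint>
#include <unordered_map>

enum class State { Initial, Transient, Steady, NoPred };

struct RptEntry {
    uint64_t prev_addr = 0;
    int64_t  stride    = 0;
    State    state     = State::Initial;
};

class Rpt {
public:
    // Returns true if this reference has passed enough confirmation for a
    // prefetch of (addr + stride) to be issued.
    bool access(uint64_t pc, uint64_t addr) {
        RptEntry& e = table_[pc];
        int64_t observed = static_cast<int64_t>(addr) - static_cast<int64_t>(e.prev_addr);
        bool match = (observed == e.stride);

        switch (e.state) {
            case State::Initial:   e.state = match ? State::Steady    : State::Transient; break;
            case State::Transient: e.state = match ? State::Steady    : State::NoPred;    break;
            case State::Steady:    e.state = match ? State::Steady    : State::Initial;   break;
            case State::NoPred:    e.state = match ? State::Transient : State::NoPred;    break;
        }
        if (!match) e.stride = observed;
        e.prev_addr = addr;

        // Only confirmed (Steady) references trigger a prefetch; everything
        // else remains outside the coverage set discussed above.
        return e.state == State::Steady;
    }

private:
    std::unordered_map<uint64_t, RptEntry> table_;  // indexed by the PC of the MAI
};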

The final reason is the interference between predictable and unpredictable references. Compared to unpredictable references, predictable references usually have stronger spatial locality, larger inter-access distances, and a much larger working set. As a result, the chance for predictable references to be reused in cache is quite low. However, they interfere with the working set of the unpredictable references, which exhibit relatively stronger temporal locality, by evicting them from the cache, thus degrading the performance of the unpredictable references. The argument here is: should the predictable references be re-directed to a small prefetch buffer instead of entering the normal cache? In this way, interference between the two reference groups will be reduced and the effective size of the normal cache available to unpredictable references will be increased.
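The sketch below illustrates the idea behind this question, not the paper's actual design: blocks brought in for predictable (prefetched) streams are placed in a small FIFO prefetch buffer, so they cannot evict the working set of the unpredictable references held in the normal data cache. The class and function names are ours.

#include <cstdint>
#include <list>
#include <unordered_set>

class PrefetchBuffer {
public:
    explicit PrefetchBuffer(size_t blocks) : capacity_(blocks) {}

    // Probe the buffer on a demand access.
    bool lookup(uint64_t block_addr) const {
        return members_.count(block_addr) != 0;
    }

    // Insert a prefetched block, evicting the oldest one if the buffer is full.
    void insert(uint64_t block_addr) {
        if (members_.count(block_addr)) return;
        if (fifo_.size() == capacity_) {
            members_.erase(fifo_.front());
            fifo_.pop_front();
        }
        fifo_.push_back(block_addr);
        members_.insert(block_addr);
    }

private:
    size_t capacity_;
    std::list<uint64_t> fifo_;              // insertion order, oldest block at the front
    std::unordered_set<uint64_t> members_;  // fast membership test
};

// On a fill, a block fetched for a predictable reference goes into the small
// buffer; blocks for unpredictable references go into the normal data cache,
// whose effective capacity is thereby preserved.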

In this paper, we propose a set of four load-balancing mechanisms to make up for the discrepancy between the ideal and observed performance of accurate prefetching. They are: (i) sequential unification of demand and prefetch requests; (ii) aggressive lookahead prefetching; (iii) default prefetching; and (iv) cache partitioning. The first two mechanisms are mainly used to reduce the chance of partial hits and the abortion of accurate prefetch requests by demand fetches, while the latter two mechanisms optimize the cache performance of unpredictable references. To ensure a performance gain, these mechanisms are only triggered selectively, based on the level of confidence in the urgency, accuracy, and nature of the prefetched data. The resulting cache, called the LBD (load-balancing data) cache, is found to have superior performance over a wide range of applications. Simulation of the LBD cache with RPT prefetching (reference prediction table, one of the most cited selective data prefetching schemes, proposed by Chen and Baer [15], [16]) using SPEC95 showed that significant reductions in data reference latency, ranging from about 20% to over 90% with an average of 55.89%, can be obtained. This is compared against the performance of prefetch-on-miss and RPT, with average latency reductions of only 17.37% and 26.05%, respectively. Moreover, the LBD cache works collaboratively with most data cache predictors, thus making it attractive for practical hardware-driven cache implementation.

The outline for the rest of the paper is as follows. Section 2 briefly summarizes previous work related to accurate data prefetching and predictors, and to cache partitioning. The four proposed mechanisms, together with the load-balanced data cache (LBD cache), are then presented in Section 3. Simulation results for these mechanisms and the LBD cache are given in Section 4. Finally, the paper is concluded in Section 5.

To help the discussion in the rest of the paper, RPT prefetching is chosen as the basic data prefetch scheme onto which our mechanisms are added. RPT is chosen because it is one of the most cited schemes for accurate data prefetching. It is also widely regarded as one of the most efficient schemes for prefetching memory references with linear address sequences. Note that the four mechanisms proposed here are not limited to RPT. They can be applied to other accurate prefetch schemes with very little (if any) modification. The focus of this paper is to obtain better cache performance for a given cache space and data predictor, not to increase the accuracy of a given predictor.

Before we proceed, it will be helpful to give precise definitions for the following terms used in the rest of the paper:

  • A partial hit (or partial miss) for a cache block I is said to occur if, while block I is being prefetched from the next level of the memory hierarchy, a demand fetch for block I occurs (see the sketch after this list).

  • The data memory latency for a cache design D is defined as the difference between the total execution time of a system with cache design D and the total execution time of the system with a perfect data cache (i.e. no data cache misses).

  • A selective prefetch (SP) scheme refers to a scheme that is based on one or more reference predictors to prefetch data into cache. Furthermore, each of these predictors only handles one type of data sequence with predefined characteristics, such as a linear memory address sequence generated by an MAI. In the rest of this paper, the terms SP and RPT might sometimes be used interchangeably because RPT is chosen as the SP scheme here.
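As an illustration of the partial hit defined above, the following bookkeeping (names are ours, for illustration only) records prefetches that have been issued but not yet completed and classifies a demand access as a partial hit when its block is still in flight.

#include <cstdint>
#include <unordered_set>

enum class AccessOutcome { Hit, PartialHit, Miss };

class InFlightPrefetches {
public:
    void start(uint64_t block_addr)    { inflight_.insert(block_addr); }  // prefetch issued
    void complete(uint64_t block_addr) { inflight_.erase(block_addr); }   // block fully arrived

    // Classify a demand access to block_addr.
    AccessOutcome classify(uint64_t block_addr, bool in_cache) const {
        if (in_cache)                    return AccessOutcome::Hit;
        if (inflight_.count(block_addr)) return AccessOutcome::PartialHit;  // prefetch still being served
        return AccessOutcome::Miss;
    }

private:
    std::unordered_set<uint64_t> inflight_;  // blocks with a prefetch in progress
};

A partial hit is cheaper than a full miss because part of the transfer is already under way, but the demand access still stalls until the remaining data arrives.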

In our study, SPEC95 was chosen as the benchmark suite for experimentation (see Table 1). These programs were compiled on the UltraSPARC platform with compiler optimization enabled. Then, each of them was traced and simulated cycle-by-cycle for 100 million instructions. The baseline architecture, together with the parameters used in the simulation study, is given in Table 2. It is an UltraSPARC ISA compatible superscalar processor modeled in rich architectural detail, including the pipeline, register renaming, reorder buffer, branch prediction, and multi-level memory hierarchy.

Section snippets

Previous research

Prefetching has always been an important issue in cache design [27], [28]. Recent work in data prefetching concentrates on the predictability of references and the accuracy of prefetching. A direct consequence of this research direction is the introduction of hybrid address and value predictors [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Multiple predictors are implemented, each of which is designed, optimized, and responsible for one type of reference. Examples of

Load-balanced data cache

At the beginning of this paper, we pointed out that there are two aspects of performance loss due to memory accesses. With limited bus bandwidth for (pre-)fetching and increasing memory access latency, the observed performance of accurate data prefetching often differs significantly from its ideal one. Either the correctly predicted data cannot be prefetched in time for consumption, or the accurate prefetch requests are aborted while they are being served. This observation is generally

Experimental result

To study the performance of our proposed mechanisms in Section 3, cycle-by-cycle simulation on the baseline architecture given in Table 2 was conducted for various cache configurations and control designs. In particular, the following cache designs were simulated and analyzed:

  1. Perfect data cache (i.e. no cache misses).

  2. Normal data cache without prefetching.

  3. Normal data cache with POM (prefetch-on-miss).

  4. Data cache with RPT (original).

  5. Data cache with RPT and sequential unification only.

  6. Data cache with RPT and

Conclusion

Recent work in data prefetching has successfully improved prefetch accuracy and coverage. However, it is found that more than half of the cache misses are actually due to the predictable group of data references. Investigation shows that this is mainly due to the partial hit effect and the abortion of accurate prefetch requests by demand fetch requests. The overall cache performance is further bounded by the performance of the unpredictable reference group. To make up for this situation, we propose


References (33)

  • C.H. Chi, C.M. Cheung, Hardware-driven prefetching for pointer data references, in: Proceedings of the ACM...
  • B. Black, B. Mueller, S. Postal, R. Rakvic, Load execution latency reduction, in: Proceedings of the ACM International...
  • G.Z. Chrysos, J.S. Emer, Memory dependence prediction using store sets, in: Proceedings of the 25th Annual...
  • J. Gonzalez, A. Gonzalez, Speculative execution via address prediction and data prefetching, in: Proceedings of the ACM...
  • J. Gonzalez, A. Gonzalez, The potential of data value speculation to boost ILP, in: Proceedings of the ACM...
  • P. Ibanez, V. Vinals, J.L. Briz, M.J. Garzaran, Characterization and improvement of LOAD/STORE cache-based prefetching,...
  • D. Joseph, D. Grunwald, Prefetching using markov predictors, in: Proceedings of the 24th Annual International Symposium...
  • M.H. Lipasti, C.B. Wilkerson, J.P. Shen, Value locality and load value prediction, in: Proceedings of the Seventh...
  • A. Moshovos, S. Breach, T. Vijaykumar, G. Sohi, Dynamic speculation and synchronization of data dependencies, in:...
  • J.A. Rivers, E.S. Tam, G.S. Tyson, E.S. Davidson, M. Farrens, Utilizing reuse information in data cache management, in:...
  • Y. Sazeides, J.E. Smith, The predictability of data values, in: Proceedings of the MICRO-30, ACM, New York, 1997, pp....
  • Y. Sazeides, J.E. Smith, Modeling program predictability, in: Proceedings of the 25th Annual International Symposium on...
  • Y. Sazeides, S. Vassiliadis, The performance potential of data dependence speculation and collapsing, in: Proceedings...
  • F. Wang, M. Franklin, Highly accurate data value prediction using hybrid predictors, in: Proceedings of the MICRO-30,...
  • T.F. Chen, J.L. Baer, Reducing memory latency via non-blocking and prefetching caches, in: Proceedings of the Fifth...
  • T.F. Chen et al., Effective hardware-based prefetching for high performance processors, IEEE Trans. Comput. (1995)

    Chi-Hung Chi is currently a Senior Lecturer in the School of Computing, National University of Singapore. He obtained his BSc degree from the University of Wisconsin, Madison and his PhD degree from the Purdue University. From 1988 to 1991, he worked in Philips Laboratory as a Senior Research Scientist. From 1991 to 1993, he was with IBM. From 1993 to 1996, he was with the Department of Computer Science, Chinese University of Hong Kong. He publishes extensively and holds six US patents. His current research interest includes computer architecture, parallel processing, compiler optimization, and Internet technologies.

    Jun-Li Yuan is currently a PhD student in the School of Computing, National University of Singapore. His research interest includes computer architecture and parallel processing.
