Elsevier

Parallel Computing

Volume 38, Issues 10–11, October–November 2012, Pages 533-551

Algorithm-level Feedback-controlled Adaptive data prefetcher: Accelerating data access for high-performance processors

https://doi.org/10.1016/j.parco.2012.06.002

Abstract

The rapid advance of processor architectures, such as the emergence of multicore designs, and the substantially increased on-chip computing capability have put more pressure than ever on sluggish memory systems. Meanwhile, many applications have become increasingly data intensive. Data-access delay, not processor speed, has become the leading performance bottleneck in high-performance computing. Data prefetching is an effective solution for accelerating applications’ data access and bridging the growing gap between computing speed and data-access speed. Existing prefetching techniques, however, are generally very conservative, owing to past concerns about power consumption. They suffer from low effectiveness, especially when an application’s access pattern changes. In this study, we propose an Algorithm-level Feedback-controlled Adaptive (AFA) data prefetcher to address these issues. The AFA prefetcher is built on the Data-Access History Cache, a hardware structure specifically designed for data-access acceleration. It provides algorithm-level adaptation and is capable of dynamically switching to appropriate prefetching algorithms at runtime. We have conducted extensive simulation testing with the SimpleScalar simulator to validate the design and to analyze the performance gain. The simulation results show that the AFA prefetcher is effective and achieves considerable IPC (Instructions Per Cycle) improvement for 21 representative SPEC CPU benchmarks.

Introduction

With the rapid advance of semiconductor process technology and the evolution of micro-architecture, processor cycle times have been reduced significantly over the past decades. Compared to processor performance improvement, however, especially the aggregate performance of multicore/manycore architectures, data-access performance (latency and bandwidth) has improved at a snail’s pace. Memory speed has increased by only roughly 9% per year over the past two decades, far below the nearly 50% per year improvement in processor performance [18]. This performance disparity between processor and memory is predicted to keep widening in the coming decades [18]. The unbalanced improvement leads to one of the most significant performance bottlenecks in high-performance computing, known as the memory-wall problem [24], [39]. The reasons behind this huge processor-memory performance disparity are severalfold. First, most advanced architectural and organizational efforts have focused on processor technology rather than on memory devices. Second, rapidly improving semiconductor technology allows many more, much smaller transistors to be built on chip for processing units, enabling high computational capability. Third, technology improvement for memory devices has focused primarily on higher density, which yields much larger memory capacity while bandwidth and latency improve only slowly. Multi-level cache hierarchies have been the primary means of avoiding large performance loss due to long memory-access delays. However, cache memories rely on data-access locality; when applications lack locality due to non-contiguous accesses, a multi-level cache hierarchy does not work well.

Data prefetching is a common technique to accelerate data access and reduce processor stall time, and it has been widely recognized as a critical companion to the memory hierarchy in overcoming the data-access bottleneck [24], [39], [18]. Several series of commercial high-performance processors have adopted data prefetching to hide long data-access latency [13], [18], [23]. As the term indicates, the essential idea of data prefetching is to observe data referencing patterns, speculate on future references, and fetch the predicted data closer to the processor before the processor demands it. By overlapping computation and data accesses, data prefetching can overcome the limitations of cache memories, hide long memory-access latency, and speed up data access. Numerous studies have been conducted and many prefetching strategies have been proposed [5], [6], [10], [11], [13], [21], [23], [25], [31], [35], [37].
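The observe-speculate-fetch idea above can be illustrated with a minimal stride detector. This is a generic sketch of the basic prefetching principle, not the paper's AFA design; the function name and prefetch degree are illustrative assumptions.

```python
# Minimal sketch of the prefetching idea: watch the recent address
# stream, detect a constant stride, and predict the next references.
# Illustrative only -- not the paper's AFA prefetcher.

def predict_next(addresses, degree=2):
    """If the last three accesses form a constant nonzero stride,
    return the next `degree` predicted addresses; else an empty list."""
    if len(addresses) < 3:
        return []
    s1 = addresses[-1] - addresses[-2]
    s2 = addresses[-2] - addresses[-3]
    if s1 != s2 or s1 == 0:
        return []  # no stable pattern detected; issue no prefetches
    return [addresses[-1] + s1 * i for i in range(1, degree + 1)]
```

For example, after observing accesses at 100, 164, and 228 (a 64-byte stride), the sketch would predict 292 and 356, which a prefetcher would fetch before the processor demands them.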

Data prefetchers, as data-access accelerators, have been adopted in production processors and found to be very effective for applications with regular data-access patterns, such as data streaming [23], [13]. In general, however, current prefetching techniques are quite limited. Software solutions are too slow for cache-level prefetching, whereas hardware prefetchers are static in nature and cannot change with an application’s data-access patterns. Previous studies show that little effort has been made to support dynamic hardware adaptation among different strategies based on runtime application characteristics [4]. In fact, the effectiveness of data prefetching depends on the application and its data-access pattern; no single prefetching algorithm is suitable for all applications at runtime. A general and effective data prefetcher must therefore be dynamic in nature.

In this research, we propose an Algorithm-level Feedback-controlled Adaptive data prefetcher (AFA prefetcher for short) that provides a dynamic and adaptive prefetching methodology based on the recently proposed generic prefetching structure, the Data-Access History Cache (DAHC) [6], and on runtime feedback collected by hardware counters. The DAHC is a cache structure designed for data prefetching and data-access acceleration. It effectively tracks data-access history, maintains correlations of both the data-access address stream and the instruction address stream, and can serve as an efficient substrate for many prefetching algorithms [6]. Hardware counters have gained considerable attention in recent years and are becoming increasingly important for improving the performance of contemporary processor architectures, operating systems, and applications [30], [36], [40]. This study builds on that trend and assumes hardware counters are available within a data prefetcher residing at the lowest cache level. Based on the DAHC and the available hardware counters, the AFA prefetcher recognizes distinct data-access patterns and adapts to the corresponding appropriate prefetching algorithms at runtime. This adaptation methodology is significantly better than conventional static prefetching strategies: it improves prefetching effectiveness, which in turn improves overall performance. The AFA prefetcher does consume some transistors; however, chip space and transistor count are not limiting factors for current processor architectures, and trading chip space for lower data-access latency is a current trend [24], [39], which the AFA prefetcher and the DAHC follow. A preliminary version of this research was published in [7]; this paper extends it with additional background, design details, simulation details, evaluation tests, and related-work discussion.
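The feedback-controlled adaptation described above can be sketched as a simple control loop: hardware counters score each candidate algorithm over an interval, and the prefetcher switches to the best performer. The class name, counter layout, and accuracy-based scoring rule below are illustrative assumptions, not the paper's exact mechanism.

```python
# Hypothetical sketch of algorithm-level feedback control: per-algorithm
# counters track issued vs. useful prefetches, and adapt() selects the
# algorithm with the best observed accuracy. Illustrative only.

class AdaptivePrefetcher:
    def __init__(self, algorithms):
        # per-algorithm counters, as a stand-in for hardware counters
        self.stats = {name: {"issued": 0, "useful": 0} for name in algorithms}
        self.active = algorithms[0]  # start with an arbitrary algorithm

    def record(self, name, issued, useful):
        """Accumulate counter values observed during an interval."""
        self.stats[name]["issued"] += issued
        self.stats[name]["useful"] += useful

    def adapt(self):
        """Switch to the algorithm with the highest prefetch accuracy."""
        def accuracy(name):
            s = self.stats[name]
            return s["useful"] / s["issued"] if s["issued"] else 0.0
        self.active = max(self.stats, key=accuracy)
        return self.active
```

In this sketch, an algorithm that issued 10 prefetches of which 8 were useful would win over one with 2 useful out of 10, so the prefetcher would adapt toward the former at the next interval boundary.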

The contributions of this work are as follows:

  • First, we demonstrate that different prefetching algorithms exhibit substantial variation in performance improvement across diverse applications. The variance is largely due to the distinct access patterns that different applications exhibit.

  • Second, we argue that the rapid advance of semiconductor technology and the trend of integrating a considerable number of hardware counters on chip provide an opportunity to explore dynamic and adaptive strategies that can, on average, deliver better performance improvement across various applications.

  • Third, we propose an algorithm-level adaptive prefetcher, named the AFA prefetcher, that dynamically adapts to well-performing algorithms at runtime based on a three-tuple evaluation metric. The three metrics are orthogonal and complementary to each other. We adopt innovative mechanisms to keep the hardware cost of the proposed design low.

  • Fourth, we carry out extensive simulation testing to verify the proposed design and to evaluate the performance improvement. We also vary the simulation configurations and conduct a sensitivity analysis of the proposed AFA prefetcher.

  • Last, to our knowledge, this study is the first to explore algorithm-level dynamic adaptation for accelerating data access based on an application’s access pattern. Such an approach is promising and has great potential for accelerating data accesses on high-performance processors such as multicore processors. We hope this study brings dynamic data-access acceleration to the community’s attention and inspires further research on accelerating applications’ data-access performance.
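The three-tuple evaluation metric mentioned in the third contribution is not spelled out in this excerpt. A common orthogonal triple in the prefetching literature is (accuracy, coverage, timeliness); the sketch below uses that triple purely as an assumption about what such a metric could look like.

```python
# Assumed three-tuple prefetch evaluation: (accuracy, coverage,
# timeliness). These particular metrics are an assumption of this
# sketch, not confirmed details of the paper's design.

def prefetch_metrics(useful, issued, misses_covered, total_misses, timely):
    # accuracy: fraction of issued prefetches that turned out useful
    accuracy = useful / issued if issued else 0.0
    # coverage: fraction of would-be cache misses eliminated by prefetching
    coverage = misses_covered / total_misses if total_misses else 0.0
    # timeliness: fraction of useful prefetches that arrived before demand
    timeliness = timely / useful if useful else 0.0
    return accuracy, coverage, timeliness
```

The three quantities are independent in the sense the bullet suggests: a prefetcher can be accurate but cover few misses, or cover many misses with data that arrives too late to help.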

The rest of this paper is organized as follows. Section 2 briefly reviews the DAHC structure. Section 3 presents the proposed AFA prefetcher design and discusses implementation related issues. Section 4 presents the simulation environment and simulation results. Section 5 discusses related work, and finally Section 6 concludes this study.

Section snippets

Data-Access History Cache

To fully exploit the benefits of data prefetching and to focus on accelerating data-access performance to achieve a high sustained performance instead of building extensive compute units to achieve a high peak performance, we proposed a generic prefetching-dedicated cache structure, named Data-Access History Cache (DAHC) [6]. The DAHC serves as a fundamental structure dedicated to data prefetching and data-access acceleration. The DAHC behaves as a cache for recent reference information instead …
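The idea of caching recent reference information can be illustrated with a toy model: a small, bounded structure that records recent accesses both per instruction (PC) and as a global address stream. The actual DAHC organization is described in reference [6]; the class below is only an assumed illustration of the concept.

```python
# Toy model of a data-access history cache: bounded per-PC histories
# (evicted LRU-style) plus a short global address stream. Illustrative
# assumption -- the real DAHC organization is detailed in reference [6].

from collections import OrderedDict, deque

class HistoryCache:
    def __init__(self, capacity=4, depth=3):
        self.by_pc = OrderedDict()          # PC -> recent addresses (LRU order)
        self.stream = deque(maxlen=depth)   # global data-access address stream
        self.capacity = capacity            # max number of tracked PCs
        self.depth = depth                  # history length per PC

    def record(self, pc, addr):
        """Record one access, evicting the least-recently-used PC if full."""
        hist = self.by_pc.pop(pc, deque(maxlen=self.depth))
        hist.append(addr)
        self.by_pc[pc] = hist               # re-insert at MRU position
        if len(self.by_pc) > self.capacity:
            self.by_pc.popitem(last=False)  # evict LRU entry
        self.stream.append(addr)

    def history(self, pc):
        """Return the recorded address history for an instruction."""
        return list(self.by_pc.get(pc, []))
```

Keeping both correlations in one bounded structure is what lets a single history cache feed several different prefetching algorithms, per-PC histories for stride detection and the global stream for correlation-based prediction.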

Algorithm-level Feedback-controlled Adaptive data prefetcher

In this section, we present the design of an Algorithm-level Feedback-controlled Adaptive data prefetcher. This proposed AFA prefetcher leverages the powerful functionality provided by DAHC, supports multiple prefetching algorithms and dynamically adapts to those algorithms that perform well at runtime. The essential idea is using runtime feedback and evaluation to direct the dynamic adaptation. We first present the motivation of this work, and then discuss the evaluation metrics, the …

Simulation and performance analysis

We have carried out simulation experiments to study the feasibility of the proposed AFA prefetcher and analyze the potential performance impact. Stream prefetching [11], [23], stride prefetching [5], [13], Markov prefetching [21] and MLDT prefetching [35] algorithms were selected for simulation. The adaptive strategy of the AFA prefetcher is independent of the underlying prefetching algorithms. In theory, any prefetching algorithm can be supported by the AFA prefetcher. The stream, stride, …
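Of the four algorithms selected above, Markov prefetching [21] is the least self-explanatory: it learns which miss addresses historically follow each miss address and predicts the most frequent successors. The sketch below is a simplified first-order model of that idea; the class name and interface are illustrative, not taken from the paper.

```python
# Simplified first-order Markov prefetcher in the spirit of Joseph and
# Grunwald [21]: record miss-address successor frequencies and, on a
# miss, predict the most common successors. Illustrative sketch only.

from collections import defaultdict, Counter

class MarkovPrefetcher:
    def __init__(self, degree=1):
        self.successors = defaultdict(Counter)  # addr -> successor frequencies
        self.prev = None                        # previous miss address
        self.degree = degree                    # max predictions per miss

    def miss(self, addr):
        """Record a cache miss and return predicted next miss addresses."""
        if self.prev is not None:
            self.successors[self.prev][addr] += 1  # learn the transition
        self.prev = addr
        ranked = self.successors[addr].most_common(self.degree)
        return [a for a, _ in ranked]
```

Unlike stream and stride prefetching, this approach can capture repeating irregular patterns (e.g. pointer-chasing over a stable data structure), which is why a pattern-dependent, adaptive selection among such algorithms is attractive.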

Related work

Data prefetching, as the name indicates, is a technique to fetch data before it is requested. A similar technique is instruction prefetching, which tries to speculate on future instructions and fetch them from memory in advance [15]. Data prefetching is usually classified as software prefetching or hardware prefetching [37]. Software prefetching is a technique in which prefetch instructions are inserted into the source code, either by the programmer or by a compiler during the optimization phase. Hardware …

Conclusion

Advances in processor architectures such as multicore architectures have put more pressure than ever on reducing data-access latency for high-performance computing systems. Data-access latency has been recognized by many as the leading factor preventing high-sustained performance of applications. Data prefetching is an effective solution to accelerating data-access performance and to mitigating the fast growing processor-memory performance gap. Many hardware prefetching techniques have been …

Acknowledgements

This research is sponsored in part by the National Science Foundation under NSF Grants CCF-0621435 and CCF-0937877, and by the ACM/IEEE High-Performance Computing Ph.D. Fellowship and the Illinois Institute of Technology Fieldhouse Research Fellowship.

References (42)

  • B. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, 1970.
  • A. Bhattacharjee, M. Martonosi, Inter-core cooperative TLB for chip multiprocessors, in: ASPLOS, 2010, pp....
  • D.C. Burger, T.M. Austin, S. Bennett, Evaluating Future Microprocessors: The SimpleScalar Tool Set, University of...
  • S. Byna et al., Taxonomy of data prefetching for multicore processors, Journal of Computer Science and Technology (JCST), 2009.
  • T.F. Chen et al., Effective hardware-based data prefetching for high performance processors, IEEE Transactions on Computers, 1995.
  • Y. Chen, S. Byna, X.-H. Sun, Data access history cache and associated data prefetching mechanisms, in: Proceedings of...
  • Y. Chen, H. Zhu, X.-H. Sun, An adaptive data prefetcher for high-performance processors, in: Proceedings of the 10th...
  • Y. Chen, H. Zhu, H. Jin, X.-H. Sun, Improving the effectiveness of context-based prefetching with multi-order analysis,...
  • Y. Chou, Low-cost epoch-based correlation prefetching for commercial applications, in: MICRO,...
  • F. Dahlgren, M. Dubois, P. Stenstrom, Fixed and adaptive sequential prefetching in shared-memory multiprocessors, in:...
  • F. Dahlgren et al., Sequential hardware prefetching in shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 1995.
  • P. Diaz, M. Cintra, Stream chaining: exploiting multiple levels of correlation in data prefetching, in: Proceedings of...
  • J. Doweck, Inside Intel Core Micro-architecture and Smart Memory Access, Intel White Paper,...
  • E. Ebrahimi, O. Mutlu, C.J. Lee, Y.N. Patt, Coordinated control of multiple prefetchers in multi-core systems, in:...
  • A. Falcon, A. Ramirez, M. Valero, Effective instruction prefetching via fetch prestaging, in: Proceedings of the 19th...
  • I. Ganusov, M. Burtscher, Future execution: a hardware prefetching technique for chip multiprocessors, in: Proceedings...
  • B. Goeman, H. Vandierendonck, K. Bosschere, Differential FCM: increasing value prediction accuracy by improving table...
  • J. Hennessy et al., Computer Architecture: A Quantitative Approach, 2006.
  • Z. Hu, M. Martonosi, S. Kaxiras, Timekeeping in the memory system: predicting and optimizing memory behavior, in: ISCA,...
  • N.P. Jouppi, Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch...
  • D. Joseph, D. Grunwald, Prefetching using Markov predictors, in: Proceedings of the 24th Annual Symposium on Computer,...