Abstract
The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to the I-cache. Second, the performance improvement of our method is greater than that of the best competing method, BHGP [23], even disregarding the improvement from not having an extra port. Three key features of our method prevent the above deficiencies. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associating them with one preceding instruction and, therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP method [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks.
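The three features described in the abstract can be illustrated with a minimal software sketch. It is not the authors' hardware design: the class name `MissHistoryTable`, its methods, the table layout, and the small recent-prefetch filter are all hypothetical, and the abstract does not specify how the filtering is actually done. Only the 12-cycle miss latency comes from the abstract's own example.

```python
from collections import deque


class MissHistoryTable:
    """Toy model: trigger PC -> [first missing block, run length] (hypothetical layout)."""

    def __init__(self, miss_latency=12, filter_size=8):
        self.miss_latency = miss_latency               # cycles, as in the abstract's example
        self.fetches = deque(maxlen=4 * miss_latency)  # recent (cycle, pc) fetch records
        self.table = {}                                # trigger pc -> [first_block, run_length]
        self.last_block = None                         # block number of the most recent miss
        self.last_trigger = None                       # trigger that the most recent miss used
        self.recent = deque(maxlen=filter_size)        # assumed filter: recently issued prefetches

    def on_fetch(self, cycle, pc):
        """Record that the instruction at `pc` was fetched in `cycle`."""
        self.fetches.append((cycle, pc))

    def on_miss(self, cycle, block):
        """Associate a missing cache block with a trigger instruction."""
        # Feature 2: a miss contiguous with the previous one extends the same entry.
        if self.last_trigger is not None and block == self.last_block + 1:
            self.table[self.last_trigger][1] += 1
            self.last_block = block
            return self.last_trigger
        # Feature 1: use the instruction fetched `miss_latency` cycles before the miss.
        target = cycle - self.miss_latency
        trigger = None
        for fetch_cycle, pc in self.fetches:
            if fetch_cycle <= target:
                trigger = pc  # keeps the latest fetch at or before the target cycle
        if trigger is None:
            self.last_block = self.last_trigger = None
            return None
        self.table[trigger] = [block, 1]
        self.last_block, self.last_trigger = block, trigger
        return trigger

    def prefetches_for(self, pc):
        """Return the blocks to prefetch when `pc` is fetched, after filtering."""
        entry = self.table.get(pc)
        if entry is None:
            return []
        first, length = entry
        issued = []
        for b in range(first, first + length):
            # Feature 3 (stand-in): drop recently issued prefetches without probing the I-cache.
            if b not in self.recent:
                self.recent.append(b)
                issued.append(b)
        return issued
```

In this sketch, two misses to adjacent blocks occupy a single table entry, and fetching the trigger instruction a second time shortly afterwards issues no duplicate prefetches because the filter still remembers them.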
References
1. A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In The VLDB Journal, pp. 266–277, 1999.
2. Alpha Architecture Handbook, Digital Equipment Corporation, Maynard, MA, 1994.
3. D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report TR 1342. University of Wisconsin, Madison, WI, June 1997.
4. Standard Performance Evaluation Corporation. The SPEC benchmark suites. http://www.spec.org/.
5. A. M. Grizzaffi, M. Colette, M. Donnelly, and B. R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 145–155, October 1994.
6. J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 2nd edn. Morgan Kaufmann, Palo Alto, CA, 1996.
7. D. S. Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami. Circuits for wide-window superscalar processors. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA), pp. 236–247, Vancouver, British Columbia, Canada, June 2000.
8. W. C. Hsu and J. E. Smith. A performance study of instruction cache prefetching methods. IEEE Transactions on Computers, 47(5):497–508, May 1998.
9. Intel IA-64 Architecture Software Developer's Manual, Volumes I–IV. Intel Corporation, January 2000. Also available at http://developer.intel.com
10. Intel(R) Itanium(TM) Processor Hardware Developer's Manual. Intel Corporation, August 2001.
11. D. Joseph and D. Grunwald. Prefetching using Markov predictors. IEEE Transactions on Computers, 48(2):121–133, 1999.
12. G. Lauterbach and T. Horel. UltraSPARC-III: designing third generation 64-bit performance. IEEE Micro, 19(3):56–66, 1999.
13. C. K. Luk and T. C. Mowry. Cooperative instruction prefetching in modern processors. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp. 182–194, November 30–December 2, 1998.
14. Y. Patt, S. J. Patel, M. Evers, D. H. Friendly, and J. Stark. One billion transistors, one uniprocessor, one chip. IEEE Computer, 30(9):51–58, September 1997.
15. J. Pierce and T. N. Mudge. Wrong-path instruction prefetching. In International Symposium on Microarchitecture, pp. 165–175, 1996.
16. IBM Regains Performance Lead with Power2. Microprocessor Report, October 1993.
17. PowerPC 740/PowerPC 750 RISC Microprocessor User's Manual. IBM Corporation, 1999.
18. G. Reinman, B. Calder, and T. Austin. Fetch directed instruction prefetching. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-32), pp. 16–27, Haifa, Israel, November 1999.
19. J. A. Rivers, G. S. Tyson, E. S. Davidson, and T. M. Austin. On high-bandwidth data cache design for multi-issue processors. In Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, pp. 46–56, December 1–3, 1997.
20. International Technology Roadmap for Semiconductors, 1998 Update. Semiconductor Industry Association, p. 4, 1998.
21. K. Skadron, P. S. Ahuja, M. Martonosi, and D. W. Clark. Improving prediction for procedure returns with return-address-stack repair mechanisms. In International Symposium on Microarchitecture, pp. 259–271, 1998.
22. A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473–530, September 1982.
23. V. Srinivasan, E. S. Davidson, G. S. Tyson, M. J. Charney, and T. R. Puzak. Branch history guided instruction prefetching. In Proceedings of the 7th International Conference on High Performance Computer Architecture (HPCA), pp. 291–300, Monterrey, Mexico, January 2001.
24. J. Tse and A. J. Smith. CPU cache prefetching: timing evaluation of hardware implementations. IEEE Transactions on Computers, 47(5):509–526, May 1998.
25. K. Yeager, A. Ani, A. Bomdica, G. Shippen, H. Sucar, H. Su, J. Chuang, N. Vasseghi, R. Ramchandani, R. Martin, R. Conrad, Y. Chen, W. Voegtli Jr., M. Seddighnezhad, and Y. Van Atta. MIPS R10000 Superscalar Microprocessor. Hot Chips VII, 1995.
Cite this article
Zhang, Y., Haga, S. & Barua, R. Execution History Guided Instruction Prefetching. The Journal of Supercomputing 27, 129–147 (2004). https://doi.org/10.1023/B:SUPE.0000009319.31230.a9
DOI: https://doi.org/10.1023/B:SUPE.0000009319.31230.a9