Dynamic Memory Instruction Bypassing

International Journal of Parallel Programming

Abstract

Reducing the latency of load instructions is among the most crucial aspects of achieving high performance in current and future microarchitectures. Deep pipelining increases load-to-use latency even for loads that hit in the cache. In this paper we present a dynamic mechanism that detects relations between address-producing instructions and the loads that consume those addresses, and uses this information to access data before the load is even fetched from the I-Cache. The mechanism is not intended to prefetch from off-chip memory; instead, it silently moves data from L1 and L2 into the register file ahead of time, allowing the load instruction to be bypassed (hence the name). An average performance improvement of 22.24% is achieved on the SPECint95 benchmarks.
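
The idea summarized in the abstract can be illustrated with a toy software model. The sketch below is not the hardware design from the paper, whose details are not given on this page. It assumes a simple table that links the PC of an address-producing instruction to the load that consumed its result, and it ignores stores, cache levels, and timing. All names (`BypassModel`, `Link`, `last_writer`, `preloaded`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Link:
    """Learned relation: a producer's result feeds this load at this offset."""
    load_pc: int
    offset: int


@dataclass
class BypassModel:
    """Toy model only: no stores, no cache hierarchy, no pipeline timing."""
    memory: dict[int, int]
    regs: dict[str, int] = field(default_factory=dict)
    last_writer: dict[str, int] = field(default_factory=dict)  # reg -> producer PC
    links: dict[int, Link] = field(default_factory=dict)       # producer PC -> Link
    preloaded: dict[int, tuple[int, int]] = field(default_factory=dict)  # load PC -> (addr, value)
    bypassed: int = 0

    def produce(self, pc: int, dest: str, value: int) -> None:
        """Execute an address-producing instruction that writes `dest`."""
        self.regs[dest] = value
        self.last_writer[dest] = pc
        link = self.links.get(pc)
        if link is not None:
            # A consumer load is known: read its data before the load arrives.
            addr = value + link.offset
            self.preloaded[link.load_pc] = (addr, self.memory.get(addr, 0))

    def load(self, pc: int, dest: str, base: str, offset: int) -> int:
        """Execute a load dest <- memory[regs[base] + offset]."""
        addr = self.regs[base] + offset
        early = self.preloaded.pop(pc, None)
        if early is not None and early[0] == addr:
            value = early[1]          # data was fetched ahead of time
            self.bypassed += 1
        else:
            value = self.memory.get(addr, 0)
        producer = self.last_writer.get(base)
        if producer is not None:
            # Train: remember which instruction produced this load's address.
            self.links[producer] = Link(pc, offset)
        self.regs[dest] = value
        # Loads acting as address producers are not modelled in this sketch.
        self.last_writer.pop(dest, None)
        return value


# Two iterations of the same producer/load pair (assumed PCs and addresses).
model = BypassModel(memory={100: 7, 108: 9})
model.produce(pc=0x10, dest="r1", value=100)
first = model.load(pc=0x14, dest="r2", base="r1", offset=0)   # trains the link
model.produce(pc=0x10, dest="r1", value=108)
second = model.load(pc=0x14, dest="r2", base="r1", offset=0)  # served early
```

In this model the first execution of the load only trains the link. On the second iteration the producer triggers the early access, so the load finds its value already available. A real implementation would also need to invalidate early values when an intervening store writes the same address, which this sketch deliberately leaves out.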

Cite this article

Ortega, D., Valero, M. & Ayguadé, E. Dynamic Memory Instruction Bypassing. International Journal of Parallel Programming 32, 199–224 (2004). https://doi.org/10.1023/B:IJPP.0000029273.49634.19

  • DOI: https://doi.org/10.1023/B:IJPP.0000029273.49634.19
