Dynamic Memory Instruction Bypassing

Ortega, Daniel; Valero, Mateo; Ayguadé, Eduard

doi:10.1023/B:IJPP.0000029273.49634.19

Abstract

Reducing the latency of load instructions is among the most crucial aspects of achieving high performance in current and future microarchitectures. Deep pipelining lengthens load-to-use latency even for loads that hit in the cache. In this paper we present a dynamic mechanism that detects the relationships between address-producing instructions and the loads that consume those addresses, and uses this information to access data before the load is even fetched from the I-Cache. The mechanism is not intended to prefetch from off-chip memory; instead, it silently moves data from L1 and L2 into the register file ahead of time, allowing the load instruction to be bypassed (hence the name). It achieves an average performance improvement of 22.24% on the SPECint95 benchmarks.
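
The idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design described in the paper: the class name `BypassModel`, the table layout, the single link per producer, and the PCs and register names are all hypothetical, and the model ignores stores, table capacity, and misspeculation recovery. It learns which instruction produces the base address of a load, and on later executions of that producer it reads the data early so that the load finds its value already available.

```python
from dataclasses import dataclass


@dataclass
class Link:
    load_pc: int  # PC of the load that consumes the produced address
    offset: int   # displacement the load adds to the base register


class BypassModel:
    """Toy model (hypothetical): learn producer-to-load links, then read data early."""

    def __init__(self, memory):
        self.memory = memory    # dict address -> value, stands in for L1/L2
        self.last_writer = {}   # register name -> PC of its latest producer
        self.links = {}         # producer PC -> Link learned from a past load
        self.early = {}         # load PC -> (address, value) read ahead of time
        self.bypassed = 0       # loads satisfied by an early read

    def produce(self, pc, dest_reg, value):
        """The instruction at `pc` writes an address-like value into `dest_reg`."""
        self.last_writer[dest_reg] = pc
        link = self.links.get(pc)
        if link is not None:
            addr = value + link.offset
            if addr in self.memory:
                # Read the data now, before the consuming load is seen.
                self.early[link.load_pc] = (addr, self.memory[addr])

    def load(self, pc, base_reg, base_value, offset):
        """The load at `pc` reads memory[base_value + offset]."""
        addr = base_value + offset
        hit = self.early.pop(pc, None)
        producer = self.last_writer.get(base_reg)
        if producer is not None:
            # Remember which instruction produced this load's base address.
            self.links[producer] = Link(pc, offset)
        if hit is not None and hit[0] == addr:
            self.bypassed += 1
            return hit[1]
        return self.memory[addr]


# A loop in which PC 0x10 produces a base address and PC 0x14 loads base + 4.
memory = {100 + 8 * i + 4: i * 10 for i in range(4)}
model = BypassModel(memory)
values = []
for i in range(4):
    base = 100 + 8 * i
    model.produce(pc=0x10, dest_reg="r1", value=base)
    values.append(model.load(pc=0x14, base_reg="r1", base_value=base, offset=4))

print(values, model.bypassed)
```

In this model the first iteration only trains the link, so the first load goes to memory as usual; the remaining three loads are satisfied from the early read.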

Author information

Authors and Affiliations

Barcelona Research Office, Hewlett Packard Laboratories, Barcelona, Spain
Daniel Ortega
Depto. de Arquitectura de Computadores, Universidad Politécnica de Cataluña, Barcelona, Spain
Mateo Valero & Eduard Ayguadé

About this article

Cite this article

Ortega, D., Valero, M. & Ayguadé, E. Dynamic Memory Instruction Bypassing. International Journal of Parallel Programming 32, 199–224 (2004). https://doi.org/10.1023/B:IJPP.0000029273.49634.19

Issue Date: June 2004
DOI: https://doi.org/10.1023/B:IJPP.0000029273.49634.19
