Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures

Author:

Rajeev Balasubramonian

ICS '04: Proceedings of the 18th annual international conference on Supercomputing

Pages 326 - 335

https://doi.org/10.1145/1006209.1006255

Published: 26 June 2004

Abstract

The growing dominance of wire delays at future technology points renders a microprocessor communication-bound. Clustered microarchitectures allow most dependence chains to execute without being affected by long on-chip wire latencies. They also allow faster clock speeds and reduce design complexity, thereby emerging as a popular design choice for future microprocessors. However, a centralized data cache threatens to be the primary bottleneck in highly clustered systems. The paper attempts to identify the most complexity-effective approach to alleviate this bottleneck. While decentralized cache organizations have been proposed, they introduce excessive logic and wiring complexity. The paper evaluates whether the performance gains of a decentralized cache are worth the increase in complexity. We also introduce and evaluate the behavior of Cluster Prefetch, the forwarding of data values to a cluster through accurate address prediction. Our results show that the success of this technique depends on accurate speculation across unresolved stores. The technique applies to a wide class of processor models and, most importantly, allows high performance even while employing a simple centralized data cache. We conclude that address prediction holds more promise for future wire-delay-limited processors than decentralized cache organizations.
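
The abstract does not say which address predictor Cluster Prefetch relies on. As a rough illustration of what "accurate address prediction" for loads can look like, the sketch below assumes a simple per-load stride predictor with a saturating confidence counter. The class name, table layout, and thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Per-load state: last observed address, last stride, confidence."""
    last_addr: int
    stride: int = 0
    confidence: int = 0


class StrideAddressPredictor:
    """Hypothetical stride predictor indexed by the load's PC (an assumption)."""

    def __init__(self, threshold: int = 2, max_conf: int = 3):
        self.table = {}             # load PC -> Entry
        self.threshold = threshold  # minimum confidence before predicting
        self.max_conf = max_conf    # saturation point of the counter

    def predict(self, pc: int):
        """Return the predicted next address, or None if not confident."""
        entry = self.table.get(pc)
        if entry is None or entry.confidence < self.threshold:
            return None
        return entry.last_addr + entry.stride

    def update(self, pc: int, addr: int) -> None:
        """Train the predictor with the address the load actually used."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(last_addr=addr)
            return
        stride = addr - entry.last_addr
        if stride == entry.stride:
            # Same stride as last time: grow confidence up to the cap.
            entry.confidence = min(entry.confidence + 1, self.max_conf)
        else:
            # Stride changed: remember the new one and start over.
            entry.stride = stride
            entry.confidence = 0
        entry.last_addr = addr
```

In a scheme of this kind, a confident prediction would let the cache forward the predicted line's data toward the consuming cluster before the load's real address is computed. A `None` result simply means no early forwarding is attempted.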

Index Terms

Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures
1. Computer systems organization
  1. Architectures
    1. Serial architectures
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ICS '04: Proceedings of the 18th annual international conference on Supercomputing

June 2004

360 pages

ISBN:1581138393

DOI:10.1145/1006209

General Chair: Paul Feautrier (LIP, ENS Lyon)

Program Chairs: James Goodman (University of Auckland), André Seznec (IRISA, INRIA)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS04: International Conference on Supercomputing 2004

June 26 - July 1, 2004

Saint-Malo, France

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 4
Total Downloads: 342
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Cited By

S. Bieschewski, J. Parcerisa, and A. González. An Energy-Efficient Memory Unit for Clustered Microarchitectures. IEEE Transactions on Computers, 65(8):2631--2637, 2016. Online publication date: 1-Aug-2016.
https://dl.acm.org/doi/10.1109/TC.2015.2493518

R. LaDuca, J. Sharkey, and D. Ponomarev. Hiding Communication Delays in Clustered Microarchitectures. In Proceedings of the 2008 20th International Symposium on Computer Architecture and High Performance Computing, pages 107--114, 2008. Online publication date: 29-Oct-2008.
https://dl.acm.org/doi/10.1109/SBAC-PAD.2008.30

N. Muralimanohar, K. Ramani, and R. Balasubramonian. Power efficient resource scaling in partitioned architectures through dynamic heterogeneity. In 2006 IEEE International Symposium on Performance Analysis of Systems and Software, pages 100--111, 2006. Online publication date: 2006.
https://doi.org/10.1109/ISPASS.2006.1620794

R. Balasubramonian, N. Muralimanohar, K. Ramani, and V. Venkatachalapathy. Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 28--39, 2005. Online publication date: 12-Feb-2005.
https://dl.acm.org/doi/10.1109/HPCA.2005.21
