DOI: 10.1145/1006209.1006255

Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures

Published: 26 June 2004

Abstract

The growing dominance of wire delays at future technology points renders a microprocessor communication-bound. Clustered microarchitectures allow most dependence chains to execute without being affected by long on-chip wire latencies. They also allow faster clock speeds and reduce design complexity, and have thereby emerged as a popular design choice for future microprocessors. However, a centralized data cache threatens to be the primary bottleneck in highly clustered systems. The paper attempts to identify the most complexity-effective approach to alleviating this bottleneck. While decentralized cache organizations have been proposed, they introduce excessive logic and wiring complexity. The paper evaluates whether the performance gains of a decentralized cache are worth the increase in complexity. We also introduce and evaluate the behavior of Cluster Prefetch: the forwarding of data values to a cluster through accurate address prediction. Our results show that the success of this technique depends on accurate speculation across unresolved stores. The technique applies to a wide class of processor models and, most importantly, it allows high performance even while employing a simple centralized data cache. We conclude that address prediction holds more promise for future wire-delay-limited processors than decentralized cache organizations.
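To make the idea of address prediction concrete, the following is a minimal sketch of a per-load stride predictor. It is an illustration only: the abstract does not say which predictor the paper uses, so the stride scheme, the class and function names (`StridePredictor`, `count_prefetch_hits`), and the confidence threshold are all assumptions. Speculation across unresolved stores, which the abstract identifies as critical, is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    last_addr: int       # most recent address seen for this load
    stride: int = 0      # last observed address delta
    confidence: int = 0  # saturating counter of repeated strides


class StridePredictor:
    """Hypothetical per-PC stride predictor for load addresses."""

    THRESHOLD = 2  # assumed: predict only after the stride repeats twice
    MAX_CONF = 3   # assumed saturation limit

    def __init__(self) -> None:
        self.table: Dict[int, Entry] = {}

    def predict(self, pc: int) -> Optional[int]:
        """Return a predicted address, or None if not yet confident."""
        entry = self.table.get(pc)
        if entry is None or entry.confidence < self.THRESHOLD:
            return None
        return entry.last_addr + entry.stride

    def update(self, pc: int, addr: int) -> None:
        """Train the predictor with the address the load actually used."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(last_addr=addr)
            return
        stride = addr - entry.last_addr
        if stride == entry.stride:
            entry.confidence = min(entry.confidence + 1, self.MAX_CONF)
        else:
            entry.stride = stride
            entry.confidence = 0
        entry.last_addr = addr


def count_prefetch_hits(pc: int, addrs: List[int]) -> int:
    """Count loads whose address was predicted correctly in advance.

    Each hit stands for a case where the data could have been
    forwarded to the consuming cluster ahead of the load.
    """
    predictor = StridePredictor()
    hits = 0
    for addr in addrs:
        if predictor.predict(pc) == addr:
            hits += 1
        predictor.update(pc, addr)
    return hits
```

In this sketch, a load that walks an array with a fixed stride starts producing correct predictions once the stride has repeated often enough to pass the threshold, while an irregular address stream never triggers a prediction. The confidence counter is a common way to trade coverage for accuracy; a real design would also have to decide what happens when a prediction turns out to be wrong.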

Cited By

  • (2016) An Energy-Efficient Memory Unit for Clustered Microarchitectures. IEEE Transactions on Computers 65:8 (2631-2637). DOI: 10.1109/TC.2015.2493518. Online publication date: 1-Aug-2016
  • (2008) Hiding Communication Delays in Clustered Microarchitectures. Proceedings of the 2008 20th International Symposium on Computer Architecture and High Performance Computing (107-114). DOI: 10.1109/SBAC-PAD.2008.30. Online publication date: 29-Oct-2008
  • (2006) Power efficient resource scaling in partitioned architectures through dynamic heterogeneity. 2006 IEEE International Symposium on Performance Analysis of Systems and Software (100-111). DOI: 10.1109/ISPASS.2006.1620794. Online publication date: 2006
  • (2005) Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. Proceedings of the 11th International Symposium on High-Performance Computer Architecture (28-39). DOI: 10.1109/HPCA.2005.21. Online publication date: 12-Feb-2005

      Published In

      ICS '04: Proceedings of the 18th annual international conference on Supercomputing
      June 2004
      360 pages
      ISBN:1581138393
      DOI:10.1145/1006209

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. clustered microarchitectures
      2. communication-bound processors
      3. data prefetch
      4. distributed caches
      5. effective address and memory dependence prediction
      6. processor

      Qualifiers

      • Article

      Conference

      ICS04

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%
