research-article

The case for simple, visible cache coherency

Authors:

Mark Horowitz

MSPC '08: Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)

Pages 31 - 35

https://doi.org/10.1145/1353522.1353532

Published: 02 March 2008

Abstract

The shared memory research community has proposed many complex communication protocols that aim to eliminate specific performance bottlenecks while still providing an easy-to-use communication interface. Although tailored protocols can eliminate some bottlenecks that arise in real applications, removing the cause of the bottleneck through software optimizations and bug fixes is cheaper to implement, faster to fix (once found), and requires no hardware support beyond a simple shared memory interface. In fact, in our experience, the choice of coherence protocol is much less important than providing efficient hardware feedback that identifies the source of the problem. Future cache-coherence research should focus on illuminating memory system behavior, providing smarter tools to identify bottlenecks, and helping to eliminate them through software optimizations.

Published In

MSPC '08: Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)

March 2008

44 pages

ISBN:9781605580494

DOI:10.1145/1353522

General Chair: Emery Berger (University of Massachusetts, Amherst)

Program Chair: Brad Chen (Google)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS08: Architectural Support for Programming Languages and Operating Systems

Sponsor: SIGPLAN

March 2, 2008

Seattle, Washington

Acceptance Rates

Overall Acceptance Rate 6 of 20 submissions, 30%
