Article

Proximity-aware directory-based coherence for multi-core processor architectures

Authors:

Jeffery A. Brown,

Dean Tullsen

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

Pages 126 - 134

https://doi.org/10.1145/1248377.1248398

Published: 09 June 2007

Abstract

As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures.

Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory, and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups up to 74.9% for our workloads.
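The core idea of servicing a miss from the spatially closest sharer can be illustrated with a minimal sketch. It is not the authors' protocol: the abstract does not specify the topology or distance metric, so the 2D mesh layout, the Manhattan hop count, and the names `Tile`, `hops`, and `pick_supplier` are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

# Assumption: cores sit on a 2D mesh and are identified by (row, col).
Tile = Tuple[int, int]


def hops(a: Tile, b: Tile) -> int:
    """Manhattan distance, used here as a stand-in for network distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pick_supplier(requester: Tile, sharers: Iterable[Tile]) -> Optional[Tile]:
    """Return the sharer closest to the requester.

    Returns None when no other on-chip cache holds the line, in which
    case the request would have to be satisfied from off-chip memory.
    """
    candidates = [s for s in sharers if s != requester]
    if not candidates:
        return None
    # Ties are broken by tile coordinates so the choice is deterministic.
    return min(candidates, key=lambda s: (hops(requester, s), s))


if __name__ == "__main__":
    # Hypothetical sharer set recorded in a directory entry.
    print(pick_supplier((1, 1), [(0, 0), (3, 3), (1, 2)]))
```

In this sketch the directory would forward the request to the returned tile for a cache-to-cache transfer. A real design would also have to handle coherence states, invalidations on writes, and races, none of which are modeled here.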

Index Terms

Proximity-aware directory-based coherence for multi-core processor architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007

376 pages

ISBN: 9781595936677

DOI: 10.1145/1248377

General Chair: Phillip B. Gibbons, Intel Research, USA

Program Chair: Christian Scheideler, Technische Universität München, Germany

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SPAA07: 19th ACM Symposium on Parallelism in Algorithms and Architectures

June 9 - 11, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Article Metrics

Total Citations: 38
Total Downloads: 856
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 8

Reflects downloads up to 05 Mar 2025

Cited By

Adegbija T, Tandon R (2017). Coding for Efficient Caching in Multicore Embedded Systems. 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 296-301. Online publication date: Jul-2017.
https://doi.org/10.1109/ISVLSI.2017.59
Joshi A, Vollala S, Begum B, Ramasubramanian N (2016). Performance Analysis of Cache Coherence Protocols for Multi-core Architectures. Proceedings of the International Conference on Advances in Information Communication Technology & Computing, pp. 1-7. Online publication date: 12-Aug-2016.
https://dl.acm.org/doi/10.1145/2979779.2979801
Marandola J, Louise S, Cudennec L (2016). Pattern Based Cache Coherency Architecture for Embedded Manycores. Procedia Computer Science 80:C, pp. 1542-1553. Online publication date: 1-Jun-2016.
https://dl.acm.org/doi/10.1016/j.procs.2016.05.481
Mallya N, Patil G, Raveendran B (2015). Simulation based Performance Study of Cache Coherence Protocols. Proceedings of the 2015 IEEE International Symposium on Nanoelectronic and Information Systems (iNIS), pp. 125-130. Online publication date: 21-Dec-2015.
https://dl.acm.org/doi/10.1109/iNIS.2015.52
Kayi A, Serres O, El-Ghazawi T (2015). Adaptive Cache Coherence Mechanisms with Producer–Consumer Sharing Optimization for Chip Multiprocessors. IEEE Transactions on Computers 64:2, pp. 316-328. Online publication date: 1-Feb-2015.
https://dl.acm.org/doi/10.1109/TC.2013.217
Li G, Temam O, Liu Z, Guo S, Wang D (2015). Cluster Cache Monitor. International Journal of Parallel Programming 43:6, pp. 1054-1077. Online publication date: 1-Dec-2015.
https://dl.acm.org/doi/10.1007/s10766-014-0339-0
Clarke H, Trouvé A, Murakami K (2014). Accelerated design space pruning for CMP memory architectures. Proceedings of the High Performance Computing Symposium, pp. 1-6. Online publication date: 13-Apr-2014.
https://dl.acm.org/doi/10.5555/2663510.2663535
Li Y, Melhem R, Jones A (2014). A Practical Data Classification Framework for Scalable and High Performance Chip-Multiprocessors. IEEE Transactions on Computers 63:12, pp. 2905-2918. Online publication date: 1-Dec-2014.
https://dl.acm.org/doi/10.1109/TC.2013.161
Li G, Liu Z, Guo S, Wang D (2013). Bayesian Theory Based Adaptive Proximity Data Accessing for CMP Caches. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E96.A:6, pp. 1293-1305. Online publication date: 2013.
https://doi.org/10.1587/transfun.E96.A.1293
(2013). Bibliography. Multicore Technology, pp. 409-450. Online publication date: 18-Jul-2013.
https://doi.org/10.1201/b15268-20
