A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures

Authors:
Sri Harsha Gade, Indraprastha Institute of Information Technology Delhi, New Delhi, India
Sujay Deb, Indraprastha Institute of Information Technology Delhi, New Delhi, India

ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 1, Article No. 2, pp. 1–31. https://doi.org/10.1145/3462775

Published: 13 September 2021

Abstract

Cache coherence ensures the correctness of cached data in multi-core processors. Existing protocols, as traditionally implemented, do not scale to many-core architectures: snoopy coherence requires ordered networks that do not scale, while directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many-core processors. WiSH implements a novel Snoopy over Directory protocol using on-chip wireless links and a hierarchical, clustered Network-on-Chip to achieve low-overhead, highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among clusters is achieved through a global snoopy protocol. The ordered network for global snooping is provided by low-latency, low-energy broadcast wireless links. Overheads are further reduced through share-aware cache segmentation, which eliminates coherence for private blocks. Evaluations show that WiSH reduces traffic and runtime while requiring less storage and lower energy than existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many-core processors.
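
The two-level idea in the abstract can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not the authors' implementation: the names `Cluster`, `HybridCoherence`, `private_blocks`, and `broadcasts` are hypothetical, the set of private blocks is given up front rather than detected, every write to a shared block is assumed to trigger one global broadcast, and the wireless links are reduced to a counter.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of cores with a local directory: block -> set of sharer cores."""
    cid: int
    directory: dict = field(default_factory=dict)

    def local_read(self, core, block):
        # True if some core in this cluster already holds the block.
        hit = block in self.directory
        self.directory.setdefault(block, set()).add(core)
        return hit

    def snoop_invalidate(self, block):
        # Answer a global snoop by dropping every local copy; return how many.
        return len(self.directory.pop(block, set()))


class HybridCoherence:
    """Toy model: directory inside a cluster, broadcast snooping between clusters."""

    def __init__(self, n_clusters, private_blocks=()):
        self.clusters = [Cluster(i) for i in range(n_clusters)]
        # Assumption: private blocks are known in advance and skip coherence.
        self.private_blocks = set(private_blocks)
        self.broadcasts = 0  # stand-in for traffic on the global broadcast links

    def read(self, cid, core, block):
        if block in self.private_blocks:
            return "private"  # no directory entry, no snoop
        if self.clusters[cid].local_read(core, block):
            return "local"  # resolved by the cluster's own directory
        self.broadcasts += 1  # local miss: snoop the other clusters
        return "global"

    def write(self, cid, core, block):
        if block in self.private_blocks:
            return 0
        self.broadcasts += 1  # simplification: every shared write is broadcast
        invalidated = sum(
            c.snoop_invalidate(block) for c in self.clusters if c.cid != cid
        )
        local = self.clusters[cid].directory
        invalidated += len(local.get(block, set()) - {core})
        local[block] = {core}  # the writer becomes the sole sharer
        return invalidated
```

In this model, reads served by the local directory and accesses to private blocks never increment `broadcasts`, which mirrors in a simplified way the traffic argument the abstract makes for clustering and share-aware segmentation.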

Index Terms

A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Interconnection architectures
      2. Multicore architectures

Published in
ACM Transactions on Design Automation of Electronic Systems Volume 27, Issue 1
January 2022
230 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3483335
Editor:
X. Sharon Hu
University of Notre Dame, USA
Copyright © 2021 Association for Computing Machinery.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 13 September 2021
- Accepted: 1 April 2021
- Revised: 1 March 2021
- Received: 1 May 2020
Author Tags
Cache coherence
hybrid protocol
many core processors
mm-wave wireless links
Qualifiers
- research-article
- Refereed