Abstract
Cache coherence ensures the correctness of cached data in multi-core processors. Traditional implementations of existing protocols do not scale to many-core architectures: snoopy coherence requires ordered networks that do not scale, while directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many-core processors. WiSH implements a novel Snoopy over Directory protocol using on-chip wireless links and a hierarchical, clustered Network-on-Chip to achieve low-overhead and highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among clusters is maintained by a global snoopy protocol. The ordered network required for global snooping is provided by low-latency, low-energy broadcast wireless links. Overheads are further reduced through share-aware cache segmentation, which eliminates coherence for private blocks. Evaluations show that WiSH reduces traffic by \(\) and runtime by \(\), while requiring \(\) smaller storage and \(\) lower energy than existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many-core processors.
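The two-level idea described above (a directory inside each cluster, broadcast snooping between clusters, and no coherence tracking for private blocks) can be illustrated with a toy model. This is only a sketch under strong assumptions and is not the WiSH protocol itself: the names `Cluster`, `ToyHybrid`, `owner`, and the `private` flag are hypothetical, every global broadcast is simply counted rather than modeled as a wireless transmission, and cache states, ordering, and write-backs are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster of cores with a local directory: block -> set of sharer core ids."""
    cid: int
    directory: dict = field(default_factory=dict)

    def snoop_invalidate(self, block):
        """React to a global broadcast: drop all local copies, return how many were dropped."""
        return len(self.directory.pop(block, set()))


class ToyHybrid:
    """Toy model (assumption, not the paper's design): directory within a cluster,
    counted broadcasts across clusters."""

    def __init__(self, n_clusters):
        self.clusters = [Cluster(i) for i in range(n_clusters)]
        self.owner = {}          # block -> cluster id holding it exclusively
        self.broadcasts = 0      # global broadcasts issued
        self.invalidations = 0   # cached copies invalidated, in any cluster

    def read(self, cid, core, block, private=False):
        if private:
            return  # assumption: private blocks are never tracked
        home = self.clusters[cid]
        if block not in home.directory:
            # Cluster-level miss: ask the other clusters with one broadcast.
            self.broadcasts += 1
            if self.owner.get(block) != cid:
                self.owner.pop(block, None)  # any previous owner loses exclusivity
        home.directory.setdefault(block, set()).add(core)

    def write(self, cid, core, block, private=False):
        if private:
            return  # assumption: private blocks generate no coherence actions
        home = self.clusters[cid]
        # Local directory invalidates the other sharers inside this cluster.
        self.invalidations += len(home.directory.get(block, set()) - {core})
        home.directory[block] = {core}
        if self.owner.get(block) != cid:
            # One broadcast reaches every other cluster, which drops its copies.
            self.broadcasts += 1
            for other in self.clusters:
                if other.cid != cid:
                    self.invalidations += other.snoop_invalidate(block)
            self.owner[block] = cid
```

In this sketch, a second write from the same cluster to a block it already owns costs no broadcast, and accesses flagged as private never touch the directory or the broadcast counter, which is the intuition behind combining the two levels with share-aware filtering.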
Index Terms
- A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures