An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs

Fensch, Christian; Cintra, Marcelo

doi:10.1007/s10766-010-0162-1

Published: 29 December 2010

Volume 39, pages 271–295 (2011)
International Journal of Parallel Programming

Abstract

The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to larger numbers of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and thus require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. This mechanism is based on two key ideas: the mapping of lines to physical caches is done at the page level with OS support, and the hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and by extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance.
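
The two key ideas in the abstract can be illustrated with a minimal sketch. It is a toy model, not the authors' implementation: the `PageMapper` class, the 4 KiB page size, and the first-touch placement policy are all assumptions made for illustration. The only points taken from the abstract are that lines are mapped to a cache at page granularity by the OS, and that a tile can access a line held in another tile's cache.

```python
PAGE_SIZE = 4096  # bytes per page; an assumed value, not taken from the article


class PageMapper:
    """Toy model: the OS assigns each page to exactly one tile's cache."""

    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.home_tile = {}  # page number -> tile whose cache may hold the page's lines

    def access(self, tile, address):
        """Return ("local" or "remote", home tile) for an access issued by `tile`."""
        page = address // PAGE_SIZE
        # Assumed first-touch placement: the first tile to touch a page becomes its home.
        home = self.home_tile.setdefault(page, tile)
        kind = "local" if home == tile else "remote"
        return kind, home


mapper = PageMapper(num_tiles=4)
first = mapper.access(tile=0, address=0x1000)   # tile 0 touches the page first
second = mapper.access(tile=2, address=0x1008)  # same page, different tile
third = mapper.access(tile=2, address=0x2000)   # a new page, first touched by tile 2
```

In this sketch every line has a single possible location on chip, so no two caches ever hold copies that could diverge; a tile that is not the home of a page reaches the data through a remote cache access instead of caching it locally.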

Author information

Authors and Affiliations

School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
Christian Fensch & Marcelo Cintra

Corresponding author

Correspondence to Christian Fensch.

About this article

Cite this article

Fensch, C., Cintra, M. An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs. Int J Parallel Prog 39, 271–295 (2011). https://doi.org/10.1007/s10766-010-0162-1

Received: 22 January 2009
Accepted: 09 December 2010
Published: 29 December 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10766-010-0162-1
