Abstract
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to larger numbers of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and thus require alternative solutions for cache coherence. In this article, we investigate a novel, cost-effective mechanism that supports shared-memory parallel applications without hardware-maintained cache coherence. The mechanism rests on two key ideas: the mapping of cache lines to physical caches is done at page granularity with OS support, and the hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and by extending the system to support multi-level cache hierarchies. Results show that the way multi-level cache hierarchies are implemented can have a significant impact on performance.
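The two key ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the page size, the `PageMapper` class, and the `access` function are hypothetical names chosen for illustration, and the sketch assumes that each page is assigned to exactly one home tile whose cache may hold its lines.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not stated in the abstract)


@dataclass
class PageMapper:
    """Hypothetical OS-side table assigning each page to one home tile."""
    num_tiles: int
    page_to_tile: dict = field(default_factory=dict)

    def map_page(self, page: int, tile: int) -> None:
        # The OS decides, per page, which tile's cache may hold its lines.
        if not 0 <= tile < self.num_tiles:
            raise ValueError("tile out of range")
        self.page_to_tile[page] = tile

    def home_tile(self, addr: int) -> int:
        # Every line in a page shares the page's home tile.
        return self.page_to_tile[addr // PAGE_SIZE]


def access(mapper: PageMapper, requesting_tile: int, addr: int) -> str:
    """Classify an access as local or as a remote cache access."""
    home = mapper.home_tile(addr)
    if home == requesting_tile:
        return "local"
    # Assumed behaviour: the request is forwarded to the home tile's cache
    # instead of copying the line into the requester's cache.
    return f"remote:{home}"
```

Under this assumption a line is only ever cached in one place, so no hardware protocol is needed to keep multiple copies consistent; the cost moves to the latency of remote accesses and to the OS's page placement decisions.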
Cite this article
Fensch, C., Cintra, M. An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs. Int J Parallel Prog 39, 271–295 (2011). https://doi.org/10.1007/s10766-010-0162-1
DOI: https://doi.org/10.1007/s10766-010-0162-1