An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs

International Journal of Parallel Programming

Abstract

The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to larger numbers of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and thus require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. The mechanism rests on two key ideas: the OS maps lines to physical caches at page granularity, and the hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and by extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance.
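The two key ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `PageHomeMap`, the 4 KiB page size, and the first-touch placement policy are all assumptions made for illustration. The sketch only shows how a page-level mapping gives every line a single home cache, so that an access is either local or a remote cache access and no line is ever replicated across tiles.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not stated in the abstract)


@dataclass
class PageHomeMap:
    """Hypothetical OS-side table mapping each page to one home tile."""
    num_tiles: int
    home_of_page: dict = field(default_factory=dict)

    def home_tile(self, addr: int, requesting_tile: int) -> int:
        """Return the tile whose cache may hold the line at `addr`."""
        if not 0 <= requesting_tile < self.num_tiles:
            raise ValueError("unknown tile")
        page = addr // PAGE_SIZE
        # Assumed policy: the first tile to touch a page becomes its home.
        return self.home_of_page.setdefault(page, requesting_tile)

    def access(self, addr: int, requesting_tile: int) -> str:
        """Classify an access as a local or a remote cache access."""
        home = self.home_tile(addr, requesting_tile)
        return "local" if home == requesting_tile else "remote"


mapping = PageHomeMap(num_tiles=4)
print(mapping.access(0x1000, 0))  # tile 0 touches the page first
print(mapping.access(0x1008, 1))  # same page, requested from tile 1
```

Because every line of a page lives in exactly one cache under this mapping, two tiles can never hold diverging copies of the same line, which is why a scheme of this kind can do without a hardware coherence protocol.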

Author information

Corresponding author

Correspondence to Christian Fensch.

About this article

Cite this article

Fensch, C., Cintra, M. An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs. Int J Parallel Prog 39, 271–295 (2011). https://doi.org/10.1007/s10766-010-0162-1

  • DOI: https://doi.org/10.1007/s10766-010-0162-1
