
Shared memory computing on clusters with symmetric multiprocessors and system area networks

Published: 01 August 2005

Abstract

Cashmere is a software distributed shared memory (S-DSM) system designed for clusters of server-class machines. It is distinguished from most other S-DSM projects by (1) the effective use of fast user-level messaging, as provided by modern system-area networks, and (2) a “two-level” protocol structure that exploits hardware coherence within multiprocessor nodes. Fast user-level messages change the tradeoffs in coherence protocol design; they allow Cashmere to employ a relatively simple directory-based coherence protocol. Exploiting hardware coherence within SMP nodes improves overall performance when care is taken to avoid interference with inter-node software coherence.

We have implemented Cashmere on a Compaq AlphaServer/Memory Channel cluster, an architecture that provides fast user-level messages. Experiments indicate that a one-level version of the Cashmere protocol provides performance comparable to, or slightly better than, that of TreadMarks' lazy release consistency. Comparisons to Compaq's Shasta protocol also suggest that while fast user-level messages make finer-grain software DSMs competitive, VM-based systems continue to outperform software-based access control for applications without extensive fine-grain sharing.

Within the family of Cashmere protocols, we find that leveraging intranode hardware coherence provides a 37% performance advantage over a more straightforward one-level implementation. Moreover, contrary to our original expectations, noncoherent hardware support for remote memory writes, total message ordering, and broadcast provides comparatively little in the way of additional benefits over just fast messaging for our application suite.


Published In

ACM Transactions on Computer Systems  Volume 23, Issue 3
August 2005
117 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/1082469

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed shared memory
  2. relaxed consistency
  3. software coherence

Qualifiers

  • Article
