ABSTRACT
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters. To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100 Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight processors per node, plus additional reserved protocol processors) that range from 6.9 on the communication-intensive FFT program to 21.6 on Ocean (both from the SPLASH-2 suite). In general, clustering is effective in reducing internode miss rates, but as the cluster size increases, increases in the remote latency, mostly due to increased TLB synchronization cost, offset the advantages. For communication-intensive applications such as FFT, the overhead of sending out network requests, the limited network bandwidth, and the long network latency prevent the achievement of good performance. Overall, this approach still appears promising, but our results indicate that lower-latency networks may be needed to make cluster-based virtual shared-memory machines broadly useful as large-scale shared-memory multiprocessors.
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory