ABSTRACT
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters. To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100 Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight processors per node, plus additional reserved protocol processors) that range from 6.9 on the communication-intensive FFT program to 21.6 on Ocean (both from the SPLASH-2 suite). In general, clustering is effective in reducing internode miss rates, but as the cluster size increases, increases in the remote latency, mostly due to increased TLB synchronization cost, offset the advantages. For communication-intensive applications such as FFT, the overhead of sending out network requests, the limited network bandwidth, and the long network latency prevent the achievement of good performance. Overall, this approach still appears promising, but our results indicate that lower-latency networks may be needed to make cluster-based virtual shared-memory machines broadly useful as large-scale shared-memory multiprocessors.
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory