Abstract
Scalable shared memory multiprocessors are promising architectures for achieving teraflops computational power. Because they contain a large number of processors and memory elements, such machines have a high probability of failure. In this paper, we investigate an approach based on backward error recovery to provide a highly available, scalable shared memory architecture that tolerates transient and permanent processor and memory failures.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Banâtre, M., Gefflaut, A., Morin, C. (1994). Scalable shared memory multiprocessors: Some ideas to make them reliable. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020021
DOI: https://doi.org/10.1007/BFb0020021
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive