Scalable shared memory multiprocessors: Some ideas to make them reliable

  • Hardware Architectures for Fault Tolerance
  • Conference paper
Hardware and Software Architectures for Fault Tolerance (Fault Tolerance 1993)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Abstract

Scalable shared memory multiprocessors are promising architectures to achieve teraflops computational power. As they contain a large number of processor and memory elements, such machines have a high probability of failure. In this paper, we investigate an approach based on backward error recovery to provide a highly available scalable shared memory architecture tolerating transient and permanent processor and memory failures.

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Banâtre, M., Gefflaut, A., Morin, C. (1994). Scalable shared memory multiprocessors: Some ideas to make them reliable. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020021

  • DOI: https://doi.org/10.1007/BFb0020021

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
