Monitoring SCI Clusters

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1734)

Abstract

The more complex a computer system is, the more important it is to obtain relevant information about its operational state. In a multi-user environment, there is usually an operator or administrator who is responsible for smooth operation. He or she needs to be aware of any abnormal behavior, e.g. node failures, overload situations, deadlocks (and their avoidance), bottlenecks, or other situations related to availability and performance. To that end, a console informs the operator about the state and behavior of the machine in one single place.

In a system with a single copy of the operating system (e.g. an SMP system), such a console is a standard feature. An SCI cluster, however, is more complicated. Although it provides physically shared memory and can therefore be considered a NUMA multiprocessor, its "look and feel" to the user is rather that of a collection of autonomous nodes, each running a complete and independent local operating system. Redirecting the console output of the individual nodes to a central terminal is possible but not sufficient, since a node usually simply crashes without sending a message in advance. An operator of an SCI cluster would have to probe the nodes to make sure that all of them are up and running.
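The idea of actively probing nodes, rather than waiting for a crash message that never arrives, can be sketched as follows. This is only an illustration, not the tool described in this chapter: the function names, the use of a TCP connection attempt as the liveness test, and the default port are all assumptions made for the example.

```python
import socket
from typing import Callable, Dict, Iterable, List


def tcp_probe(host: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Hypothetical liveness test: can we open a TCP connection in time?

    The port and timeout are illustrative defaults, not values from the chapter.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_cluster(
    nodes: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Dict[str, bool]:
    """Probe every node once and record whether it answered."""
    return {node: probe(node) for node in nodes}


def failed_nodes(status: Dict[str, bool]) -> List[str]:
    """Return the names of all nodes that did not answer, sorted."""
    return sorted(node for node, up in status.items() if not up)
```

Passing the probe as a parameter keeps the polling logic independent of how liveness is actually tested, so a cluster-specific check could be substituted for the TCP attempt.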

In addition to the operational states of the nodes, the operator also wants more detailed information about the utilization and performance of the system, since any anomaly in system behavior may indicate a situation that needs human intervention. A component that provides this kind of information is usually called a monitor. A monitor observes the system by sampling relevant system measures, such as utilization and throughput, and makes these measurements available for on-line or off-line analysis. In a multi-programming environment, it should be possible to attribute the measured quantities to the individual programs, offering some insight into their behavior. By providing this functionality, a monitor can help the programmer debug a parallel program or reveal design flaws that lead to poor performance. The monitor provides a global bird's-eye view of the system, which is usually not available in a distributed system.
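A minimal sketch of what attributing sampled measures to nodes and to programs could look like is given below. The `Sample` record, its fields, and the aggregation functions are hypothetical and chosen only for illustration; the chapter's monitor may collect and organize its data quite differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Sample:
    """One hypothetical measurement taken on one node for one program."""
    node: str
    program: str
    cpu_utilization: float  # fraction of one CPU used during the interval


def utilization_by_node(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum utilization per node: the per-node view of the cluster."""
    totals: Dict[str, float] = defaultdict(float)
    for s in samples:
        totals[s.node] += s.cpu_utilization
    return dict(totals)


def utilization_by_program(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum utilization per program across all nodes it runs on."""
    totals: Dict[str, float] = defaultdict(float)
    for s in samples:
        totals[s.program] += s.cpu_utilization
    return dict(totals)
```

The same set of samples yields both views: grouping by node shows where the load is, while grouping by program shows which parallel application causes it.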

In the following, we present a monitoring tool that has been developed for SCI cluster computers. Section 25.2 gives an overview of its general structure. Sections 25.3 – 25.5 describe the major components and the way they interact. A short conclusion in Section 25.6 closes this chapter.

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Maier-Stahel, M., Butenuth, R., Heiss, HU. (1999). Monitoring SCI Clusters. In: Hellwagner, H., Reinefeld, A. (eds) SCI: Scalable Coherent Interface. Lecture Notes in Computer Science, vol 1734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10704208_33

  • DOI: https://doi.org/10.1007/10704208_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66696-7

  • Online ISBN: 978-3-540-47048-9

  • eBook Packages: Springer Book Archive
