Monitoring SCI Clusters

Maier-Stahel, Matthias; Butenuth, Roger; Heiss, Hans-Ulrich

doi:10.1007/10704208_33


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1734)

Abstract

The more complex a computer system is, the more important it is to get relevant information about its operational state. In a multi-user environment, there is usually an operator or administrator who is responsible for smooth operation. He or she needs to be aware of any abnormal behavior, such as node failures, overload situations, deadlocks, bottlenecks, or other situations related to availability and performance. To that end, a console informs the operator about the state and behavior of the machine in one single place.

In a system with a single copy of the operating system (e.g. an SMP system), such a console is a standard feature. An SCI cluster, however, is more complicated. Although it provides physically shared memory and can therefore be considered a NUMA multiprocessor, its "look and feel" to the user is rather that of a collection of autonomous nodes, each running a complete and independent local operating system. Redirecting the console output of the individual nodes to a central terminal is possible but not sufficient, since a node usually simply crashes without sending a message in advance. An operator of an SCI cluster would have to probe the nodes to make sure that all of them are up and running.
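The probing idea can be illustrated with a minimal sketch. It is not the tool described in the chapter: the class `LivenessTable`, its `timeout` parameter, and the node names are hypothetical, and timestamps are passed in explicitly so the logic can be followed without a network. A node counts as down when no probe reply has arrived within the timeout, which covers the case of a node crashing silently.

```python
from dataclasses import dataclass, field


@dataclass
class LivenessTable:
    """Tracks when each node last answered a probe (hypothetical helper)."""
    timeout: float  # seconds without a reply before a node counts as down
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_reply(self, node: str, now: float) -> None:
        # Called whenever a probe reply from `node` arrives at time `now`.
        self.last_seen[node] = now

    def down_nodes(self, nodes: list[str], now: float) -> list[str]:
        # A node that never replied, or replied too long ago, is reported as down.
        return [
            n for n in nodes
            if now - self.last_seen.get(n, float("-inf")) > self.timeout
        ]
```

In a real deployment the replies would come from periodic probes sent over the network, and the timeout would have to be chosen against the probe interval.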

In addition to the operational states of the nodes, the operator also wants more detailed information about the utilization and performance of the system, since any anomaly in system behavior may indicate a situation that needs human intervention. A component that provides this kind of information is usually called a monitor. A monitor observes the system by sampling relevant system measures, such as utilization, throughput, and other quantities, and makes these measurements available for on-line or off-line analysis. In a multi-programming environment, it should be possible to attribute the measured quantities to the individual programs, offering some insight into their behavior. By providing this functionality, a monitor can help the programmer debug a parallel program or reveal design flaws that lead to poor performance. The monitor provides a global bird's-eye view of the system, which is usually not available in a distributed system.
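As a rough illustration of attributing sampled measures to individual programs, the following sketch averages CPU utilization samples per program across nodes. The record layout (`Sample`), the field names, and the function `mean_util_per_program` are assumptions made for this example and are not taken from the chapter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One measurement taken by the monitor (hypothetical record layout)."""
    time: float      # sampling instant in seconds
    node: str        # node on which the sample was taken
    program: str     # program the measurement is attributed to
    cpu_util: float  # fraction of one CPU, 0.0 to 1.0


def mean_util_per_program(samples):
    """Average the sampled CPU utilization per program over all nodes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for s in samples:
        totals[s.program] += s.cpu_util
        counts[s.program] += 1
    return {program: totals[program] / counts[program] for program in totals}
```

The same grouping could be keyed by node instead of program to obtain a per-node view for the operator.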

In the following, we present a monitoring tool that has been developed for SCI cluster computers. Section 25.2 gives an overview of its general structure. Sections 25.3 – 25.5 describe the major components and the way they interact. A short conclusion in Section 25.6 closes this chapter.



Author information

Authors and Affiliations

University of Paderborn, Germany
Matthias Maier-Stahel, Roger Butenuth & Hans-Ulrich Heiss


Editor information

Editors and Affiliations

Department of Business Informatics & Application Systems, Department of Information Technology, University Klagenfurt, P.O. Box, 9020, Austria
Hermann Hellwagner
Zuse Institute Berlin
Alexander Reinefeld


About this chapter

Cite this chapter

Maier-Stahel, M., Butenuth, R., Heiss, H.-U. (1999). Monitoring SCI Clusters. In: Hellwagner, H., Reinefeld, A. (eds) SCI: Scalable Coherent Interface. Lecture Notes in Computer Science, vol 1734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10704208_33


DOI: https://doi.org/10.1007/10704208_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66696-7
Online ISBN: 978-3-540-47048-9
eBook Packages: Springer Book Archive
