Abstract
The more complex a computer system is, the more important it is to obtain relevant information about its operational state. In a multi-user environment, there is usually an operator or administrator who is responsible for smooth operation. The operator needs to be aware of any abnormal behavior, e.g. node failures, overload situations, deadlocks (and their avoidance), bottlenecks, or other situations related to availability and performance. To that end, a console is used to inform the operator about the state and the behavior of the machine in a single place.
In a system with a single copy of the operating system (e.g. an SMP system), such a console is a standard feature. An SCI cluster, however, is more complicated. Although it provides physically shared memory and can therefore be considered a NUMA multiprocessor, its “look and feel” to the user is rather that of a collection of autonomous nodes, each running a complete and independent local operating system. Redirecting the console output of the individual nodes to a central terminal is possible but not sufficient, since a node usually simply crashes without sending a message in advance. An operator of an SCI cluster would therefore have to probe the nodes to make sure that all of them are up and running.
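The probing idea mentioned above can be illustrated with a minimal sketch. It is not the tool described in this chapter: the node names, the port number, and the use of a plain TCP connection attempt as the liveness test are all assumptions made for illustration.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Node = Tuple[str, int]  # (hostname, TCP port); hypothetical values, not from the chapter


def tcp_alive(node: Node, timeout: float = 0.5) -> bool:
    """Assumed liveness test: the node counts as up if a TCP connect succeeds."""
    try:
        with socket.create_connection(node, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def probe_nodes(nodes: Iterable[Node],
                check: Callable[[Node], bool] = tcp_alive) -> Dict[Node, bool]:
    """Actively ask every node for its state instead of waiting for a message."""
    return {node: check(node) for node in nodes}
```

The check function is a parameter so that the probing loop stays independent of how liveness is actually determined; a crashed node is detected because it fails the check, not because it reported anything.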
In addition to the operational states of the nodes, the operator also wants more detailed information about the utilization and performance of the system, since any anomaly in system behavior may indicate a situation that needs human intervention. A component that provides this kind of information is usually called a monitor. A monitor observes the system by sampling relevant system measures, such as utilization, throughput, and other quantities, and makes these measurements available for on-line or off-line analysis. In a multi-programming environment, it should be possible to attribute the measured quantities to the individual programs, offering some insight into their behavior. By providing this functionality, a monitor can help the programmer debug a parallel program or reveal design flaws that lead to poor performance. The monitor provides a global bird’s-eye view of the system, which is usually not available in a distributed system.
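As a rough illustration of sampling with per-program attribution, the following sketch records timestamped samples and aggregates them by program. The class name `Monitor`, the `read_measure` callback, and the choice of a mean as the aggregate are hypothetical and are not taken from the tool presented in this chapter.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, str, float]  # (timestamp, program name, measured value)


class Monitor:
    """Hypothetical sampling monitor; read_measure returns one value per program."""

    def __init__(self, read_measure: Callable[[], Dict[str, float]],
                 clock: Callable[[], float] = time.time) -> None:
        self._read = read_measure
        self._clock = clock
        self.samples: List[Sample] = []  # kept for on-line or off-line analysis

    def sample_once(self) -> None:
        """Take one sample and tag each value with the program it belongs to."""
        now = self._clock()
        for program, value in self._read().items():
            self.samples.append((now, program, value))

    def mean_by_program(self) -> Dict[str, float]:
        """Attribute the measured quantity to individual programs."""
        values = defaultdict(list)
        for _, program, value in self.samples:
            values[program].append(value)
        return {p: sum(v) / len(v) for p, v in values.items()}
```

In practice `sample_once` would be called periodically; the stored samples can then be inspected while the system runs or written out for later analysis.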
In the following, we present a monitoring tool that has been developed for SCI cluster computers. Section 25.2 gives an overview of its general structure. Sections 25.3 – 25.5 describe the major components and the way they interact. A short conclusion in Section 25.6 closes this chapter.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Maier-Stahel, M., Butenuth, R., Heiss, HU. (1999). Monitoring SCI Clusters. In: Hellwagner, H., Reinefeld, A. (eds) SCI: Scalable Coherent Interface. Lecture Notes in Computer Science, vol 1734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/10704208_33
DOI: https://doi.org/10.1007/10704208_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66696-7
Online ISBN: 978-3-540-47048-9
eBook Packages: Springer Book Archive