Abstract
Distributed systems in enterprises as well as telecommunication environments strongly demand more automated fault management. A single fault in these complex systems might cause a huge number of symptomatic error messages and side effects to occur. The common root faults for these symptoms have to be identified to start fault removal procedures as soon as possible and to decrease system down-time. This paper presents a methodology for fault isolation in integrated management systems. A generic model is described that unifies the view of the management system on the managed environment. It integrates the relevant aspects of network, system, and service management layers in order to perform integrated fault isolation. Our approach is based on a general dependency graph model. It captures the information that is required to determine the root cause of a fault on the one hand, and the set of fault-affected services and customers on the other hand. The layered TMN architecture serves as an example for an integrated management environment throughout this paper.
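The two uses of a dependency graph named in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the class `DependencyGraph`, its methods, and all object names (`router_X`, `mail_service`, `customer_A`, and so on) are hypothetical. It assumes that a root cause candidate is any object on which every symptomatic object transitively depends, and that the affected set is everything that transitively depends on the faulty object.

```python
from collections import defaultdict


class DependencyGraph:
    """Toy dependency graph; an illustrative assumption, not the paper's model."""

    def __init__(self):
        self.depends_on = defaultdict(set)  # object -> objects it depends on
        self.dependents = defaultdict(set)  # object -> objects depending on it

    def add_dependency(self, dependent, antecedent):
        self.depends_on[dependent].add(antecedent)
        self.dependents[antecedent].add(dependent)

    @staticmethod
    def _reach(start, edges):
        # Depth-first traversal; returns start plus everything reachable.
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def root_cause_candidates(self, symptoms):
        # Objects that every symptomatic object depends on (or is itself).
        reach_sets = [self._reach(s, self.depends_on) for s in symptoms]
        return set.intersection(*reach_sets) if reach_sets else set()

    def affected(self, fault):
        # Everything that transitively depends on the faulty object.
        return self._reach(fault, self.dependents) - {fault}


# Hypothetical layered environment: customers -> services -> systems -> network.
graph = DependencyGraph()
graph.add_dependency("customer_A", "mail_service")
graph.add_dependency("customer_B", "web_service")
graph.add_dependency("customer_C", "file_service")
graph.add_dependency("mail_service", "server_1")
graph.add_dependency("web_service", "server_2")
graph.add_dependency("file_service", "server_3")
graph.add_dependency("server_1", "router_X")
graph.add_dependency("server_2", "router_X")
graph.add_dependency("server_3", "router_Y")

# Two services report symptoms; their only shared dependency is router_X.
candidates = graph.root_cause_candidates(["mail_service", "web_service"])
impacted = graph.affected("router_X")
print(sorted(candidates))
print(sorted(impacted))
```

In this sketch the downward traversal narrows many symptoms to a shared antecedent, while the upward traversal from that antecedent lists the services and customers that would be affected. A real management system would also need to weigh candidates and confirm them, for example by testing the suspected objects.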
Kätker, S., Geihs, K. A Generic Model for Fault Isolation in Integrated Management Systems. Journal of Network and Systems Management 5, 109–130 (1997). https://doi.org/10.1023/A:1018766610444