An Automated Fault Diagnosis System Using Hierarchical Reasoning and Alarm Correlation

Journal of Network and Systems Management

Abstract

The increasing importance of computer networks in this information age demands a high level of network availability and reliability. As we become more dependent on networks in our so-called cyber-world, network faults and downtime become very costly. Sometimes a slight fault may cause critical disruptions or irreparable damage to the network while the network manager is lost among a large number of alarm messages. Therefore, the development of a practical and effective system for network fault diagnosis is an imperative and critical task. In this paper, we develop a hierarchical domain-oriented reasoning mechanism suitable for the delegated management architecture. It is based on the causality graph of a refined network fault propagation model derived from our empirical study. An automated fault diagnosis system called Alarm Correlation View (or ACView), which isolates network faults in a multi-domain environment, is proposed according to the hierarchical reasoning mechanism. This diagnosis system not only automates alarm collection and correlation, but also provides efficient fault localization and identification. Furthermore, an alarm-to-fault mapping strategy is used to enhance the fault reasoning capability under uncertain network fault propagation.

Corresponding author

Correspondence to C. S. Chao.

About this article

Cite this article

Chao, C.S., Yang, D.L. & Liu, A.C. An Automated Fault Diagnosis System Using Hierarchical Reasoning and Alarm Correlation. Journal of Network and Systems Management 9, 183–202 (2001). https://doi.org/10.1023/A:1011315125608

  • DOI: https://doi.org/10.1023/A:1011315125608
