Yemanja—A Layered Fault Localization System for Multi-Domain Computing Utilities

Journal of Network and Systems Management

Abstract

Yemanja is a model-based event correlation engine for multi-layer fault diagnosis. It targets complex propagating fault scenarios and can correlate low-level network events with high-level application performance alerts related to quality-of-service violations. Entity-models, which represent devices or abstract components, encapsulate the behavior of those components. Distantly associated entity-models are not explicitly aware of each other and communicate through internal event chains. Yemanja's state-based engine supports generic scenario definitions, prioritization of alternate solutions, integrated problem and device testing, and simultaneous analysis of overlapping problems. The system of correlation rules was developed from an analysis of device and layer functions and of the dependencies among physical and abstract system components. The primary objectives of this research include the development of reusable, configuration-independent correlation scenarios; adaptability and extensibility of the engine to match the constantly changing topology of a multi-domain server farm; and the development of a concise specification language that is relatively simple yet powerful.
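The idea that entity-models know only their immediate neighbours and reach distant ones through an event chain can be illustrated with a minimal Python sketch. This is not Yemanja's specification language or engine, which the abstract does not show. The class `EntityModel`, the method `localize`, and the entity names are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EntityModel:
    """Hypothetical entity-model: it knows only the entity directly beneath it."""

    name: str
    depends_on: EntityModel | None = None
    faulty: bool = False  # set when a low-level event reports a failure here

    def localize(self) -> list[str]:
        """Follow the dependency chain downward, one neighbour at a time.

        Returns the chain of entity names from this entity down to the
        deepest entity marked faulty, or an empty list if none is faulty.
        """
        if self.depends_on is not None:
            lower = self.depends_on.localize()
            if lower:
                return [self.name] + lower
        return [self.name] if self.faulty else []


# Assumed three-layer topology: application -> host -> switch port.
link = EntityModel("switch-port-7")
host = EntityModel("web-server-1", depends_on=link)
app = EntityModel("storefront-app", depends_on=host)

# A low-level network event marks the port as failed.
link.faulty = True

# A high-level performance alert on the application triggers localization.
print(" -> ".join(app.localize()))
```

In this sketch the application model never references the switch port directly. The explanation for the high-level alert is assembled by each layer querying only the layer below it, which is one simple way to keep models reusable when the topology changes.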

Author information

Correspondence to K. Appleby.

About this article

Cite this article

Appleby, K., Goldszmidt, G. & Steinder, M. Yemanja—A Layered Fault Localization System for Multi-Domain Computing Utilities. Journal of Network and Systems Management 10, 171–194 (2002). https://doi.org/10.1023/A:1015954732370

  • DOI: https://doi.org/10.1023/A:1015954732370
