Yemanja—A Layered Fault Localization System for Multi-Domain Computing Utilities

Appleby, K.; Goldszmidt, G.; Steinder, M.

doi:10.1023/A:1015954732370


Published: June 2002

Journal of Network and Systems Management, Volume 10, pages 171–194 (2002)



Abstract

Yemanja is a model-based event correlation engine for multi-layer fault diagnosis. It targets complex propagating fault scenarios and can correlate low-level network events with high-level application performance alerts related to quality-of-service violations. Entity-models, which represent devices or abstract components, encapsulate the behavior of those components. Distantly associated entity-models are not explicitly aware of each other and communicate through internal event chains. Yemanja's state-based engine supports generic scenario definitions, prioritization of alternate solutions, integrated problem and device testing, and simultaneous analysis of overlapping problems. The system of correlation rules was developed from an analysis of device and layer functions and of the dependencies among physical and abstract system components. The primary objectives of this research are the development of reusable, configuration-independent correlation scenarios; adaptability and extensibility of the engine to match the constantly changing topology of a multi-domain server farm; and the development of a concise specification language that is relatively simple yet powerful.
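The idea of entity-models that know only their immediate neighbors and communicate through internal event chains can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use Yemanja's specification language; the class names (`Event`, `EntityModel`), the event names, and the three-entity topology are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    name: str                      # hypothetical event name, e.g. "link_down"
    source: str                    # entity (or external monitor) that emitted it
    cause: Optional[Event] = None  # event that triggered this one, if any


@dataclass
class EntityModel:
    """Hypothetical entity-model: it knows only its direct dependents."""
    name: str
    reacts_to: dict[str, str]      # incoming event name -> emitted event name
    dependents: list[EntityModel] = field(default_factory=list)

    def handle(self, event: Event, log: list[Event]) -> None:
        emitted_name = self.reacts_to.get(event.name)
        if emitted_name is None:
            return  # this entity's model does not react to the event
        emitted = Event(emitted_name, self.name, cause=event)
        log.append(emitted)
        for dependent in self.dependents:
            dependent.handle(emitted, log)


def root_cause(event: Event) -> Event:
    """Follow the cause links back to the originating low-level event."""
    while event.cause is not None:
        event = event.cause
    return event


# Assumed topology: switch -> server -> web application.
app = EntityModel("web_app", {"server_unreachable": "qos_violation"})
server = EntityModel("server", {"port_down": "server_unreachable"}, [app])
switch = EntityModel("switch", {"link_down": "port_down"}, [server])

log: list[Event] = []
switch.handle(Event("link_down", "network"), log)
symptom = log[-1]
print(symptom.name, "<-", root_cause(symptom).name)  # qos_violation <- link_down
```

In this sketch the web application model never refers to the switch model; the high-level alert is linked to the low-level network event only through the chain of intermediate events, which is the property the abstract attributes to distantly associated entity-models.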



Author information

Authors and Affiliations

IBM T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, New York, 10532
K. Appleby & G. Goldszmidt
Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716
M. Steinder


Corresponding author

Correspondence to K. Appleby.


About this article

Cite this article

Appleby, K., Goldszmidt, G. & Steinder, M. Yemanja—A Layered Fault Localization System for Multi-Domain Computing Utilities. Journal of Network and Systems Management 10, 171–194 (2002). https://doi.org/10.1023/A:1015954732370


Issue Date: June 2002
DOI: https://doi.org/10.1023/A:1015954732370
