Run-Time Root Cause Analysis in Adaptive Distributed Systems

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2013 Workshops (OTM 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8186)

Abstract

In a distributed environment, several components collaborate to provide complex functionality. Adaptation is an emerging trend in distributed systems: the system reconfigures itself by adding, removing, or updating components in order to cope with faults. Components are generally interdependent, so a fault propagates from one component to another. Existing root cause analysis techniques generally create a static fault dependency graph to identify the root fault. However, these dependencies keep changing as the system adapts, which makes design-time fault dependencies invalid at run-time. This paper describes the problem of deriving causal relationships of faults in adaptive distributed systems. It then presents a statechart-based solution that statically identifies the sequence of method executions in order to derive the causal relationships of faults at run-time. The evaluation indicates that the approach is highly scalable and time-efficient, and that it can be used to reduce the Mean Time To Recover (MTTR) of a distributed system.
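
The problem the abstract describes, a fault dependency graph that goes stale after an adaptation, can be illustrated with a small sketch. This is not the paper's statechart-based algorithm, which the abstract does not detail. It is a minimal illustration that assumes a simple caller-to-callee dependency model, and the class `DependencyGraph`, its methods, and the component names are all hypothetical.

```python
from collections import defaultdict


class DependencyGraph:
    """Run-time view of component dependencies (hypothetical model)."""

    def __init__(self):
        # Maps each component to the set of components it calls.
        self.deps = defaultdict(set)

    def add_dependency(self, caller, callee):
        self.deps[caller].add(callee)

    def remove_component(self, name):
        # Drop the component and every edge that points at it.
        self.deps.pop(name, None)
        for callees in self.deps.values():
            callees.discard(name)

    def root_faults(self, faulty):
        """Return the faulty components that have no faulty dependency.

        Assumes faults propagate from callee to caller and that the
        faulty components do not form a dependency cycle.
        """
        faulty = set(faulty)
        return {c for c in faulty if not (self.deps.get(c, set()) & faulty)}


# Design-time configuration: Frontend -> Orders -> PaymentA.
graph = DependencyGraph()
graph.add_dependency("Frontend", "Orders")
graph.add_dependency("Orders", "PaymentA")
before = graph.root_faults({"Frontend", "Orders", "PaymentA"})

# Adaptation: PaymentA is replaced by PaymentB at run-time.
graph.remove_component("PaymentA")
graph.add_dependency("Orders", "PaymentB")
after = graph.root_faults({"Frontend", "Orders"})

print(before, after)
```

In this sketch the root fault moves from `PaymentA` to `Orders` once the graph is updated to reflect the adaptation. A graph frozen at design time would still contain the removed `PaymentA` component and the outdated edge to it.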




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Raj, A., Barrett, S., Clarke, S. (2013). Run-Time Root Cause Analysis in Adaptive Distributed Systems. In: Demey, Y.T., Panetto, H. (eds) On the Move to Meaningful Internet Systems: OTM 2013 Workshops. OTM 2013. Lecture Notes in Computer Science, vol 8186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41033-8_38


  • DOI: https://doi.org/10.1007/978-3-642-41033-8_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41032-1

  • Online ISBN: 978-3-642-41033-8

  • eBook Packages: Computer Science, Computer Science (R0)
