Deliberative, search-based mitigation strategies for model-based software health management

  • SI:SwHM
Innovations in Systems and Software Engineering

Abstract

Rising software complexity makes it very difficult to analyze aerospace systems and prepare them for all possible fault scenarios at design time; therefore, classical run-time fault tolerance techniques such as self-checking pairs and triple modular redundancy are used. However, several recent incidents have made it clear that existing software fault tolerance techniques alone are not sufficient. To improve system dependability, simpler, yet formally specified and verified, run-time monitoring, diagnosis, and fault mitigation capabilities are needed. Such architectures are already in use for managing the health of vehicles and systems. Software health management is the application of these techniques to software systems. In this paper, we briefly describe the software health management techniques and architecture developed by our research group. The foundation of the architecture is a real-time component framework (built upon ARINC-653 platform services) that defines a model of computation for software components. Two dedicated architectural elements, the Component Level Health Manager (CLHM) and the System Level Health Manager (SLHM), provide the health management services: anomaly detection, fault source isolation, and fault mitigation. The SLHM includes a diagnosis engine that (1) uses a Timed Failure Propagation Graph (TFPG) model derived from the component assembly model, (2) reasons about cascading fault effects in the system, and (3) isolates the fault source component(s). Thereafter, the appropriate system-level mitigation action is taken. The main focus of this article is the description of the fault mitigation architecture, which uses goal-based deliberative reasoning to determine the best mitigation actions for recovering the system from the identified failure mode.
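
The three diagnosis steps named in the abstract can be illustrated with a toy example. The Python sketch below is not the authors' diagnosis engine. It assumes a small acyclic failure propagation graph in which the node names, the edge delays, and the functions `reachable_windows` and `isolate` are all hypothetical, and it ignores aspects of full TFPG reasoning such as operating modes, missing alarms, and false alarms. It only shows the basic idea: propagate timing windows from each candidate failure mode and keep the candidates whose windows are consistent with every observed alarm.

```python
from collections import deque

# Hypothetical graph: node -> list of (child, min_delay, max_delay), delays in ms.
# "FM_*" nodes are failure modes, "D_*" nodes are observable discrepancies.
EDGES = {
    "FM_sensor_stale": [("D_sensor_timeout", 0, 5)],
    "D_sensor_timeout": [("D_filter_bad_input", 1, 10)],
    "FM_filter_crash": [("D_filter_bad_input", 0, 2)],
    "D_filter_bad_input": [("D_display_invalid", 2, 20)],
}
FAILURE_MODES = ["FM_sensor_stale", "FM_filter_crash"]


def reachable_windows(source):
    """Map each reachable node to (earliest, latest) activation time
    relative to the onset of `source`. Assumes the graph is acyclic."""
    windows = {source: (0, 0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        lo, hi = windows[node]
        for child, dmin, dmax in EDGES.get(node, []):
            cand = (lo + dmin, hi + dmax)
            old = windows.get(child)
            new = cand if old is None else (min(old[0], cand[0]),
                                            max(old[1], cand[1]))
            if new != old:
                windows[child] = new
                queue.append(child)
    return windows


def isolate(alarms):
    """Return the failure modes that can explain every alarm.
    `alarms` maps a discrepancy name to its observed timestamp (ms)."""
    hypotheses = []
    for fm in FAILURE_MODES:
        win = reachable_windows(fm)
        if not all(name in win for name in alarms):
            continue  # some alarm is not reachable from this failure mode
        # The fault onset t0 must satisfy t - latest <= t0 <= t - earliest
        # for every alarm; the hypothesis survives if such a t0 exists.
        lower = max(t - win[name][1] for name, t in alarms.items())
        upper = min(t - win[name][0] for name, t in alarms.items())
        if lower <= upper:
            hypotheses.append(fm)
    return hypotheses
```

In this sketch, an alarm on `D_sensor_timeout` followed 5 ms later by one on `D_filter_bad_input` is explained only by `FM_sensor_stale`, while a lone `D_filter_bad_input` alarm leaves both failure modes as candidates. A system-level mitigation step would then act on the surviving hypotheses.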

Notes

  1. An ARINC-653 process is a unit of concurrency that is analogous to a thread in a desktop operating system such as Linux.

  2. An interface is a collection of related methods.

  3. In the case of a facet (provided port), each method in the interface supported by the facet is assigned a dedicated ARINC-653 process.

  4. The modeling environment and the Linux runtime are available from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page.

  5. Note: In ACM, a component does not have multiple publishers of the same kind. Hence, at most one publisher of a component services a specific consumer port of another component.

  6. See technical report [32].

  7. http://www.msoos.org/cryptominisat2/.

  8. See technical report [17] for a detailed discussion.

  9. http://www.msoos.org/cryptominisat2/, v2.9.1.

  10. http://minisat.se/MiniSat+.html.

References

  1. Abdelwahed S, Karsai G, Mahadevan N, Ofsthun SC (2009) Practical considerations in systems diagnosis using timed failure propagation graph models. Instrum Meas IEEE Trans 58(2):240–247

  2. ARINC (2010) ARINC specification 653p1-3: Avionics application software standard interface part 1 - required services. https://www.arinc.com/

  3. Australian Transport Safety Bureau (2005) In-flight upset; 240km NW Perth, WA; Boeing Co 777–200, 9M-MRG. Tech. rep., http://www.atsb.gov.au/publications/investigation_reports/2005/aair/aair200503722.aspx

  4. Australian Transport Safety Bureau (2008) AO-2008-070: In-flight upset, 154 km west of Learmonth, WA, 7 October 2008, VH-QPA, Airbus A330–303. Tech. rep., http://www.atsb.gov.au/publications/investigation_reports/2008/aair/ao-2008-070.aspx

  5. Bailleux O, Boufkhad Y (2003) Efficient CNF encoding of boolean cardinality constraints. In: Principles and practice of constraint programming-9th international conference (CP 2003), pp 108–122

  6. Barry M (2008) http://www.kestreltechnology.com/downloads/FailsafeOverview.pdf

  7. Bengtsson J, Larsen K, Larsson F, Pettersson P, Yi W (1996) UPPAAL: a tool suite for automatic verification of real-time systems. In: Proceedings of the DIMACS/SYCON workshop on Hybrid systems III— verification and control, Springer-Verlag New York, Inc., Secaucus, pp 232–243

  8. Bustard DW, Sterritt R (2006) A requirements engineering perspective on autonomic systems development. Autonomic computing. Concepts, infrastructure, and applications, pp 19–33

  9. Butler R (2008) A primer on architectural level fault tolerance. Tech. rep., NASA scientific and technical information (STI) Program Office, Report No. NASA/TM-2008-215108, available at http://shemesh.larc.nasa.gov/fm/papers/Butler-TM-2008-215108-Primer-FT.pdf

  10. Charette RN (2009) This car runs on code. IEEE Spectrum 46(3):3 http://www.spectrum.ieee.org/feb09/7649

  11. Cheng BH (2009) Software engineering for self-adaptive systems. In: Chap software engineering for self-adaptive systems: a research roadmap. Springer-Verlag, Berlin, Heidelberg, pp 1–26, doi:10.1007/978-3-642-02161-9_1

  12. Conmy P, McDermid J, Nicholson M (2002) Safety analysis and certification of open distributed systems. In: International system safety conference, Denver

  13. Dashofy EM, van der Hoek A, Taylor RN (2002) Towards architecture-based self-healing systems. In: WOSS ’02: Proceedings of the first workshop on Self-healing systems, ACM Press, New York, pp 21–26, doi:10.1145/582128.582133

  14. Dubey A, Karsai G, Mahadevan N (2011) A component model for hard real-time systems: CCM with ARINC-653. Softw Pract Exp 41(12):1517–1550. doi:10.1002/spe.1083

  15. Dubey A, Karsai G, Mahadevan N (2011) Model-based software health management for real-time systems. In: Aerospace conference, 2011 IEEE, IEEE, pp 1–18

  16. Dubey A, Mahadevan N, Karsai G (2012) A deliberative reasoner for model-based software health management. In: The eighth international conference on autonomic and autonomous systems, doi:10.1109/ISORC.2010.39

  17. Dubey A, Mahadevan N, Karsai G (2012) The inertial measurement unit example: a software health management case study. Tech. Rep. ISIS-12-101, Institute for Software Integrated Systems, Vanderbilt University, http://www.isis.vanderbilt.edu/sites/default/files/TechReport_IMU.pdf

  18. Dubey A, Karsai G, Mahadevan N (2013) Fault-adaptivity in hard real-time component based systems. In: de Lemos R, Giese H, Muller HA, Shaw M (eds) Software engineering for self-adaptive systems II, no. 7475 in Lecture Notes in Computer Science, Springer-Verlag, Berlin, pp 294–323

  19. Eén N, Sörensson N (2003) An extensible SAT-solver. In: Theory and applications of satisfiability testing, 6th international conference (SAT 2003), pp 502–518

  20. Eén N, Sörensson N (2006) Translating pseudo-boolean constraints into SAT. JSAT 2(1–4):1–26

  21. Garlan D, Cheng SW, Schmerl B (2003) Architecting dependable systems. In: Chap increasing system dependability through architecture-based self-repair. Springer-Verlag, Berlin, pp 61–89, http://dl.acm.org/citation.cfm?id=1768179.1768183

  22. Goldberg A, Horvath G (2007) Software fault protection with ARINC 653. In: Proceeding of IEEE aerospace conference, Montana, pp 1–11

  23. Greenwell WS, Knight JC (2003) What should aviation safety incidents teach us? Technical report, University of Virginia. http://dependability.cs.virginia.edu/publications/safecomp.2003.lessons.pdf

  24. Jagadeesan LJ, Viswanathan R (2005) Passive mid-stream monitoring of real-time properties. In: EMSOFT ’05: Proceedings of the 5th ACM international conference on Embedded software, ACM, New York, pp 343–352, doi:10.1145/1086228.1086291

  25. Johnson SB, Gormley TJ, Kessler SS, Mott CD, Patterson-Hine A, Reichard KM, Scandura PA (2011) System health management: with aerospace applications. Wiley, New York

  26. Laprie JC (1995) Dependable computing and fault tolerance: concepts and terminology. In: Proceeding of twenty-fifth international symposium on fault-tolerant computing, ‘Highlights from Twenty-Five Years’, p 2, http://ieeexplore.ieee.org/iel3/3846/11214/00532603.pdf?arnumber=532603

  27. Laprie JC, Arlat J, Béounes C, Kanoun K (1995) Architectural issues in software fault-tolerance, chapter 2. Software Fault Tolerance http://www.cse.cuhk.edu.hk/lyu/book/sft/pdf/chap3.pdf

  28. Lightstone S (2007) Seven software engineering principles for autonomic computing development. ISSE 3(1):71–74

  29. Lyu MR (1995) Software fault tolerance, Wiley, New York http://www.cse.cuhk.edu.hk/lyu/book/sft/

  30. Lyu MR (2007) Software reliability engineering: a roadmap. In: 2007 Future of software engineering, IEEE computer society, FOSE ’07, Washington, pp 153–170. doi:10.1109/FOSE.2007.24

  31. Mahadevan N, Dubey A, Karsai G (2011) Application of software health management techniques. In: Proceedings of the 6th international symposium on software engineering for adaptive and self-managing systems, SEAMS ’11, ACM, New York, pp 1–10. doi:10.1145/1988008.1988010

  32. Mahadevan N, Dubey A, Balasubramanian D, Karsai G (2013) Deliberative reasoning in software health management. Tech. Rep. ISIS-13-111, Institute for Software Integrated Systems, Vanderbilt University, http://www.isis.vanderbilt.edu/sites/default/files/TechReport2013.pdf

  33. Marques-Silva J, Lynce I (2007) Towards robust CNF encodings of cardinality constraints. In: Bessière C (ed) Proceedings of 13th international conference on principles and practice of constraint programming (CP2007), LNCS, vol 4741. Springer, Heidelberg, pp 483–497

  34. Mcintyre MDW, Sebring DL (1994) Integrated fault-tolerant air data inertial reference system. Future of Software Engineering, pp 153–170

  35. Potocti de Montalk J (1991) Computer software in civil aircraft. In: Proceedings. IEEE/AIAA 10th digital avionics systems conference, 1991. pp 324–330, doi:10.1109/DASC.1991.177187

  36. de Moura LM, Bjørner N (2008) Z3: An efficient SMT solver. In: Tools and algorithms for the construction and analysis of systems (TACAS), New York, pp 337–340

  37. NASA (2000) Report on the loss of the Mars Polar Lander and Deep Space 2 missions. Tech. rep., NASA, ftp://ftp.hq.nasa.gov/pub/pao/reports/2000/2000_mpl_report_1.pdf

  38. Nicholson M (2007) Health monitoring for reconfigurable integrated control systems. In: Constituents of modern system safety thinking proceedings of the thirteenth safety-critical systems symposium, vol 5, pp 149–162

  39. Ofsthun S (2002) Integrated vehicle health management for aerospace platforms. Instrum Meas Mag IEEE 5(3):21–24. doi:10.1109/MIM.2002.1028368

  40. Pike L, Goodloe A, Morisset R, Niller S (2010) Copilot: a hard real-time runtime monitor. In: Runtime verification, Springer, pp 345–359

  41. Pullum LL (2001) Software fault tolerance techniques and implementation. Artech House, Inc., Norwood

  42. Robertson P, Williams B (2006) Automatic recovery from software failure. Commun ACM 49(3):41–47. doi:10.1145/1118178.1118200

  43. Rohr M, Boskovic M, Giesecke S, Hasselbring W (2006) Models in software engineering, workshops, and symposia at models 2006. In: Proceedings of the workshop “Models@run.time” at the 9th international conference on model driven engineering languages and systems (MoDELS/UML’06), vol 4364. University of Massachusetts, Boston

  44. Sammapun U, Lee I, Sokolsky O (2005) RT-MaC: runtime monitoring and checking of quantitative and probabilistic properties. In: Proceeding of 11th IEEE international conference on embedded and real-time computing systems and applications, pp 147–153. doi:10.1109/RTCSA.2005.84

  45. Schumann J, Srivastava AN, Mengshoel OJ (2010) Who guards the guardians?: toward V&V of health management software. In: Proceedings of the First international conference on Runtime verification, RV’10, Springer-Verlag, Heidelberg, pp 399–404. http://dl.acm.org/citation.cfm?id=1939399.1939432

  46. Sha L (2006) The complexity challenge in modern avionics software. In: National Workshop on aviation software systems, design for certifiably dependable systems, Alexandria

  47. Shaw M (2002) “self-healing”: softening precision to avoid brittleness: position paper for woss ’02: workshop on self-healing systems. In: WOSS ’02: Proceedings of the first workshop on self-healing systems, ACM Press, New York, pp 111–114. doi:10.1145/582128.582152

  48. Sheffels M (1992) A fault-tolerant air data/inertial reference unit. In: Proceedings of IEEE/AIAA 11th digital avionics systems conference, 1992, pp 127–131, doi:10.1109/DASC.1992.282171

  49. Srivastava A, Schumann J (2011) The case for software health management. In: Fourth IEEE international conference on space mission challenges for information technology, 2011. SMC-IT 2011, pp 3–9

  50. Taleb-Bendiab A, Bustard DW, Sterritt R, Laws AG, Keenan F (2005) Model-based self-managing systems engineering. In: DEXA workshops, pp 155–159

  51. Torres-pomales W (2000) Software fault tolerance: a tutorial. Tech. rep., NASA, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.8307

  52. Tseitin GS (1968) On the complexity of derivations in the propositional calculus. Stud Math Math Log Part II:115–125

  53. Wang N, Schmidt DC, O’Ryan C (2001) Overview of the CORBA component model. In: Component-based software engineering: putting the pieces together, Addison-Wesley Longman Publishing Co., Inc., Boston, pp 557–571

  54. Wang S, Ayoub A, Sokolsky O, Lee I (2012) Runtime verification of traces under recording uncertainty. In: Proceedings of the second international conference on runtime verification, RV’11. Springer-Verlag, Berlin, pp 442–456. doi:10.1007/978-3-642-29860-8_35

  55. Williams B, Ingham M, Chung S, Elliott P (2003) Model-based programming of intelligent embedded systems and robotic space explorers. Proc IEEE 91(1):212–237. doi:10.1109/JPROC.2002.805828

  56. Williams BC, Ingham M, Chung S, Elliott P, Hofbaur M, Sullivan GT (2004) Model-based programming of fault-aware systems. AI Mag 24(4):61–75

  57. Zhang J, Cheng BHC (2005) Specifying adaptation semantics. In: WADS ’05: Proceedings of the 2005 workshop on architecting dependable systems, ACM, New York, pp 1–7. doi:10.1145/1083217.1083220

  58. Zhang J, Cheng BHC (2006) Model-based development of dynamically adaptive software. In: ICSE ’06: Proceeding of the 28th international conference on software engineering, ACM, New York, pp 371–380, doi:10.1145/1134285.1134337

Acknowledgments

This paper is based on work supported by NASA under award NNX08AY49A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would like to thank Dr. Paul Miner, Eric Cooper, and Suzette Person of NASA LaRC for their help and guidance on the project.

Corresponding author

Correspondence to Nagabhushan Mahadevan.

About this article

Cite this article

Mahadevan, N., Dubey, A., Balasubramanian, D. et al. Deliberative, search-based mitigation strategies for model-based software health management. Innovations Syst Softw Eng 9, 293–318 (2013). https://doi.org/10.1007/s11334-013-0215-x

  • DOI: https://doi.org/10.1007/s11334-013-0215-x
