Abstract
Rising software complexity in aerospace systems makes it very difficult to analyze these systems and to prepare for all possible fault scenarios at design time; therefore, classical run-time fault tolerance techniques such as self-checking pairs and triple modular redundancy are used. However, several recent incidents have made it clear that existing software fault tolerance techniques alone are not sufficient. To improve system dependability, simpler, yet formally specified and verified, run-time monitoring, diagnosis, and fault mitigation capabilities are needed. Such architectures are already in use for managing the health of vehicles and systems. Software health management is the application of these techniques to software systems. In this paper, we briefly describe the software health management techniques and architecture developed by our research group. The foundation of the architecture is a real-time component framework (built upon ARINC-653 platform services) that defines a model of computation for software components. Two dedicated architectural elements, the Component Level Health Manager (CLHM) and the System Level Health Manager (SLHM), provide the health management services: anomaly detection, fault source isolation, and fault mitigation. The SLHM includes a diagnosis engine that (1) uses a Timed Failure Propagation Graph (TFPG) model derived from the component assembly model, (2) reasons about cascading fault effects in the system, and (3) isolates the fault source component(s). Thereafter, the appropriate system-level mitigation action is taken. The main focus of this article is the fault mitigation architecture, which uses goal-based deliberative reasoning to determine the best mitigation actions for recovering the system from the identified failure mode.
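As an illustration of the kind of reasoning a TFPG-based diagnosis step performs, the following minimal sketch checks which failure modes could explain a set of timestamped alarms. It is not the diagnosis engine described in this article: the graph, the node names (`FM_sensor`, `D_stale_data`, and so on), and the delay values are hypothetical, every discrepancy is treated as an OR node, and alarms that are expected but missing are ignored.

```python
import math

# Hypothetical TFPG: (source, destination) -> (t_min, t_max) propagation delay in seconds.
EDGES = {
    ("FM_sensor", "D_stale_data"): (1.0, 2.0),
    ("D_stale_data", "D_deadline_miss"): (0.5, 1.5),
    ("FM_filter", "D_deadline_miss"): (0.0, 1.0),
}
FAILURE_MODES = ["FM_sensor", "FM_filter"]


def shortest_delays(edges, source, bound):
    """Smallest cumulative delay from source to each reachable node.

    bound=0 sums t_min values (earliest activation); bound=1 sums t_max
    values (latest activation of an OR node, which fires on the first
    propagation that reaches it).
    """
    dist = {source: 0.0}
    nodes = {n for edge in edges for n in edge}
    for _ in range(len(nodes)):  # Bellman-Ford style relaxation
        changed = False
        for (src, dst), delays in edges.items():
            if src in dist and dist[src] + delays[bound] < dist.get(dst, math.inf):
                dist[dst] = dist[src] + delays[bound]
                changed = True
        if not changed:
            break
    return dist


def consistent_sources(edges, failure_modes, alarms):
    """Return failure modes whose propagation windows can explain all alarms.

    alarms maps a discrepancy name to the time it was observed.
    """
    if not alarms:
        return []
    result = []
    for fm in failure_modes:
        earliest = shortest_delays(edges, fm, 0)
        latest = shortest_delays(edges, fm, 1)
        if not all(d in earliest for d in alarms):
            continue  # some alarm is not reachable from this failure mode
        # The unknown onset time t0 must satisfy, for every alarm d:
        #   t_obs - latest[d] <= t0 <= t_obs - earliest[d]
        lo = max(t - latest[d] for d, t in alarms.items())
        hi = min(t - earliest[d] for d, t in alarms.items())
        if lo <= hi:
            result.append(fm)
    return result


if __name__ == "__main__":
    observed = {"D_stale_data": 10.0, "D_deadline_miss": 11.0}
    print(consistent_sources(EDGES, FAILURE_MODES, observed))
```

In this sketch a failure mode is kept as a candidate only if some single onset time is compatible with every observed alarm, which is a necessary rather than sufficient condition for it to be the true fault source.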
Notes
An ARINC-653 process is a unit of concurrency that is analogous to a thread in a desktop operating system such as Linux.
An interface is a collection of related methods.
In the case of a facet/provided port, each method in the interface (supported by the facet) is assigned a dedicated ARINC-653 process.
The modeling environment and the Linux runtime are available from https://wiki.isis.vanderbilt.edu/mbshm/index.php/Main_Page.
Note: In ACM, a component does not have multiple publishers of the same kind. Hence, at most one publisher of a component services a specific consumer port of another component.
See technical report [32].
See technical report [17] for a detailed discussion.
References
Abdelwahed S, Karsai G, Mahadevan N, Ofsthun SC (2009) Practical considerations in systems diagnosis using timed failure propagation graph models. Instrum Meas IEEE Trans 58(2):240–247
ARINC (2010) ARINC specification 653p1-3: Avionics application software standard interface part 1 - required services. https://www.arinc.com/
Australian Transport Safety Bureau (2005) In-flight upset; 240km NW Perth, WA; Boeing Co 777–200, 9M-MRG. Tech. rep., http://www.atsb.gov.au/publications/investigation_reports/2005/aair/aair200503722.aspx
Australian Transport Safety Bureau (2008) AO-2008-070: In-flight upset, 154 km west of Learmonth, WA, 7 October 2008, VH-QPA, Airbus A330–303. Tech. rep., http://www.atsb.gov.au/publications/investigation_reports/2008/aair/ao-2008-070.aspx
Bailleux O, Boufkhad Y (2003) Efficient CNF encoding of Boolean cardinality constraints. In: Principles and practice of constraint programming-9th international conference (CP 2003), pp 108–122
Barry M (2008) http://www.kestreltechnology.com/downloads/FailsafeOverview.pdf
Bengtsson J, Larsen K, Larsson F, Pettersson P, Yi W (1996) UPPAAL: a tool suite for automatic verification of real-time systems. In: Proceedings of the DIMACS/SYCON workshop on Hybrid systems III— verification and control, Springer-Verlag New York, Inc., Secaucus, pp 232–243
Bustard DW, Sterritt R (2006) A requirements engineering perspective on autonomic systems development. Autonomic computing. Concepts, infrastructure, and applications, pp 19–33
Butler R (2008) A primer on architectural level fault tolerance. Tech. rep., NASA scientific and technical information (STI) Program Office, Report No. NASA/TM-2008-215108, available at http://shemesh.larc.nasa.gov/fm/papers/Butler-TM-2008-215108-Primer-FT.pdf
Charette RN (2009) This car runs on code. IEEE Spectrum 46(3):3 http://www.spectrum.ieee.org/feb09/7649
Cheng BH (2009) Software engineering for self-adaptive systems. In: Chap software engineering for self-adaptive systems: a research roadmap. Springer-Verlag, Berlin, Heidelberg, pp 1–26, doi:10.1007/978-3-642-02161-9_1
Conmy P, McDermid J, Nicholson M (2002) Safety analysis and certification of open distributed systems. In: International system safety conference, Denver
Dashofy EM, van der Hoek A, Taylor RN (2002) Towards architecture-based self-healing systems. In: WOSS ’02: Proceedings of the first workshop on Self-healing systems, ACM Press, New York, pp 21–26, doi:10.1145/582128.582133
Dubey A, Karsai G, Mahadevan N (2011) A component model for hard real-time systems: CCM with ARINC-653. Softw Pract Exp 41(12):1517–1550. doi:10.1002/spe.1083
Dubey A, Karsai G, Mahadevan N (2011) Model-based software health management for real-time systems. In: Aerospace conference, 2011 IEEE, IEEE, pp 1–18
Dubey A, Mahadevan N, Karsai G (2012) A deliberative reasoner for model-based software health management. In: The eighth international conference on autonomic and autonomous systems, doi:10.1109/ISORC.2010.39
Dubey A, Mahadevan N, Karsai G (2012) The inertial measurement unit example: a software health management case study. Tech. Rep. ISIS-12-101, Institute for Software Integrated Systems, Vanderbilt University, http://www.isis.vanderbilt.edu/sites/default/files/TechReport_IMU.pdf
Dubey A, Karsai G, Mahadevan N (2013) Fault-adaptivity in hard real-time component based systems. In: de Lemos R, Giese H, Muller HA, Shaw M (eds) Software engineering for self-adaptive systems II, no. 7475 in Lecture Notes in Computer Science, Springer-Verlag, Berlin, pp 294–323
Eén N, Sörensson N (2003) An extensible SAT-solver. In: Theory and applications of satisfiability testing, 6th international conference (SAT 2003), pp 502–518
Eén N, Sörensson N (2006) Translating pseudo-Boolean constraints into SAT. JSAT 2(1–4):1–26
Garlan D, Cheng SW, Schmerl B (2003) Architecting dependable systems. In: Chap increasing system dependability through architecture-based self-repair. Springer-Verlag, Berlin, pp 61–89, http://dl.acm.org/citation.cfm?id=1768179.1768183
Goldberg A, Horvath G (2007) Software fault protection with ARINC 653. In: Proceeding of IEEE aerospace conference, Montana, pp 1–11
Greenwell WS, Knight JC (2003) What should aviation safety incidents teach us? Technical report, University of Virginia. http://dependability.cs.virginia.edu/publications/safecomp.2003.lessons.pdf
Jagadeesan LJ, Viswanathan R (2005) Passive mid-stream monitoring of real-time properties. In: EMSOFT ’05: Proceedings of the 5th ACM international conference on Embedded software, ACM, New York, pp 343–352, doi:10.1145/1086228.1086291
Johnson SB, Gormley TJ, Kessler SS, Mott CD, Patterson-Hine A, Reichard KM, Scandura PA (2011) System health management: with aerospace applications. Wiley, New York
Laprie JC (1995) Dependable computing and fault tolerance: concepts and terminology. In: Proceeding of twenty-fifth international symposium on fault-tolerant computing, ‘Highlights from Twenty-Five Years’, p 2, http://ieeexplore.ieee.org/iel3/3846/11214/00532603.pdf?arnumber=532603
Laprie JC, Arlat J, Béounes C, Kanoun K (1995) Architectural issues in software fault-tolerance, chapter 2. Software Fault Tolerance http://www.cse.cuhk.edu.hk/lyu/book/sft/pdf/chap3.pdf
Lightstone S (2007) Seven software engineering principles for autonomic computing development. ISSE 3(1):71–74
Lyu MR (1995) Software fault tolerance, Wiley, New York http://www.cse.cuhk.edu.hk/lyu/book/sft/
Lyu MR (2007) Software reliability engineering: a roadmap. In: 2007 Future of software engineering, IEEE computer society, FOSE ’07, Washington, pp 153–170. doi:10.1109/FOSE.2007.24
Mahadevan N, Dubey A, Karsai G (2011) Application of software health management techniques. In: Proceedings of the 6th international symposium on software engineering for adaptive and self-managing systems, SEAMS ’11, ACM, New York, pp 1–10. doi:10.1145/1988008.1988010
Mahadevan N, Dubey A, Balasubramaniam D, Karsai G (2013) Deliberative reasoning in software health management. Tech. Rep. ISIS-13-111, Institute for Software Integrated Systems, Vanderbilt University, http://www.isis.vanderbilt.edu/sites/default/files/TechReport2013.pdf
Marques-Silva J, Lynce I (2007) Towards robust CNF encodings of cardinality constraints. In: Bessière C (ed) Proceedings of 13th international conference on principles and practice of constraint programming (CP2007), LNCS, vol 4741. Springer, Heidelberg, pp 483–497
Mcintyre MDW, Sebring DL (1994) Integrated fault-tolerant air data inertial reference system. Future of Software Engineering, pp 153–170
Potocti de Montalk J (1991) Computer software in civil aircraft. In: Proceedings. IEEE/AIAA 10th digital avionics systems conference, 1991. pp 324–330, doi:10.1109/DASC.1991.177187
de Moura LM, Bjørner N (2008) Z3: An efficient SMT solver. In: Tools and algorithms for the construction and analysis of systems (TACAS), New York, pp 337–340
NASA (2000) Report on the loss of the Mars Polar Lander and Deep Space 2 missions. Tech. rep., NASA, ftp://ftp.hq.nasa.gov/pub/pao/reports/2000/2000_mpl_report_1.pdf
Nicholson M (2007) Health monitoring for reconfigurable integrated control systems. In: Constituents of modern system safety thinking proceedings of the thirteenth safety-critical systems symposium, vol 5, pp 149–162
Ofsthun S (2002) Integrated vehicle health management for aerospace platforms. Instrum Meas Mag IEEE 5(3):21–24. doi:10.1109/MIM.2002.1028368
Pike L, Goodloe A, Morisset R, Niller S (2010) Copilot: a hard real-time runtime monitor. In: Runtime verification, Springer, pp 345–359
Pullum LL (2001) Software fault tolerance techniques and implementation. Artech House, Inc., Norwood
Robertson P, Williams B (2006) Automatic recovery from software failure. Commun ACM 49(3):41–47. doi:10.1145/1118178.1118200
Rohr M, Boskovic M, Giesecke S, Hasselbring W (2006) Models in software engineering, workshops, and symposia at models 2006. In: Proceedings of the workshop “Models@run.time” at the 9th international conference on model driven engineering languages and systems (MoDELS/UML’06), vol 4364. University of Massachusetts, Boston
Sammapun U, Lee I, Sokolsky O (2005) RT-MaC: runtime monitoring and checking of quantitative and probabilistic properties. In: Proceeding of 11th IEEE international conference on embedded and real-time computing systems and applications, pp 147–153. doi:10.1109/RTCSA.2005.84
Schumann J, Srivastava AN, Mengshoel OJ (2010) Who guards the guardians?: toward V&V of health management software. In: Proceedings of the First international conference on Runtime verification, RV’10, Springer-Verlag, Heidelberg, pp 399–404. http://dl.acm.org/citation.cfm?id=1939399.1939432
Sha L (2006) The complexity challenge in modern avionics software. In: National Workshop on aviation software systems, design for certifiably dependable systems, Alexandria
Shaw M (2002) “Self-healing”: softening precision to avoid brittleness: position paper for WOSS ’02: workshop on self-healing systems. In: WOSS ’02: Proceedings of the first workshop on self-healing systems, ACM Press, New York, pp 111–114. doi:10.1145/582128.582152
Sheffels M (1992) A fault-tolerant air data/inertial reference unit. In: Proceedings of IEEE/AIAA 11th digital avionics systems conference, 1992, pp 127–131, doi:10.1109/DASC.1992.282171
Srivastava A, Schumann J (2011) The case for software health management. In: Fourth IEEE international conference on space mission challenges for information technology, 2011. SMC-IT 2011, pp 3–9
Taleb-Bendiab A, Bustard DW, Sterritt R, Laws AG, Keenan F (2005) Model-based self-managing systems engineering. In: DEXA workshops, pp 155–159
Torres-Pomales W (2000) Software fault tolerance: a tutorial. Tech. rep., NASA, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.8307
Tseitin GS (1968) On the complexity of derivations in the propositional calculus. Stud Math Math Log Part II:115–125
Wang N, Schmidt DC, O’Ryan C (2001) Overview of the CORBA component model. In: Component-based software engineering: putting the pieces together, Addison-Wesley Longman Publishing Co., Inc., Boston, pp 557–571
Wang S, Ayoub A, Sokolsky O, Lee I (2012) Runtime verification of traces under recording uncertainty. In: Proceedings of the second international conference on runtime verification, RV’11. Springer-Verlag, Berlin, pp 442–456. doi:10.1007/978-3-642-29860-8_35
Williams B, Ingham M, Chung S, Elliott P (2003) Model-based programming of intelligent embedded systems and robotic space explorers. Proc IEEE 91(1):212–237. doi:10.1109/JPROC.2002.805828
Williams BC, Ingham M, Chung S, Elliott P, Hofbaur M, Sullivan GT (2004) Model-based programming of fault-aware systems. AI Mag 24(4):61–75
Zhang J, Cheng BHC (2005) Specifying adaptation semantics. In: WADS ’05: Proceedings of the 2005 workshop on architecting dependable systems, ACM, New York, pp 1–7. doi:10.1145/1083217.1083220
Zhang J, Cheng BHC (2006) Model-based development of dynamically adaptive software. In: ICSE ’06: Proceeding of the 28th international conference on software engineering, ACM, New York, pp 371–380, doi:10.1145/1134285.1134337
Acknowledgments
This paper is based on work supported by NASA under award NNX08AY49A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would like to thank Dr. Paul Miner, Eric Cooper, and Suzette Person of NASA LaRC for their help and guidance on the project.
Cite this article
Mahadevan, N., Dubey, A., Balasubramanian, D. et al. Deliberative, search-based mitigation strategies for model-based software health management. Innovations Syst Softw Eng 9, 293–318 (2013). https://doi.org/10.1007/s11334-013-0215-x
DOI: https://doi.org/10.1007/s11334-013-0215-x