Abstract
As software and software-intensive systems become increasingly ubiquitous, the impact of failures can be tremendous. In industries such as aerospace, medical devices, and automotive, such failures can cost lives or endanger mission success. Software faults can arise from the interaction between the software, the hardware, and the operating environment. Unanticipated environmental changes can lead to software anomalies that may have a significant impact on the overall success of the mission. Latent coding errors can trigger faults at any time during system operation, even though significant effort has usually been expended on verification and validation (V&V) of the software system. It is becoming increasingly apparent that pre-deployment V&V is not enough to guarantee that a complex software system meets all safety, security, and reliability requirements. Software Health Management (SWHM) is a new field concerned with the development of tools and technologies that enable automated detection, diagnosis, prediction, and mitigation of adverse events due to software anomalies while the system is in operation. The prognostic capability of SWHM to detect and diagnose failures before they happen will yield safer and more dependable systems for the future. This paper addresses the motivation, needs, and requirements of software health management as a new discipline and argues for its use in safety-critical applications.
Notes
In this article, we use the term host system for the system that is undergoing health management. The host system may comprise hardware, software, or a combination thereof.
Acknowledgments
The authors would like to thank Eric Cooper, Paul Miner, Robert Mah, Claudia Meyer, Serdar Uckun, Gabor Karsai, and the NASA partners working on software health management. The authors would also like to thank the reviewers for their valuable comments. This article was written with the support of the NASA Aviation Safety Program Integrated Vehicle Health Management project and NASA’s OSMA SARP project “Advanced tools and techniques for V&V of IVHM systems”. This paper is a substantially revised and extended version of a paper presented at SMC-IT 2011.
Cite this article
Srivastava, A.N., Schumann, J. Software health management: a necessity for safety critical systems. Innovations Syst Softw Eng 9, 219–233 (2013). https://doi.org/10.1007/s11334-013-0212-0