Software fault mitigation and availability assurance techniques

  • Original Article
International Journal of System Assurance Engineering and Management

Abstract

Companies are expected to keep their systems up and running and make data continuously available. Several recent studies have established that most system outages are due to software faults. In this paper, we discuss availability aspects of large software-based systems. We begin by classifying software faults into Bohrbugs and Mandelbugs, and identify aging-related bugs as a subtype of the latter. We then examine mitigation methods for Mandelbugs in general and aging-related bugs in particular. Finally, we discuss techniques for quantitative availability assurance of such systems.

Author information

Corresponding author

Correspondence to Kishor S. Trivedi.

About this article

Cite this article

Trivedi, K.S., Grottke, M. & Andrade, E. Software fault mitigation and availability assurance techniques. Int J Syst Assur Eng Manag 1, 340–350 (2010). https://doi.org/10.1007/s13198-011-0038-9

  • DOI: https://doi.org/10.1007/s13198-011-0038-9
