Abstract
Companies are expected to keep their systems up and running and to make data continuously available. Several recent studies have established that most system outages are due to software faults. In this paper, we discuss availability aspects of large software-based systems. We begin by classifying software faults into Bohrbugs and Mandelbugs, and identify aging-related bugs as a subtype of the latter. We then examine mitigation methods for Mandelbugs in general and aging-related bugs in particular. Finally, we discuss techniques for quantitative availability assurance of such systems.
Cite this article
Trivedi, K.S., Grottke, M. & Andrade, E. Software fault mitigation and availability assurance techniques. Int J Syst Assur Eng Manag 1, 340–350 (2010). https://doi.org/10.1007/s13198-011-0038-9
DOI: https://doi.org/10.1007/s13198-011-0038-9