Availability analysis and improvement of Active/Standby cluster systems using software rejuvenation

https://doi.org/10.1016/S0164-1212(01)00107-8

Abstract

Cluster systems using commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. To improve the availability of personal computer-based Active/Standby cluster systems, we have conducted a study of software rejuvenation, a proactive fault-tolerant approach to handling software-origin system failures. In this paper, we map software rejuvenation and switchover states with a semi-Markov process and obtain mathematical steady-state solutions of the chain. We calculate the availability and the downtime of Active/Standby cluster systems using the solutions and find that software rejuvenation can be used to improve the availability of Active/Standby cluster systems.

Introduction

If the downtime of a system is less than 5 min per year (availability: 99.999%), the system can be classified as a highly available system. As software grows more complex, research on how to implement highly available systems using cluster technology is increasingly in demand. Cluster systems using commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. Moreover, highly available cluster systems are becoming more and more popular for their cost effectiveness (Buyya, 1999). For example, the downtime of duplex systems built by using two clustered low-end personal computers is less than 9 h per year (availability: 99.9%). However, because cluster systems consist of many servers, one must solve the low availability problems caused by the high chance of server software failures (Park et al., 2000; Lee and Lyer, 1995).
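The relationship between an availability figure and yearly downtime quoted above is simple arithmetic. The following is a minimal sketch of that conversion, assuming a 365-day year; the function name and the "nines" labels are illustrative and do not come from the paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored (assumption)


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year, in minutes, for a steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Print the downtime budget for three common availability levels.
for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes = annual_downtime_minutes(a)
    print(f"{label}: {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Under this assumption, 99.999% corresponds to a little over 5 min per year and 99.9% to a little under 9 h per year.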

Generally, software-aging phenomena such as memory leaks and buffer overflows progress quickly in the software of cluster servers due to the loss of communications or data. After rejuvenating cluster systems by buffer flushing, memory cleaning, file system purging, and initialization of the file allocation table, the systems can restart their service from a healthy condition in which the probability of a software failure is very low (Huang et al., 1995a). Software rejuvenation, which intentionally terminates an application or a system and restarts it in a clean internal state, prevents failures from occurring, whereas previous fault-tolerant methods recover from failures after they happen (Huang et al., 1995b). Because the system manager simply decides to stop the operation of cluster servers, cleans the internal state of the server processes, and restarts them, software rejuvenation does not require additional costs. Therefore, the method can be regarded as a proper choice for an application requiring high availability.

The connection of highly available Active/Standby cluster systems is represented in Fig. 1. Through fast access network devices such as asymmetric digital subscriber line (ADSL) modems and cable modems, thousands of clients can join cluster systems. The data in the disk arrays are shared via storage interconnected with all the cluster servers. In the Active/Standby configuration, there is a primary server where the critical application runs and backup servers that are used as spares in standby mode (Buyya, 1999). We investigate the availability of Active/Standby cluster systems with different numbers of backups and different operation policies to evaluate the effect of software rejuvenation.

Due to the fast increase in the size and complexity of software, the frequency of software-originated system failure is much higher than that of hardware-originated system failure. It is also almost impossible to develop error-free software. As the software used in servers begins to age, software faults such as memory loss, file sharing errors, and data damage are prone to occur. However, it is very difficult to detect the failure of a cluster server caused by software aging (errors of this kind are called “heisenbugs” in the fault tolerance field) (Garg et al., 1998). If software faults increase with software aging, the possibility of a system failure becomes high. The following popular techniques have been used for software fault tolerance (Johnson, 1989):

  • Recovery block: if errors occur, the process is re-executed in other modules, which have the same functionality.

  • N-version programming: N independent software modules are executed at the same time. The results are compared and then the majority of the results are selected as the output.

  • N-self checking programming: if a module fails to run, a standby module will continue its operations thereafter.

  • Checkpointing: the intermediate results of a process are saved periodically and, when a failure occurs, the process re-executes its operations not from the beginning but from the latest saved checkpoint.
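As an illustration of one technique in the list above, the following is a minimal sketch of N-version programming with majority voting. The three `square_*` functions are hypothetical stand-ins for independently developed versions, and the strict-majority rule is an assumption made for this example rather than a detail taken from the cited work.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A version that crashes simply casts no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    # Require agreement from more than half of all versions, including crashed ones.
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Hypothetical versions of the same specification; the third one is faulty.
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    return x * x + 1


print(n_version_vote([square_a, square_b, square_faulty], 7))
```

The voter masks the single faulty version, but all N versions must be built and executed, which is the cost the text refers to.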


However, due to their high cost and software complexity, the above-mentioned reactive methods are hardly used to improve the availability of cluster systems. Software rejuvenation is based on the idea of the preventive maintenance techniques that have long been used in the mechanical engineering field. Garg et al. propose the idea of software rejuvenation as a means of availability improvement (Garg et al., 1995a; Garg et al., 1995b; Garg et al., 1997; Pfening et al., 1996). In the calculation of an optimal rejuvenation period and job loss probability, they use buffer size and workload as model parameters. However, they did not consider the cost function, which is used in the evaluation of the rejuvenation policies. Huang et al. considered a cost function, which calculates the cost of the downtime during the rejuvenation and shutdown periods (Huang et al., 1995b; Huang et al., 1996; Wang et al., 1997). However, the state transition diagram is very simple and only the simplex system is analyzed. Garg et al. combine a checkpointing method and software rejuvenation to minimize the completion time of a request (Garg et al., 1996). Levendel uses software rejuvenation to handle reliable data exchange between the server and terminals in a communication environment (Levendel, 1999).

In this paper, we present a proactive fault-tolerant approach called software rejuvenation to improve the availability of cluster systems. We map software rejuvenation and switchover states with a semi-Markov process and obtain mathematical steady-state solutions of the chain. By adopting software rejuvenation, we calculate and improve the availability of cluster systems. To our knowledge, no other research has applied software rejuvenation to cluster servers. The previous studies do not generalize the number of server parameters in their analysis.

The organization of the paper is as follows. In Section 1, we define the problem and address related research. Section 2 presents a system availability model in which the operational states of Active/Standby cluster systems using software rejuvenation are described. In the following section, the model is analyzed and experimental results are given to validate the model solution. Finally, we conclude that software rejuvenation is a viable method and present further research issues.

System model

A state transition diagram of Active/Standby cluster systems concerning software rejuvenation and switchover states is presented in Fig. 2. The assumptions used in the modeling are as follows:

  • Failure rate (λ) and repair rate (μ) of the server are identical at all states.

  • Unstable rate (λu), the speed of escaping the healthy condition, is identical at all states.

  • Rejuvenation rate (λr), the frequency of rejuvenation, is identical at all states.

  • Mean time spent during the rejuvenation process is

Availability

The cluster systems are not available in all of the rejuvenation processes in the normal state (1), all of the switchover states, and the failure state (0). The availability of Active/Standby cluster systems is defined as follows:

$$\text{Availability} = 1 - \left( P_0 + \sum_{j=1}^{k} P_{R_{1,j}} + \sum_{i=2}^{n} \sum_{j=1}^{k} \left( P_{T_{i,j}} + P_{S_{i,j}} \right) \right)$$
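To make the idea of "one minus the probability of the unavailable states" concrete, here is a minimal sketch for a much simpler case than the paper's model: a single server with four states (healthy, unstable, failed, rejuvenating). It is not the Active/Standby chain of Fig. 2. The transition structure, the assumption that all sojourn times are exponential, and the parameter `mu_r` (the rate at which a rejuvenation completes) are all assumptions made for this example; `lam`, `mu`, `lam_u`, and `lam_r` stand for λ, μ, λu, and λr.

```python
def simplex_steady_state(lam, mu, lam_u, lam_r, mu_r):
    """Steady-state probabilities of an assumed four-state single-server chain.

    Assumed transitions: healthy -> unstable (lam_u), unstable -> failed (lam),
    unstable -> rejuvenating (lam_r), failed -> healthy (mu),
    rejuvenating -> healthy (mu_r, a hypothetical parameter).
    """
    if min(lam_u, mu, mu_r) <= 0 or min(lam, lam_r) < 0:
        raise ValueError("lam_u, mu, mu_r must be positive; lam, lam_r non-negative")
    # Balance equations give each probability as a multiple of P(unstable).
    ratio_healthy = (lam + lam_r) / lam_u
    ratio_failed = lam / mu
    ratio_rejuv = lam_r / mu_r
    # Normalisation: the four probabilities must sum to one.
    p_unstable = 1.0 / (1.0 + ratio_healthy + ratio_failed + ratio_rejuv)
    return {
        "healthy": ratio_healthy * p_unstable,
        "unstable": p_unstable,
        "failed": ratio_failed * p_unstable,
        "rejuvenating": ratio_rejuv * p_unstable,
    }


def availability(p):
    """One minus the probability mass of the states in which service is down."""
    return 1.0 - (p["failed"] + p["rejuvenating"])


# Hypothetical per-hour rates, chosen only to exercise the functions.
rates = dict(lam=1 / 720, mu=1 / 2, lam_u=1 / 240, mu_r=6.0)
for lam_r in (0.0, 1 / 24):
    p = simplex_steady_state(lam_r=lam_r, **rates)
    print(f"lam_r={lam_r:.4f}: availability={availability(p):.6f}")
```

Whether rejuvenation helps in this toy chain depends on the chosen rates: it trades time spent in the failed state for time spent rejuvenating, so it pays off only when rejuvenation completes much faster than repair.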

Downtime cost

Predictable shutdown cost is far less than that of unexpected shutdown (Cf ≫ Cr). Downtime cost of Active/Standby cluster systems can be calculated from the unavailability of cluster systems and defined as a function of

Conclusions

Highly available proprietary fault-tolerant systems using tightly coupled hardware and software are expensive to develop and deploy. We have analyzed the availability of Active/Standby cluster systems built with loosely coupled commercially available personal computers. According to the system-operating parameters, we have calculated steady-state probabilities, availability, and downtime cost of Active/Standby cluster systems by adopting a software rejuvenation technique. We have validated the

Acknowledgements

This work is supported in part by the Ministry of Information & Communication of Korea (“Support Project of University Foundation Research 〈2001〉” supervised by IITA) and supported in part by the Ministry of Education of Korea (Brain Korea 21 Project Supervised by Korea Research Foundation).

References (16)

  • Pfening, A. et al., 1996. Optimal rejuvenation for tolerating soft failures. Performance Evaluation.
  • Buyya, R.
  • Garg, S. et al., 1995a. Time and load based software rejuvenation: policy, evaluation and optimality. In: Proceedings...
  • Garg, S. et al., 1995b. Analysis of software rejuvenation using Markov regenerative stochastic petri net. In:...
  • Garg, S. et al., 1996. Minimizing completion time of a program by checkpointing and rejuvenation. In: Proceedings of...
  • Garg, S. et al., 1997. On the analysis of software rejuvenation policies. In: Proceedings of 12th Annual Conference on...
  • Garg, S. et al., 1998. Analysis of preventive maintenance in transactions based software systems. IEEE Transactions on Computers.
  • Huang, Y. et al., 1995. Software tools and libraries for fault tolerance. Bulletin of the Technical Committee on Operating Systems and Application Environment (TCOS).
There are more references available in the full text version of this article.

Kiejin Park was born in Seoul, Korea. He received the B.S. and M.S. degrees in Industrial Engineering from Hanyang University and POSTECH, Korea, in 1989 and 1991, respectively, and the Ph.D. degree from the Department of Computer Engineering, Graduate School of Ajou University, Korea, in 2001. He is currently with the Department of Software, Anyang University in Korea. From 1991 to 1996, he worked in the Computer and Communication Research Center of Samsung Advanced Institute of Technology, Korea, as an Assistant Researcher. From 1996 to 1997, he was with the Software Research and Development Center of Samsung Electronics Co., Korea, as a Senior Researcher. From 2001 to 2002, he worked in the Network Equipment Test Center of Electronics and Telecommunications Research Institute (ETRI) as a Senior Researcher. His research interests include software dependability, fault-tolerant computing, performance evaluation, simulation, multimedia systems, and cluster systems.

Sungsoo Kim was born in Seoul, Korea. He received the B.S. and M.S. degrees in electronic engineering from Sogang University, Korea, in 1982 and 1984, respectively, and the Ph.D. degree in computer science from Texas A&M University, College Station, Texas, in 1995. He is currently an Associate Professor in the Graduate School of Information and Communication, Ajou University in Korea. From 1983 to 1986, he worked in the Research and Development Center of Samsung Electronics Co., Korea. From 1987 to 1996, he was with the Computer and Communications Research Center of Samsung Advanced Institute of Technology, Korea, as a Principal Researcher. His research interests include fault-tolerant computing, performance evaluation, multimedia systems, mobile systems, and cluster systems.
