Availability analysis and improvement of Active/Standby cluster systems using software rejuvenation

doi:10.1016/S0164-1212(01)00107-8

Journal of Systems and Software

Volume 61, Issue 2, 15 March 2002, Pages 121-128

https://doi.org/10.1016/S0164-1212(01)00107-8

Abstract

Cluster systems built from commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. To improve the availability of personal computer-based Active/Standby cluster systems, we study software rejuvenation, a proactive fault-tolerant approach to handling system failures of software origin. In this paper, we model the software rejuvenation and switchover states as a semi-Markov process and derive mathematical steady-state solutions of the chain. Using these solutions, we calculate the availability and the downtime of Active/Standby cluster systems and find that software rejuvenation can improve their availability.

Introduction

A system whose downtime is less than 5 min per year (availability: 99.999%) can be classified as a highly available system. As software grows more complex, ways to implement highly available systems using cluster technology are being studied more actively. Cluster systems built from commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. Moreover, highly available cluster systems are becoming increasingly popular for their cost effectiveness (Buyya, 1999). For example, the downtime of a duplex system built from two clustered low-end personal computers is less than 9 h per year (availability: 99.9%). However, because cluster systems consist of many servers, one must solve the low availability problems caused by the high chance of server software failures (Park et al., 2000; Lee and Iyer, 1995).
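The relation between an availability figure and yearly downtime is simple arithmetic, and it can be checked with a short sketch. The function name and the choice of a 365-day year are assumptions made here for illustration; they do not come from the paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime in minutes for a steady-state availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Print the downtime budget for three common availability classes.
for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    minutes = downtime_minutes_per_year(a)
    print(f"{label}: {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

Under this convention, 99.999% corresponds to roughly 5.3 min per year and 99.9% to roughly 8.8 h per year.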

Generally, software-aging phenomena such as memory leaks and buffer overflows progress quickly in the software of cluster servers because of lost communications or data. After a cluster system is rejuvenated by buffer flushing, memory cleaning, file system purging, and initialization of the file allocation table, it can restart its service from a healthy condition in which the probability of a software failure is very low (Huang et al., 1995a). Software rejuvenation intentionally terminates an application or a system and restarts it in a clean internal state, which prevents failures from occurring, whereas previous fault-tolerant methods recover from failures after they happen (Huang et al., 1995b). Because rejuvenation only requires the system manager to stop the cluster servers, clean the internal state of the server processes, and restart them, it does not incur additional costs. The method can therefore be regarded as a proper choice for applications requiring high availability.

The configuration of a highly available Active/Standby cluster system is shown in Fig. 1. Thousands of clients can connect to the cluster through fast access network devices such as asymmetric digital subscriber line (ADSL) modems and cable modems. The data in the disk arrays are shared via storage that is interconnected with all the cluster servers. In the Active/Standby configuration, a primary server runs the critical application and backup servers are kept as spares in standby mode (Buyya, 1999). To evaluate the effect of software rejuvenation, we investigate the availability of Active/Standby cluster systems with different numbers of backup servers and different operation policies.

Because software is growing rapidly in size and complexity, it is almost impossible to develop error-free software, and software-originated system failures are much more frequent than hardware-originated ones. As the software used in servers ages, software faults such as memory loss, file sharing errors, and data damage become more likely. However, it is very difficult to detect a cluster server failure caused by software aging (errors of this kind are called “heisenbugs” in the fault tolerance field) (Garg et al., 1998). As software faults accumulate with software aging, the probability of a system failure rises. The following popular techniques have been used for software fault tolerance (Johnson, 1989):

• Recovery block: if an error occurs, the process is re-executed in another module that has the same functionality.
• N-version programming: N independent software modules are executed at the same time. Their results are compared, and the majority result is selected as the output.
• N self-checking programming: if a module fails to run, a standby module takes over its operations.
• Checkpointing: the intermediate results of a process task are saved periodically; when a failure occurs, the process re-executes its operations from the latest saved checkpoint rather than from the beginning.
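As an illustration of the voting step in N-version programming, the following minimal sketch runs several versions on the same input and returns the strict-majority result. Everything in it is hypothetical: the function names, the three toy "versions", and the choice to run them sequentially rather than concurrently are assumptions for illustration, not a design taken from the cited work.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], Hashable]], x: int) -> Hashable:
    """Run every version on the same input and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A version that crashes simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    # Require agreement from more than half of all versions, including crashed ones.
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


# Three hypothetical, independently written implementations of "square x";
# the third contains a deliberate bug.
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_buggy(x: int) -> int:
    return x + x


result = n_version_vote([square_a, square_b, square_buggy], 3)
print(result)
```

The faulty version is outvoted by the two correct ones, which is the masking effect the technique relies on; the cost is that every request is computed N times.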

However, because of their high cost and software complexity, the reactive methods mentioned above are rarely used to improve the availability of cluster systems. Software rejuvenation is based on the preventive maintenance techniques that have long been used in mechanical engineering. Garg et al. propose software rejuvenation as a means of improving availability (Garg et al., 1995a; Garg et al., 1995b; Garg et al., 1997; Pfening et al., 1996). In calculating an optimal rejuvenation period and the job loss probability, they use buffer size and workload as model parameters. However, they do not consider a cost function, which is used to evaluate rejuvenation policies. Huang et al. consider a cost function that calculates the cost of downtime during the rejuvenation and shutdown periods (Huang et al., 1995b; Huang et al., 1996; Wang et al., 1997). However, their state transition diagram is very simple and only a simplex system is analyzed. Garg et al. combine checkpointing and software rejuvenation to minimize the completion time of a request (Garg et al., 1996). Levendel uses software rejuvenation to handle reliable data exchange between the server and terminals in a communication environment (Levendel, 1999).

In this paper, we present software rejuvenation, a proactive fault-tolerant approach, to improve the availability of cluster systems. We model the software rejuvenation and switchover states as a semi-Markov process and derive mathematical steady-state solutions of the chain. By adopting software rejuvenation, we calculate and improve the availability of cluster systems. To our knowledge, no other research has applied software rejuvenation to cluster servers, and previous studies do not treat the number of servers as a general parameter in their analysis.

The paper is organized as follows. In Section 1, we define the problem and address related research. Section 2 presents a system availability model that describes the operational states of Active/Standby cluster systems using software rejuvenation. In the following section, the model is analyzed and experimental results are given to validate the model solution. Finally, we conclude that software rejuvenation is a viable method and present further research issues.

Section snippets

System model

A state transition diagram of Active/Standby cluster systems concerning software rejuvenation and switchover states is presented in Fig. 2. The assumptions used in the modeling are as follows:

• The failure rate (λ) and repair rate (μ) of a server are identical in all states.
• The unstable rate (λ_u), the speed of leaving the healthy condition, is identical in all states.
• The rejuvenation rate (λ_r), the frequency of rejuvenation, is identical in all states.
• Mean time spent during the rejuvenation process is

Availability

The cluster system is unavailable during the rejuvenation processes in the normal state (1), in all of the switchover states, and in the failure state (0). The availability of Active/Standby cluster systems is therefore defined as one minus the total probability of these states: $\text{Availability} = 1 - \left( P_{0} + \sum_{j=1}^{k} P_{R_{1,j}} + \sum_{i=2}^{n} \sum_{j=1}^{k} \left( P_{T_{i,j}} + P_{S_{i,j}} \right) \right).$
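The idea of computing availability from steady-state probabilities can be sketched on a much smaller model than the one in Fig. 2. The sketch below assumes a single server with four hypothetical states (healthy, unstable, rejuvenating, failed), exponentially distributed sojourn times (a continuous-time Markov chain rather than the paper's semi-Markov process), and an assumed rejuvenation completion rate `mu_r` that does not appear in the text above. All numeric rates are illustrative, not values from the paper.

```python
from typing import List

STATES = ["healthy", "unstable", "rejuvenating", "failed"]  # hypothetical state names


def build_generator(lam_u: float, lam: float, lam_r: float,
                    mu_r: float, mu: float) -> List[List[float]]:
    """Generator matrix Q of the simplified model; Q[i][j] is the rate from state i to j."""
    n = len(STATES)
    idx = {name: i for i, name in enumerate(STATES)}
    Q = [[0.0] * n for _ in range(n)]
    transitions = [
        ("healthy", "unstable", lam_u),       # software ages
        ("unstable", "failed", lam),          # aged software fails
        ("unstable", "rejuvenating", lam_r),  # planned restart is triggered
        ("rejuvenating", "healthy", mu_r),    # rejuvenation completes (mu_r is assumed)
        ("failed", "healthy", mu),            # repair completes
    ]
    for src, dst, rate in transitions:
        Q[idx[src]][idx[dst]] = rate
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # diagonal is still 0 here, so this is minus the exit rate
    return Q


def steady_state(Q: List[List[float]]) -> List[float]:
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination with partial pivoting."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]  # transpose: one balance equation per row
    A[-1] = [1.0] * n                                    # replace one equation by normalization
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    pi = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * pi[j] for j in range(i + 1, n))
        pi[i] = (b[i] - tail) / A[i][i]
    return pi


def availability(pi: List[float]) -> float:
    """One minus the probability of the down states (rejuvenating and failed)."""
    return 1.0 - (pi[STATES.index("rejuvenating")] + pi[STATES.index("failed")])


rates = dict(lam_u=1 / 240, lam=1 / 720, mu_r=6.0, mu=0.5)  # per hour, illustrative only
a_without = availability(steady_state(build_generator(lam_r=0.0, **rates)))
a_with = availability(steady_state(build_generator(lam_r=1 / 24, **rates)))
print(f"no rejuvenation:   {a_without:.5f}")
print(f"with rejuvenation: {a_with:.5f}")
```

With these illustrative rates, rejuvenation raises availability because a planned restart is much shorter than a repair. That outcome depends on the parameters: if rejuvenation is slow or repair is fast, the planned downtime can outweigh the failures it prevents.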

Downtime cost

The cost of a predictable shutdown is far less than that of an unexpected shutdown (C_f≫C_r). The downtime cost of Active/Standby cluster systems can be calculated from the unavailability of cluster systems and defined as a function of
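Because the snippet above breaks off before the cost function is given, the following is only a hedged sketch of one plausible form, not the paper's definition. It assumes the yearly cost is a weighted sum of the time spent in unexpected downtime and in planned (rejuvenation) downtime, with hypothetical per-hour costs standing in for C_f and C_r. The function name, parameters, and numbers are all illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def yearly_downtime_cost(p_failure: float, p_rejuvenation: float,
                         cost_failure: float, cost_rejuvenation: float,
                         hours: float = HOURS_PER_YEAR) -> float:
    """Assumed linear cost: hours spent in each kind of downtime times its hourly cost."""
    return hours * (p_failure * cost_failure + p_rejuvenation * cost_rejuvenation)


# Illustrative steady-state probabilities and hourly costs (C_f >> C_r).
cost_no_rejuvenation = yearly_downtime_cost(0.002, 0.0, cost_failure=1000.0, cost_rejuvenation=10.0)
cost_rejuvenation = yearly_downtime_cost(0.0002, 0.0007, cost_failure=1000.0, cost_rejuvenation=10.0)
print(f"without rejuvenation: {cost_no_rejuvenation:.2f}")
print(f"with rejuvenation:    {cost_rejuvenation:.2f}")
```

Under this assumed form, trading unexpected downtime for a somewhat larger amount of planned downtime lowers the total cost whenever the hourly cost of a failure is much higher than that of a rejuvenation.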

Conclusions

Highly available proprietary fault-tolerant systems using tightly coupled hardware and software are expensive to develop and deploy. We have analyzed the availability of Active/Standby cluster systems built with loosely coupled commercially available personal computers. According to the system-operating parameters, we have calculated steady-state probabilities, availability, and downtime cost of Active/Standby cluster systems by adopting a software rejuvenation technique. We have validated the

Acknowledgements

This work is supported in part by the Ministry of Information & Communication of Korea (“Support Project of University Foundation Research 〈2001〉” supervised by IITA) and supported in part by the Ministry of Education of Korea (Brain Korea 21 Project Supervised by Korea Research Foundation).

References (16)

Pfening, A., 1996. Optimal rejuvenation for tolerating soft failures. Performance Evaluation.
Buyya, R.
Garg, S. et al., 1995a. Time and load based software rejuvenation: policy, evaluation and optimality. In: Proceedings...
Garg, S. et al., 1995b. Analysis of software rejuvenation using Markov regenerative stochastic petri net. In:...
Garg, S. et al., 1996. Minimizing completion time of a program by checkpointing and rejuvenation. In: Proceedings of...
Garg, S. et al., 1997. On the analysis of software rejuvenation policies. In: Proceedings of 12th Annual Conference on...
Garg, S., 1998. Analysis of preventive maintenance in transactions based software systems. IEEE Transactions on Computers.
Huang, Y., 1995. Software tools and libraries for fault tolerance. Bulletin of the Technical Committee on Operating Systems and Application Environment (TCOS).

There are more references available in the full text version of this article.


Kiejin Park was born in Seoul, Korea. He received the B.S. and M.S. degrees in Industrial Engineering from Hanyang University and POSTECH, Korea, in 1989 and 1991, respectively, and the Ph.D. degree from the Department of Computer Engineering, Graduate School of Ajou University, Korea, in 2001. He is currently with the Department of Software, Anyang University, Korea. From 1991 to 1996, he worked in the Computer and Communication Research Center of Samsung Advanced Institute of Technology, Korea, as an Assistant Researcher. From 1996 to 1997, he was with the Software Research and Development Center of Samsung Electronics Co., Korea, as a Senior Researcher. From 2001 to 2002, he worked in the Network Equipment Test Center of Electronics and Telecommunications Research Institute (ETRI) as a Senior Researcher. His research interests include software dependability, fault-tolerant computing, performance evaluation, simulation, multimedia systems, and cluster systems.

Sungsoo Kim was born in Seoul, Korea. He received the B.S. and M.S. degrees in electronic engineering from Sogang University, Korea, in 1982 and 1984, respectively, and the Ph.D. degree in computer science from Texas A&M University, College Station, Texas, in 1995. He is currently an Associate Professor in the Graduate School of Information and Communication, Ajou University, Korea. From 1983 to 1986, he worked in the Research and Development Center of Samsung Electronics Co., Korea. From 1987 to 1996, he was with the Computer and Communications Research Center of Samsung Advanced Institute of Technology, Korea, as a Principal Researcher. His research interests include fault-tolerant computing, performance evaluation, multimedia systems, mobile systems, and cluster systems.
