A comparative experimental study of software rejuvenation overhead
Introduction
Seventeen years ago, the notion of software aging was formally introduced in [1]. Since then, much theoretical and experimental research has been conducted in order to characterize and understand this important phenomenon. Software aging can be understood as a continual and growing degradation of the software’s internal state and/or its operating environment during its execution. A general characteristic of this phenomenon is gradual performance degradation and/or an increase in the software failure rate [2]. Aging in a software system, as in human beings, is a cumulative process. The accumulating effects of successive internal error occurrences [3] directly influence the aging-related failure manifestation. Software aging effects are the practical consequence of errors caused by software fault activations. They work by gradually leading the system’s erroneous state towards a failure occurrence. This gradual shift, a consequence of the accumulation of aging effects, is the fundamental nature of the software aging phenomenon [2]. It is important to highlight that a system fails due to the consequences of aging effects accumulated over time. For example, a given aged application fails due to insufficient available physical memory caused by the accumulation of memory leaks. In this case, the software fault causing memory leaks is a defect in the program code that prevents the reuse of memory that was previously allocated but is no longer in use; thus the memory leak is the observed effect of an aging-related fault being activated. The input patterns that exercise the code fragment where the aging-related fault is located are called the aging factors [2]. Hence, aging-related faults may remain dormant until their activation by the aging factors. The activation time can be represented as a random variable. 
This unpredictability in the manifestation of aging effects explains why locating and removing aging-related faults is very costly in terms of time and human resources [2], [3]. Common causes of software aging are the accumulation of numerical errors, greedy resource allocation policies, unsafe resource release strategies, and degradation problems such as file system and memory fragmentation. Most of these problems are caused by bad software design or faulty code. Regardless of the cause of aging, the presence of aging factors has a deleterious effect on the dependability of software systems.
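The accumulation process described above can be illustrated with a minimal simulation sketch. It is not taken from the study: the function name `simulate_leak`, the memory size, the leak size per activation, and the activation probability are all hypothetical values chosen for illustration. Each request activates the aging-related fault only with some probability (standing in for the aging factors), and the failure occurs once the accumulated leaks exhaust the available memory.

```python
import random


def simulate_leak(total_mb=1024.0, leak_mb=4.0, activation_prob=0.05,
                  max_requests=100_000, seed=0):
    """Return the request index at which memory is exhausted, or None.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    leaked_mb = 0.0
    for request in range(1, max_requests + 1):
        # Aging factor: only some input patterns exercise the faulty code.
        if rng.random() < activation_prob:
            leaked_mb += leak_mb  # memory allocated but never released
        if leaked_mb >= total_mb:
            return request  # aging-related failure: memory exhausted
    return None  # no failure within the observed horizon


if __name__ == "__main__":
    # Different seeds give different failure points for the same fault,
    # which is why the activation time is treated as a random variable.
    print([simulate_leak(seed=s) for s in range(5)])
```

Running the sketch with several seeds shows the same fault leading to failure after a different number of requests each time, which is the unpredictability that makes aging-related faults hard to locate.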
In order to mitigate the effects of software aging, the concept of software rejuvenation was proposed in [1]. This is a preventive maintenance technique that helps to postpone or prevent the occurrence of failures attributable to the aging effects. Many ideas have been proposed to implement software rejuvenation. One example is stopping and restarting an application process that is suffering from aging (e.g., accumulated memory leaks). This approach aims to prevent or postpone an unexpected application failure that could cause data loss or even more serious consequences. The main concern, then, is deciding when to trigger the rejuvenation mechanism. Software rejuvenation approaches can be classified into two main groups: time-based and inspection-based strategies. In time-based strategies, rejuvenation is applied regularly at predetermined time intervals. Time-based strategies are widely used in real environments, such as web servers [4].
In contrast to this approach, inspection-based rejuvenation measures the progress of the aging effects and triggers the chosen rejuvenation mechanism when that progress crosses a prespecified limit. Among inspection-based strategies, we found three different approaches to determine the optimal moment for triggering the rejuvenation based on the system state: threshold-based, prediction-based, and a mixed approach. In threshold-based approaches, a human expert fixes in advance a threshold for every metric under consideration that is an aging indicator [5], [6]. In prediction-based approaches, some prediction method is applied to predict the time to the exhaustion of resources or the time to failure caused by the software aging. Then, the rejuvenation trigger epoch is decided based on the predicted time to exhaustion. In this case, we can find different prediction methods in use: machine learning, statistical approaches, or structural models [7], [8], [9], [10]. More recently, we also find some papers combining both approaches, using prediction methods to determine the optimal threshold to trigger the rejuvenation [11]. Fig. 1 graphically presents this classification of rejuvenation scheduling. There is also an orthogonal classification based on whether the whole process of prediction and scheduling of rejuvenation is carried out off-line [8], [9], [12] or on-line [5], [13], [14], [15].
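The difference between the threshold-based and prediction-based triggers can be sketched in a few lines. This is only an illustration under stated assumptions: the aging indicator is taken to be used memory in MB, the prediction method is a plain least-squares linear trend (one simple statistical choice among the many methods cited above), and the names `threshold_trigger`, `time_to_exhaustion`, `prediction_trigger`, and `safety_margin_s` are hypothetical.

```python
from typing import Optional, Sequence


def threshold_trigger(used_mb: float, limit_mb: float) -> bool:
    """Threshold-based: fire once the indicator crosses an expert-set limit."""
    return used_mb >= limit_mb


def time_to_exhaustion(times_s: Sequence[float], used_mb: Sequence[float],
                       capacity_mb: float) -> Optional[float]:
    """Estimate seconds until capacity is reached, using a linear trend.

    Returns None when there is no usable upward trend in the samples.
    """
    n = len(times_s)
    if n < 2 or n != len(used_mb):
        return None
    mean_t = sum(times_s) / n
    mean_u = sum(used_mb) / n
    sxx = sum((t - mean_t) ** 2 for t in times_s)
    if sxx == 0:
        return None
    # Ordinary least-squares slope and intercept of usage over time.
    slope = sum((t - mean_t) * (u - mean_u)
                for t, u in zip(times_s, used_mb)) / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking: no exhaustion predicted
    intercept = mean_u - slope * mean_t
    exhaustion_time = (capacity_mb - intercept) / slope
    return max(0.0, exhaustion_time - times_s[-1])


def prediction_trigger(times_s: Sequence[float], used_mb: Sequence[float],
                       capacity_mb: float, safety_margin_s: float) -> bool:
    """Prediction-based: fire when predicted exhaustion is within the margin."""
    remaining = time_to_exhaustion(times_s, used_mb, capacity_mb)
    return remaining is not None and remaining <= safety_margin_s


if __name__ == "__main__":
    times = [0.0, 10.0, 20.0, 30.0]       # sample timestamps (s)
    usage = [100.0, 120.0, 140.0, 160.0]  # hypothetical used memory (MB)
    print(threshold_trigger(usage[-1], limit_mb=180.0))
    print(time_to_exhaustion(times, usage, capacity_mb=200.0))
    print(prediction_trigger(times, usage, 200.0, safety_margin_s=30.0))
```

With these made-up samples the threshold has not yet been crossed, while the trend already predicts exhaustion within the safety margin, so the prediction-based trigger fires earlier. A mixed approach would use such a prediction to set the threshold itself.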
All rejuvenation strategies, in general, have in common the fact that the rejuvenation mechanism usually involves stopping the aged software to refresh its internal state. Service downtime is therefore common while the rejuvenation is being carried out. Hence, many papers in this field have concentrated on reducing (e.g., [12], [13], [16]) or even avoiding (e.g., [5], [7], [14]) service downtime during a software rejuvenation execution. This is the reason for the importance of properly scheduling software rejuvenation. In order to determine the best time epochs for triggering software rejuvenation, analytic models [17], monitoring of system resources followed by statistical analysis [12], [18], or their combination [4], [15] have been advocated.
As seen from the literature, many different approaches (e.g., [4], [7], [19], [20]) have been proposed to deal with important issues when implementing software rejuvenation. However, to the best of our knowledge, there is no study comparing the effectiveness of these techniques under the same experimental conditions. In all the previous studies, the main goal was to determine the optimal time epoch to trigger specific rejuvenation mechanisms; however, these studies do not take into account the differences in rejuvenation overhead.
To contribute to the body of knowledge in this area, this paper presents a comparative experimental study of different rejuvenation techniques, covering all different levels of rejuvenation granularity investigated so far: application level, operating system (OS) level, virtualization level, and physical node level. The main purpose of this study is a comparative evaluation of the overhead caused by the rejuvenation strategies according to their granularity. Our experimental evaluation focuses on the performance overhead and the memory fragmentation overhead caused by the six rejuvenation strategies under evaluation. We also analyze the effectiveness of the rejuvenation techniques in removing the aging effects. We note that the beneficial effects of rejuvenation have been well quantified in previous works; here we quantify and compare the overhead incurred upon triggering rejuvenation. Rejuvenation scheduling methods should attempt to balance the beneficial effect of rejuvenation against its overhead. This paper is a major extension of our previously published paper in WoSAR 2011 [21]. The main new contributions of this paper are: (i) an analysis of the rejuvenation overhead with respect to memory fragmentation and its consequences; (ii) a more comprehensive analysis of the results, with guidelines on the pros and cons of the different rejuvenation strategies in dealing with the aging effects and in minimizing the rejuvenation overhead; and (iii) a detailed discussion of the different rejuvenation scheduling options available, together with guidelines to help design effective rejuvenation scheduling algorithms.
The rest of this paper is organized as follows. Section 2 revisits the fundamentals of memory-related aging effects, particularly the memory leaks and memory fragmentation that are investigated in our experimental study. Section 3 provides the basics of the selected rejuvenation strategies. Section 4 describes the methodology used to conduct the experiments, emphasizing the experimental plan and the instrumentation. Section 5 discusses comparative results. Section 6 presents different rejuvenation maintenance policies known for their ability to improve the behavior of systems suffering the effects of aging. Finally, Section 7 presents our conclusions and final remarks.
Revisiting memory-related software aging
The most prevalent aging effects investigated in the literature are memory related, specifically memory leaks. Another important memory-related aging effect is the memory fragmentation, but unlike the case of memory leaks, it has not been extensively investigated in the context of software aging. The difficulty involved in experimentally measuring memory fragmentation in a real system is significant, and that is probably one reason for the lack of experimental studies on this topic. Note that
Rejuvenation granularities
Our study is focused on the experimental comparison of different rejuvenation techniques under the same operational conditions, measuring their overhead on the performance, as well as their effectiveness in removing the aging effects. We have classified the rejuvenation techniques based on their granularity. We define granularity as the level that the rejuvenation mechanism directly targets. Based on the system architecture, we can define five main rejuvenation granularity levels while
Experimental setup
In this section we present the instrumentation and experimental plan adopted in our study.
Analysis of experimental results
In this section we present the results obtained from our experiments. We compare them mainly in terms of the execution overhead that each evaluated rejuvenation technique causes on the target application, especially on the client side. We also investigate the effectiveness of the rejuvenation execution on the server side in terms of memory consumption and fragmentation. Note that the values presented in this section are averages over five replications for each experiment.
Rejuvenation scheduling
In order to enhance the availability and mitigate the software aging effects, it is critical to design rejuvenation maintenance policies. Two main rejuvenation scheduling approaches can be defined: time-based and inspection-based.
Time-based approaches trigger rejuvenation at predetermined points in time. If, based on monitoring or root cause detection techniques, we are able to determine the state of various aging indicators, then we can apply rejuvenation on the exact aged layer
Conclusions
This paper presents an experimental evaluation of six rejuvenation strategies categorized in terms of granularity. We have conducted eight experiments with virtualized and non-virtualized environments in order to quantify the influence of this technology on the overhead of the rejuvenation strategies under consideration. The results show that the overhead impact of the rejuvenation techniques is related to their granularity. Fine-grain techniques such as application-level rejuvenation
Acknowledgments
This research was supported in part by the NASA Office of Safety and Mission Assurance (OSMA) Software Assurance Research Program (SARP) under the JPL subcontract # 1440119. We also thank CNPq (National Research Council of Brazil) for the financial support. We also thank Daniel Tes for his help during the test bed setup.
References (47)
- et al., Performability analysis of clustered systems with rejuvenation under varying workload, Performance Evaluation (2007)
- et al., Analysis of a two-level software rejuvenation policy, Reliability Engineering & System Safety (2005)
- Y. Huang, C. Kintala, N. Kolettis, N. Fulton, Software rejuvenation: analysis, module and applications, in: The 15th...
- M. Grottke, R. Matias, K. Trivedi, The fundamentals of software aging, in: The 1st Intl. Workshop on Software Aging and...
- et al., Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing (2004)
- et al., A comprehensive model for software rejuvenation, IEEE Transactions on Dependable and Secure Computing (2005)
- et al., High-available grid services through the use of virtualized clustering
- et al., Using virtualization to improve software rejuvenation, IEEE Transactions on Computers (2009)
- et al., An experimental study on software aging and rejuvenation in web servers
- A. Andrzejak, L. Silva, Using machine learning for non-intrusive modeling and prediction of software aging, in:...
- Analysis and implementation of software rejuvenation in cluster systems
- Software rejuvenation in eucalyptus cloud computing infrastructure: a hybrid method based on multiple thresholds and time series prediction, International Transactions on Systems Science and Applications
- Analysis of software aging in a web server, IEEE Transactions on Reliability
- Proactive management of software aging, IBM Journal of Research and Development
- Using virtualization to improve software rejuvenation
- Statistical non-parametric algorithms to estimate the optimal software rejuvenation schedule
- On-board preventive maintenance: analysis of effectiveness and optimal duty period
- A methodology for detection and estimation of software aging
- Software rejuvenation in embedded systems, Journal of Automata, Languages and Combinatorics
- Fast software rejuvenation of virtual machine monitors, IEEE Transactions on Dependable and Secure Computing
Javier Alonso received the master’s degree in Computer Science in 2004 and the Ph.D. degree in 2011, both from the Technical University of Catalonia (Universitat Politecnica de Catalunya, UPC). From 2006 to 2011 he was an assistant lecturer at the Computer Architecture Department at UPC. He is currently a postdoctoral associate under the supervision of Professor K.S. Trivedi, at Duke University, Durham, NC. Dr. Alonso has served as a reviewer for IEEE TRANSACTIONS ON COMPUTERS, PERFORMANCE EVALUATION, and several international conferences. His research interests are in dependability, reliability, availability, performance, performability and survivability modeling of computer and communication systems.
Rivalino Matias, Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in Computer Science, and Industrial and Systems Engineering from the Federal University of Santa Catarina, Brazil, respectively. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of Dr. Kishor Trivedi. He also works for IBM Research Triangle Park on research related to analytical modeling of embedded system availability and reliability. He is currently an Associate Professor in the Computer School at Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF SYSTEMS AND SOFTWARE, and several international conferences. His research interests include dependability applied to computing systems, software aging theory, and diagnosis protocols for computing systems.
Elder Vicente received his B.S. (2007) in Control and Automation Engineering from the Polytechnic School of Uberlândia, Brazil. He earned his M.S. (2012) degree in Computer Science from the Federal University of Uberlândia, Brazil. His recent research interests include techniques for optimizing and evaluating the performance of operating systems.
Ana Maria Martins Carvalho received her B.S. (1998) in Informatics from Uberaba University, Brazil. She earned her M.S. (2012) degree in Computer Science from Federal University of Uberlândia, Brazil. Her recent research interests include statistical techniques applied to computing systems and computer network traffic analysis.
Kishor S. Trivedi (M’86–SM’87–F’92) holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well known text entitled, Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books entitled Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers; and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers. He is a Golden Core Member of IEEE Computer Society. He has published over 420 articles, and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF RISK AND RELIABILITY, INTERNATIONAL JOURNAL OF PERFORMABILITY ENGINEERING, AND INTERNATIONAL JOURNAL OF QUALITY AND SAFETY ENGINEERING. He is the recipient of IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation.
His research interests are in reliability, availability, performance, performability and survivability modeling of computer and communication systems. He works closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability, and performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.