Elsevier

Performance Evaluation

Volume 70, Issue 3, March 2013, Pages 231-250
A comparative experimental study of software rejuvenation overhead

https://doi.org/10.1016/j.peva.2012.09.002

Abstract

In this paper we present a comparative experimental study of the main software rejuvenation techniques developed so far to mitigate software aging effects. We consider six rejuvenation techniques with different levels of granularity: (i) physical node reboot, (ii) virtual machine reboot, (iii) OS reboot, (iv) fast OS reboot, (v) standalone application restart, and (vi) application rejuvenation by a hot standby server. We conduct a set of experiments injecting memory leaks at the application level. We evaluate the performance overhead introduced by software rejuvenation in terms of throughput loss, failed requests, slow requests, and memory fragmentation overhead. We also analyze the selected rejuvenation techniques’ efficiency in mitigating the aging effects. Due to the growing adoption of virtualization technology, we also analyze the overhead of the rejuvenation techniques in virtualized environments. The results show that the performance overheads introduced by the rejuvenation techniques are related to the granularity level. We also capture different levels of memory fragmentation overhead induced by virtualization, demonstrating some drawbacks of using virtualization in comparison with non-virtualized rejuvenation approaches. Finally, based on these research findings we present comprehensive guidelines to support decision making during the design of rejuvenation scheduling algorithms, as well as in selecting the appropriate rejuvenation mechanism.

Introduction

Seventeen years ago, the notion of software aging was formally introduced in [1]. Since then, much theoretical and experimental research has been conducted in order to characterize and understand this important phenomenon. Software aging can be understood as a continual and growing degradation of the software’s internal state and/or its operating environment during its execution. A general characteristic of this phenomenon is gradual performance degradation and/or an increase in the software failure rate [2]. Aging in a software system, as in human beings, is an accumulative process. The accumulating effects of successive internal error occurrences [3] directly influence the aging-related failure manifestation. Software aging effects are the practical consequence of errors caused by software fault activations. They work by gradually leading the system’s erroneous state towards a failure occurrence. This gradual shifting as a consequence of aging effects accumulation is the fundamental nature of the software aging phenomenon [2]. It is important to highlight that a system fails due to the consequences of aging effects accumulated over time. For example, a given aged application fails due to insufficiency of available physical memory caused by the accumulation of memory leaks. In this case, the software fault causing memory leaks is a defect in the program code that prevents the reuse of memory that was previously allocated but is no longer in use; thus the memory leak is the observed effect of an aging-related fault being activated. The input patterns that exercise the code fragment where the aging-related fault is located are called the aging factors [2]. Hence, aging-related faults may remain dormant until their activation by the aging factors. The activation time can be represented as a random variable.

This unpredictability in the manifestation of aging effects explains why locating and removing aging-related faults is very costly in terms of time and human resources [2], [3]. Common causes of software aging are the accumulation of numerical errors, greedy resource allocation policies, unsafe resource release strategies, and degradation problems such as file system and memory fragmentation. Most of these problems are caused by bad software design or faulty code. Regardless of the cause of aging, the presence of aging factors has a deleterious effect on the dependability of software systems.
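The memory leak example above can be illustrated with a minimal simulation sketch. It is not the experimental setup used in this paper; the function name, the parameters (`total_mb`, `leak_mb`, `activation_prob`, `max_requests`), and the per-request activation model are all assumptions chosen only to show how a dormant fault, activated at random by the aging factors, accumulates leaked memory until a failure occurs.

```python
import random


def requests_until_failure(total_mb, leak_mb, activation_prob,
                           max_requests=1_000_000, seed=0):
    """Count requests served before leaked memory exhausts total_mb.

    Hypothetical model: each request activates the aging-related fault
    with probability activation_prob and then leaks leak_mb of memory.
    """
    rng = random.Random(seed)
    leaked = 0.0
    served = 0
    while leaked < total_mb and served < max_requests:
        served += 1
        # The fault stays dormant unless this request exercises the
        # faulty code path (the "aging factor" in the text).
        if rng.random() < activation_prob:
            leaked += leak_mb
    return served


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.1):
        n = requests_until_failure(total_mb=100, leak_mb=1, activation_prob=p)
        print(f"activation probability {p}: failure after {n} requests")
```

Under these assumptions, a lower activation probability postpones the failure but does not prevent it, which is consistent with treating the activation time as a random variable.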

In order to mitigate the effects of software aging, the concept of software rejuvenation was proposed in [1]. This is a preventive maintenance technique that helps to postpone or prevent the occurrence of failures attributable to the aging effects. Many ideas have been proposed to implement software rejuvenation. One example is stopping and restarting an application process that is suffering from aging (e.g., accumulated memory leaks). This approach aims to prevent or postpone an unexpected application failure that could cause data loss or even more serious consequences. The main concern, then, is deciding the instant at which to trigger the rejuvenation mechanism. Software rejuvenation approaches can be classified into two main groups: time-based and inspection-based strategies. In time-based strategies, rejuvenation is applied regularly, at predetermined time intervals. Time-based strategies are widely used in real environments, such as web servers [4].

In contrast to this approach, inspection-based rejuvenation is based on measuring the progress of the aging effects and, when it crosses a certain prespecified limit, triggers the chosen rejuvenation mechanism. In inspection-based strategies, we found three different approaches to determine the optimal moment of triggering the rejuvenation based on the system state: threshold-based, prediction-based, and a mixed approach. In threshold-based approaches, a threshold is pre-fixed by a human expert for every metric under consideration that is an aging indicator [5], [6]. In the prediction-based approaches, some prediction method is applied to predict the time to the exhaustion of resources or the time to failure caused by the software aging. Then, the rejuvenation trigger epoch is decided based on the predicted time to exhaustion. In this case, we can find different prediction methods in use: machine learning, statistical approaches, or structural models [7], [8], [9], [10]. More recently, we also find some papers combining both these approaches, using prediction methods to determine the optimal threshold to trigger the rejuvenation [11]. Fig. 1 graphically presents this classification of rejuvenation scheduling. There is also an orthogonal classification based on whether the whole process of prediction and scheduling of rejuvenation is carried out off-line [8], [9], [12] or on-line [5], [13], [14], [15].
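The two basic inspection-based triggers described above can be sketched as follows. This is an illustrative sketch only: the cited works use a variety of machine learning, statistical, and structural models, whereas the code below assumes a simple least-squares linear trend over memory samples. All function names, the `capacity_mb` and `safety_margin_s` parameters, and the sample format are hypothetical.

```python
def threshold_trigger(used_mb, threshold_mb):
    """Threshold-based: fire when the aging indicator crosses a preset limit."""
    return used_mb >= threshold_mb


def predict_time_to_exhaustion(samples, capacity_mb):
    """Prediction-based: estimate seconds until memory reaches capacity_mb.

    samples is a list of (time_s, used_mb) pairs. A least-squares line is
    fitted (an assumed, deliberately simple predictor). Returns the time
    remaining after the last sample, or None if there is no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # memory use is flat or shrinking: no exhaustion predicted
    intercept = mean_m - slope * mean_t
    t_exhaust = (capacity_mb - intercept) / slope
    return max(0.0, t_exhaust - samples[-1][0])


def prediction_trigger(samples, capacity_mb, safety_margin_s):
    """Fire when the predicted time to exhaustion falls within the margin."""
    tte = predict_time_to_exhaustion(samples, capacity_mb)
    return tte is not None and tte <= safety_margin_s


if __name__ == "__main__":
    history = [(0, 100), (10, 110), (20, 120)]  # leaking about 1 MB/s
    print(threshold_trigger(used_mb=120, threshold_mb=150))
    print(predict_time_to_exhaustion(history, capacity_mb=200))
    print(prediction_trigger(history, capacity_mb=200, safety_margin_s=100))
```

A mixed approach, in this simplified picture, would use the predictor to choose the value passed as `threshold_mb` rather than relying on a limit fixed in advance by a human expert.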

All rejuvenation strategies, in general, have in common the fact that the rejuvenation mechanism usually involves stopping the aged software to refresh its internal state. During this process, service downtime is to be expected while the rejuvenation is being carried out. Hence, many papers in this field have concentrated on reducing (e.g., [12], [13], [16]) or even avoiding (e.g., [5], [7], [14]) service downtime during a software rejuvenation execution. This is the reason for the importance of properly scheduling software rejuvenation. In order to determine the best time epochs for triggering software rejuvenation, analytic models [17], monitoring of system resources followed by statistical analysis [12], [18], or a combination of both [4], [15] have been advocated.

As seen from the literature, many different approaches (e.g., [4], [7], [19], [20]) have been proposed to deal with important issues when implementing software rejuvenation. However, to the best of our knowledge, there is no study comparing the effectiveness of these techniques under the same experimental conditions. In all the previous studies, the main goal was to determine the optimal time epoch to trigger specific rejuvenation mechanisms; however, these studies do not take into account the differences in rejuvenation overhead.

To contribute to the body of knowledge in this area, this paper presents a comparative experimental study of different rejuvenation techniques, covering all levels of rejuvenation granularity investigated so far: application level, operating system (OS) level, virtualization level, and physical node level. The main purpose of this study is a comparative evaluation of the overhead caused by the rejuvenation strategies according to their granularity. Our experimental evaluation is focused on the performance overhead and the memory fragmentation overhead caused by the six rejuvenation strategies under evaluation. We also analyze the effectiveness of the rejuvenation techniques in removing the aging effects. We note that the beneficial effects of rejuvenation have been well quantified in previous works; here we quantify and compare the overhead incurred upon triggering rejuvenation. Rejuvenation scheduling methods should attempt to balance the beneficial effect of rejuvenation against its overhead. This paper is a major extension of our previously published paper in WoSAR 2011 [21]. The main new contributions of this paper are: (i) an analysis of the rejuvenation overhead with respect to memory fragmentation and its consequences; (ii) a more comprehensive analysis of the results, together with a guideline on the pros and cons of the different rejuvenation strategies in dealing with the aging effects and in minimizing the rejuvenation overhead; and finally (iii) a detailed discussion of the different rejuvenation scheduling options available, together with guidelines to help design effective rejuvenation scheduling algorithms.

The rest of this paper is organized as follows. Section 2 revisits the fundamentals of memory-related aging effects, particularly memory leak and memory fragmentation, which are investigated in our experimental study. Section 3 provides the basics of the selected rejuvenation strategies. Section 4 describes the methodology used to conduct the experiments, emphasizing the experimental plan and the instrumentation. Section 5 discusses comparative results. Section 6 presents different rejuvenation maintenance policies known for their ability to improve the behavior of systems suffering the effects of aging. Finally, Section 7 presents our conclusions and final remarks.

Revisiting memory-related software aging

The most prevalent aging effects investigated in the literature are memory related, specifically memory leaks. Another important memory-related aging effect is the memory fragmentation, but unlike the case of memory leaks, it has not been extensively investigated in the context of software aging. The difficulty involved in experimentally measuring memory fragmentation in a real system is significant, and that is probably one reason for the lack of experimental studies on this topic. Note that

Rejuvenation granularities

Our study is focused on the experimental comparison of different rejuvenation techniques under the same operational conditions, measuring their overhead on the performance, as well as their effectiveness in removing the aging effects. We have classified the rejuvenation techniques based on their granularity. We define granularity as the level that the rejuvenation mechanism directly targets. Based on the system architecture, we can define five main rejuvenation granularity levels while

Experimental setup

In this section we present the instrumentation and experimental plan adopted in our study.

Analysis of experimental results

In this section we present the results obtained from our experiments. We compare them mainly in terms of the execution overhead that each evaluated rejuvenation technique causes on the target application, especially on the client side. We also investigate the effectiveness of the rejuvenation execution on the server side in terms of memory consumption and fragmentation. Note that the values presented in this section are averages over five replications for each experiment.

Rejuvenation scheduling

To enhance availability and mitigate software aging effects, it is critical to design rejuvenation maintenance policies. Two main rejuvenation scheduling approaches can be defined: time-based and inspection-based.

Time-based approaches trigger rejuvenation at predetermined points in time. If, based on monitoring or root cause detection techniques, we are able to determine the state of various aging indicators, then we can apply rejuvenation on the exact aged layer

Conclusions

This paper presents an experimental evaluation of six rejuvenation strategies categorized in terms of granularity. We have conducted eight experiments with virtualized and non-virtualized environments in order to quantify the influence of this technology on the overhead of the rejuvenation strategies under consideration. The results show that the overhead impact of the rejuvenation techniques is related to their granularity. Fine-grain techniques such as application-level rejuvenation

Acknowledgments

This research was supported in part by the NASA Office of Safety and Mission Assurance (OSMA) Software Assurance Research Program (SARP) under the JPL subcontract # 1440119. We also thank CNPq (National Research Council of Brazil) for the financial support. We also thank Daniel Tes for his help during the test bed setup.

References (47)

  • D. Wang et al., Performability analysis of clustered systems with rejuvenation under varying workload, Performance Evaluation (2007)
  • W. Xie et al., Analysis of a two-level software rejuvenation policy, Reliability Engineering & System Safety (2005)
  • Y. Huang, C. Kintala, N. Kolettis, N. Fulton, Software rejuvenation: analysis, module and applications, in: The 15th...
  • M. Grottke, R. Matias, K. Trivedi, The fundamentals of software aging, in: The 1st Intl. Workshop on Software Aging and...
  • A. Avizienis et al., Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing (2004)
  • K. Vaidyanathan et al., A comprehensive model for software rejuvenation, IEEE Transactions on Dependable and Secure Computing (2005)
  • J. Alonso et al., High-available grid services through the use of virtualized clustering
  • L. Silva et al., Using virtualization to improve software rejuvenation, IEEE Transactions on Computers (2009)
  • R. Matias et al., An experimental study on software aging and rejuvenation in web servers
  • A. Andrzejak, L. Silva, Using machine learning for non-intrusive modeling and prediction of software aging, in:...
  • J. Alonso, J. Torres, J. Berral, R. Gavalda, Adaptive on-line software aging prediction based on machine learning, in:...
  • K. Vaidyanathan et al., Analysis and implementation of software rejuvenation in cluster systems
  • R. Matos et al., Software rejuvenation in eucalyptus cloud computing infrastructure: a hybrid method based on multiple thresholds and time series prediction, International Transactions on Systems Science and Applications (2012)
  • M. Grottke et al., Analysis of software aging in a web server, IEEE Transactions on Reliability (2006)
  • V. Castelli et al., Proactive management of software aging, IBM Journal of Research and Development (2001)
  • L.M. Silva et al., Using virtualization to improve software rejuvenation
  • T. Dohi et al., Statistical non-parametric algorithms to estimate the optimal software rejuvenation schedule
  • A.T. Tai et al., On-board preventive maintenance: analysis of effectiveness and optimal duty period
  • S. Garg et al., A methodology for detection and estimation of software aging
  • C.M. Kintala, Software rejuvenation in embedded systems, Journal of Automata, Languages and Combinatorics (2009)
  • K. Kourai et al., Fast software rejuvenation of virtual machine monitors, IEEE Transactions on Dependable and Secure Computing (2011)
  • J. Alonso, R. Matias, E. Vicente, A.M. Carvalho, K. Trivedi, A comparative evaluation of software rejuvenation...
  • R. Matias, I. Beicker, B. Leitao, P. Maciel, Measuring software aging effects through OS kernel instrumentation, in:...
Javier Alonso received the master’s degree in Computer Science in 2004 and the Ph.D. degree in 2011, both from the Technical University of Catalonia (Universitat Politecnica de Catalunya, UPC). From 2006 to 2011 he was an assistant lecturer at the Computer Architecture Department at UPC. He is currently a postdoctoral associate under the supervision of Professor K.S. Trivedi, at Duke University, Durham, NC. Dr. Alonso has served as a reviewer for IEEE TRANSACTIONS ON COMPUTERS, PERFORMANCE EVALUATION, and several international conferences. His research interests are in dependability, reliability, availability, performance, performability and survivability modeling of computer and communication systems.

Rivalino Matias, Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in Computer Science, and Industrial and Systems engineering from the Federal University of Santa Catarina, Brazil, respectively. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of Dr. Kishor Trivedi. He also works for IBM Research Triangle Park on research related to embedded system availability and reliability analytical modeling. He is currently an Associate Professor in the Computer School at Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF SYSTEMS AND SOFTWARE, and several international conferences. His research interests include dependability applied to computing systems, software aging theory, and diagnosis protocols for computing systems.

Elder Vicente received his B.S. (2007) in Control and Automation Engineering from the Polytechnic School of Uberlândia, Brazil. He earned his M.S. (2012) degree in Computer Science from the Federal University of Uberlândia, Brazil. His recent research interests include techniques for optimizing and evaluating the performance of operating systems.

Ana Maria Martins Carvalho received her B.S. (1998) in Informatics from Uberaba University, Brazil. She earned her M.S. (2012) degree in Computer Science from Federal University of Uberlândia, Brazil. Her recent research interests include statistical techniques applied to computing systems and computer network traffic analysis.

Kishor S. Trivedi (M’86–SM’87–F’92) holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books entitled Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers; and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers. He is a Golden Core Member of IEEE Computer Society. He has published over 420 articles, and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, JOURNAL OF RISK AND RELIABILITY, INTERNATIONAL JOURNAL OF PERFORMABILITY ENGINEERING, and INTERNATIONAL JOURNAL OF QUALITY AND SAFETY ENGINEERING. He is the recipient of IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation.

His research interests are in reliability, availability, performance, performability and survivability modeling of computer and communication systems. He works closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability, performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.