Ensuring system performance for cluster and single server systems

https://doi.org/10.1016/j.jss.2006.07.020

Abstract

A new approach that is useful in identifying and eliminating performance degradation occurring in aging software is proposed. A customer-affecting metric is used to initiate the restoration of such a system to full capacity. A case study is described in which, by simulating an industrial software system, we are able to show that by monitoring a customer-affecting metric and frequently comparing its degradation to the performance objective, we can ensure system stability at a very low cost.

Introduction

Large industrial software systems require extensive monitoring and management to deliver acceptable performance and reliability. Software performance degradation may occur due to increased user activity that results in increased queueing for existing system resources. In such cases, quality of service (QoS) enforcement algorithms are a useful tool to protect system resources from overload conditions. The Strong quality of service enforcement algorithm for distributed objects is one such algorithm. It is designed to deny service requests any time a client violates the previously agreed-upon quality of service agreement, even if the system is not near capacity. A detailed description of this algorithm can be found in Avritzer and Weyuker (1999).
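To make the idea of strong enforcement concrete, the following is a minimal Python sketch and not the algorithm of Avritzer and Weyuker (1999). It assumes the agreement can be expressed as a maximum number of requests per time window; the class name, the `max_requests` and `window_seconds` parameters, and the `system_load` argument are all hypothetical. The point it illustrates is that a violating client is denied regardless of how lightly loaded the system is.

```python
from collections import defaultdict, deque


class StrongQosEnforcer:
    """Deny requests from any client that exceeds its agreed request rate."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests      # hypothetical agreed quota
        self.window_seconds = window_seconds  # hypothetical agreement window
        self._history = defaultdict(deque)    # client id -> admitted timestamps

    def admit(self, client_id: str, now: float, system_load: float = 0.0) -> bool:
        # system_load is deliberately ignored: enforcement is "strong", so a
        # violating client is denied even when the system has spare capacity.
        recent = self._history[client_id]
        # Forget admitted requests that have left the agreement window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Timestamps are passed in explicitly so that the sketch can be driven by a simulator clock as easily as by a wall clock.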

Software performance degradation may also occur as a result of software failures. Soft failures are those that leave the system in a degraded mode, where the system is still operational, but the available system capacity has been greatly reduced. Examples of soft failures have been documented in several software studies (Avritzer and Larson, 1993; Avritzer and Weyuker, 1995; Avritzer et al., 1996; Avritzer and Weyuker, 1997; Avritzer et al., 2004). The term software aging is sometimes used to describe the case in which the state of the software degrades over time. This sort of degradation has been observed in widely used software and is described in Trivedi et al. (2000) and Castelli et al. (2001).

Our experience has been that soft failures often occur as a result of problems with synchronization mechanisms such as semaphores, kernel structures including file table allocations, locking deadlocks occurring in database management systems, and other resource allocation mechanisms that are essential to the proper operation of large multi-layer distributed systems. Since some of these resources are designed to have self-healing mechanisms such as timeouts, systems sometimes recover from soft failures after a period of time without any preemptive actions being taken. For example, for the Java-based e-commerce system used in our case studies, users had been complaining of very slow response time for periods exceeding one hour, after which the problem would clear by itself. A detailed report of the diagnosis of that soft failure can be found in Avritzer and Weyuker (2004).

One interesting characteristic of this soft failure was that it was related to the interaction between several queues in this multi-layered distributed system for which extensive back-end server instrumentation was in place. While our data showed only short spikes in the ready-to-run queues, as well as in the back-end server utilizations, users were nonetheless complaining of extremely long delays. Careful data analysis did not show any shortage of available resources. This experience motivated us to develop a new algorithm to identify performance degradation in aging software that is dependent on the response time. For this case study, we will refer to response time as the customer-affecting metric. Other customer-affecting metrics, such as the time between job completions or the relative frequency of blocking, could also be used.

We expect that monitoring customer-affecting metrics will be an excellent way of tracking the customer experience, but ultimately, monitoring tools have to track system resource usage as well in order to be able to perform efficient restoration of the software state when failures occur. This restoration process is sometimes known as software rejuvenation (Huang et al., 1995) and involves restoring a software system to full capacity through the deletion of all system transactions and the release of all resources associated with these transactions.

We therefore propose a new way of identifying performance degradation in aging software resulting from soft failures by tracking the values of a customer-affecting metric and using it to trigger rejuvenation when required. Our approach relies on the instrumentation and collection of required metrics and initiates rejuvenation when the customer-affecting metric repeatedly exceeds the specified performance requirement. The focus of this paper is the analysis of the effectiveness of this approach for single server systems and for cluster systems. Computer systems are usually configured as clusters of hosts when it is necessary to provide for enhanced performance and availability through resource sharing and fail-over (Broadwell et al., 2004). A multi-tier distributed e-commerce system is used in the case studies described in this paper. It is an example of a system that was subject to capacity degradation due to software aging.
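One way to make "repeatedly exceeds" concrete is a bucket-based counter. The Python sketch below is an illustration under stated assumptions rather than the exact algorithm evaluated in this paper: it uses K buckets of depth D (the parameter names that appear in the cluster section), compares each sample against a fixed target, and lets samples that meet the target drain the buckets again. The class and method names are hypothetical.

```python
class BucketRejuvenationTrigger:
    """Signal rejuvenation after sustained violations of a response-time target."""

    def __init__(self, target: float, num_buckets: int, bucket_depth: int):
        self.target = target              # performance requirement (assumed fixed)
        self.num_buckets = num_buckets    # K
        self.bucket_depth = bucket_depth  # D
        self.bucket = 0                   # index of the bucket being filled
        self.fill = 0                     # level of the current bucket

    def observe(self, response_time: float) -> bool:
        """Record one sample; return True when rejuvenation should start."""
        # A sample above the target adds to the bucket, any other removes.
        self.fill += 1 if response_time > self.target else -1

        if self.fill > self.bucket_depth:
            # Current bucket overflowed: move on to the next bucket.
            self.fill = 0
            self.bucket += 1
        elif self.fill < 0:
            if self.bucket > 0:
                # Current bucket drained: fall back to the previous, full bucket.
                self.bucket -= 1
                self.fill = self.bucket_depth
            else:
                self.fill = 0

        if self.bucket >= self.num_buckets:
            # Every bucket overflowed: trigger and start counting afresh.
            self.bucket = 0
            self.fill = 0
            return True
        return False
```

With this structure a short burst of slow responses is absorbed by the first bucket, while degradation that persists across all K buckets triggers rejuvenation. Larger values of K and D make the trigger more tolerant of bursts at the price of slower detection.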

We are interested in managing the response time of systems composed of several architectural components. Some of the key components that may require rejuvenation when the customer-affecting metric exceeds its target include:

  • The back-end application server running a virtual machine.

  • The back-end operating system.

  • The database.

The outline of our paper is as follows. Section 2 provides a survey of related work. In Section 3 we present our model of software rejuvenation, which has been designed to identify performance degradation using customer-affecting metrics. Section 4 presents simulation results for an e-commerce system using our rejuvenation algorithm. Section 5 contains the analysis of the simulation results. In Section 6 we present extensions of our software rejuvenation approach to clusters of hosts. Section 7 contains our conclusions and suggestions for future research.

Related work

One approach to restoring software to full capacity was described in Avritzer and Weyuker (1997). This approach, designed specifically for telecommunications systems, took advantage of the cyclical nature of telecommunications traffic. Telecommunications operating companies collect detailed data on the traffic patterns in their networks, and therefore can plan to restore their systems to full capacity after a system has degraded. This is analogous to other planned maintenance activities

The system model

The system model used to generate the experimental results is a slightly expanded version of the one used in Avritzer and Weyuker (2004).

  • (i) Whenever a thread arrives at the Java virtual machine (JVM), a new thread arrival is scheduled with an exponentially distributed interarrival time, and the number of active threads is incremented by one.

  • (ii) Using an exponential distribution, a running time is selected for the thread, according to the specified CPU processing time.

  • (iii) If the number of threads executing
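Steps (i) and (ii) can be sketched as a small simulation in Python. This is a minimal illustration, not the simulator used for the experiments. The function name and the `arrival_rate`, `service_rate`, and `num_cpus` parameters are hypothetical, and the rule that a thread waits in arrival order for the first free CPU is an assumption of this sketch, since step (iii) is not reproduced here. Garbage collection and kernel overhead are left out.

```python
import heapq
import random


def simulate_response_times(arrival_rate, service_rate, num_cpus,
                            num_threads, seed=0):
    """Return the response time of each thread in a simple multi-CPU model."""
    rng = random.Random(seed)
    # Min-heap holding the time at which each CPU next becomes free.
    cpu_free_at = [0.0] * num_cpus
    heapq.heapify(cpu_free_at)

    response_times = []
    arrival_time = 0.0
    for _ in range(num_threads):
        # Step (i): exponentially distributed interarrival time.
        arrival_time += rng.expovariate(arrival_rate)
        # Step (ii): exponentially distributed running time.
        running_time = rng.expovariate(service_rate)
        # Assumption: threads are served in arrival order by the first free CPU.
        earliest_free = heapq.heappop(cpu_free_at)
        start_time = max(arrival_time, earliest_free)
        finish_time = start_time + running_time
        heapq.heappush(cpu_free_at, finish_time)
        response_times.append(finish_time - arrival_time)
    return response_times
```

Each response time is the waiting time plus the running time, so it can be fed directly to a monitor of the customer-affecting metric.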

Experimental results for single server systems

The experiments we describe in this section evaluate the impact of using our software rejuvenation approach to guarantee that the performance requirement will be met. In each experiment we are comparing two or more situations.

The system under study is an e-commerce system with 16 CPUs running a JVM. We have built a discrete event simulator to evaluate the system performance, and we have run each experiment for 500,000 transactions, divided into 5 replications to compute the 95% confidence
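A 95% confidence interval from five replications can be computed as in the Python sketch below. It assumes each replication is summarized by one mean value and that the interval uses the Student t distribution with four degrees of freedom, whose two-sided 95% critical value is 2.776. The function name is hypothetical, and the snippet does not say how the paper's intervals were computed.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom
# (5 replications minus 1).
T_CRITICAL_95_DF4 = 2.776


def confidence_interval_95(replication_means):
    """Return (low, high) for the mean of exactly five replication means."""
    if len(replication_means) != 5:
        raise ValueError("this sketch hard-codes the t value for 5 replications")
    mean = statistics.mean(replication_means)
    # Standard error of the mean, based on the sample standard deviation.
    standard_error = statistics.stdev(replication_means) / math.sqrt(5)
    half_width = T_CRITICAL_95_DF4 * standard_error
    return mean - half_width, mean + half_width
```

With 500,000 transactions split into 5 replications, each replication mean would summarize 100,000 transactions.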

Analysis of single server experiments

In Section 4 we presented the results of four experiments that were designed to evaluate the performance of an e-commerce system, when software rejuvenation was used to control the value of a customer-affecting metric. The system model included multi-processing, garbage collection and kernel overhead.

For the e-commerce system under study, performance was impacted by the variability introduced by garbage collection events and by the overhead added to the processing time when kernel overhead was

Extensions for cluster systems

In this section we evaluate the applicability of our dynamic rejuvenation algorithm to clusters of identical systems. We also introduce and evaluate two extensions to our dynamic rejuvenation algorithm:

  • Dynamic rejuvenation algorithm for M-host clusters with non-overlapping rejuvenation events.

  • Dynamic rejuvenation algorithm with linear selection of the number of buckets, K, or the depth of the buckets, D.
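The non-overlapping constraint in the first extension can be illustrated with a small coordinator. The Python sketch below is only one possible arrangement and not the mechanism evaluated in this section: it assumes a single coordinator that all M hosts consult and a policy of refusing a request while another host is rejuvenating. The class and method names are hypothetical.

```python
class ClusterRejuvenationCoordinator:
    """Allow at most one host of an M-host cluster to rejuvenate at a time."""

    def __init__(self, num_hosts: int):
        self.num_hosts = num_hosts     # M
        self.rejuvenating_host = None  # host currently rejuvenating, if any

    def request(self, host_id: int) -> bool:
        """Return True if host_id may start rejuvenating now."""
        if not 0 <= host_id < self.num_hosts:
            raise ValueError("unknown host")
        if self.rejuvenating_host is None:
            self.rejuvenating_host = host_id
            return True
        # Another rejuvenation is in progress: refuse so that events never overlap.
        return False

    def complete(self, host_id: int) -> None:
        """Mark the rejuvenation of host_id as finished."""
        if self.rejuvenating_host == host_id:
            self.rejuvenating_host = None
```

A host whose request is refused keeps serving traffic and can ask again later, so the cluster never has more than one host out of service because of rejuvenation.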

Two-host clusters are usually deployed as hot spares in a high availability configuration in

Conclusions

We have evaluated the performance of two versions of a new software rejuvenation algorithm by building a discrete event simulator and by running several simulation studies. We have evaluated the use of customer-affecting metrics as a tool for identifying performance problems that are not easily diagnosed by analyzing system resource usage. We have found that software rejuvenation based on tracking customer-affecting metrics can be a very powerful technique that can quickly detect software aging

References (19)

  • A. Bobbio et al. Fine grained software degradation models for optimal rejuvenation policies. Performance Evaluation (2001)
  • A. Avritzer et al. Load testing software using deterministic state testing
  • A. Avritzer et al. The automatic generation of load test suites and the assessment of the resulting software. IEEE Transactions on Software Engineering (1995)
  • A. Avritzer et al. Monitoring smoothly degrading systems for increased dependability. Empirical Software Engineering (1997)
  • A. Avritzer et al. Quality of service enforcement for distributed objects. IEE Proceedings - Software (1999)
  • A. Avritzer et al. The role of modeling in the performance testing of e-commerce applications. IEEE Transactions on Software Engineering (2004)
  • A. Avritzer et al. Reliability testing of rule-based systems. IEEE Software (1996)
  • Avritzer, A., Ros, J., Weyuker, E.J., 2004. Estimating the CPU utilization of a rule-based system. In: ACM Fourth...
  • Broadwell, P.M., 2004. Response time as a performability metric for online services. UC Berkeley Computer Science...

Alberto Avritzer received a Ph.D. in Computer Science from the University of California, Los Angeles, an M.Sc. in Computer Science from the Federal University of Minas Gerais, Brazil, and a B.Sc. in Computer Engineering from the Technion, Israel Institute of Technology. He is currently a member of the technical staff in the Software Engineering Department at Siemens Corporate Research, Princeton, New Jersey. Before moving to Siemens Corporate Research, he spent 13 years at AT&T Bell Laboratories, where he developed tools and techniques for performance testing and analysis. He spent the summer of 1987 at IBM Research, at Yorktown Heights. His research interests are in software engineering, particularly software testing, monitoring and rejuvenation of smoothly degrading systems, and metrics to assess software architecture, and he has published over 40 papers in journals and refereed conference proceedings in those areas. He is a member of ACM SIGSOFT and IEEE.

André Bondi is a performance engineer in the Software Engineering Department at Siemens Corporate Research, Inc. in Princeton, New Jersey. He has worked on performance issues in several domains, including telecommunications, conveyor systems, finance systems, building surveillance, and network management systems. Just prior to joining Siemens, he held senior performance positions at two startup companies. Before that, he spent more than ten years working on a variety of performance and operational issues at AT&T Labs and its predecessor, Bell Labs, in New Jersey. Before joining Bell Labs, he was an Assistant Professor of Computer Science at the University of California, Santa Barbara for three years. He holds a Ph.D. in computer science from Purdue University and an M.Sc. in statistics from University College London.

Elaine Weyuker received a Ph.D. in Computer Science from Rutgers University, and an M.S.E. from the University of Pennsylvania. She is currently an AT&T Fellow, conducts research in software testing and metrics, and has published more than 130 papers in journals and refereed conference proceedings. She is also interested in the theory of computation, and is the author of two editions of a book (with Martin Davis and Ron Sigal), “Computability, Complexity, and Languages”, published by Academic Press. Prior to moving to AT&T Labs, she was on the faculty of the Courant Institute of Mathematical Sciences of New York University, a faculty member at the City University of New York, a systems engineer at IBM, and a programmer at Texaco, Inc.

She was elected to the National Academy of Engineering, is an IEEE Fellow, an ACM Fellow, and an AT&T Fellow. She was the 2004 recipient of the IEEE Harlan D. Mills award, a recipient of the YWCA Woman of Achievement Award, and was named the Outstanding Alumni at the Rutgers University 50th Anniversary celebration. She was also the recipient of the AT&T Chairman’s award for her mentoring activities and efforts to foster diversity.

She is the co-chair of the ACM-W committee, a member of the executive committee of the Coalition to Diversify Computing, a member of the Rutgers University Graduate School Advisory Board, and was a member of the Board of Directors of the Computing Research Association (CRA). She is a member of the editorial boards of IEEE Transactions on Software Engineering, IEEE Transactions on Dependable and Secure Computing, IEEE Spectrum, the Empirical Software Engineering Journal and the Journal of Systems and Software, and was a founding editor of the ACM Transactions on Software Engineering and Methodology.
