Failure patterns in operating systems: An exploratory and observational study

https://doi.org/10.1016/j.jss.2017.03.058

Highlights

  • A protocol to discover operating systems failure patterns is proposed.

  • Discovered 45 genuine failure patterns.

  • Failures in operating system services were the most prevalent.

  • Found empirical evidence of failure correlation (cross- and autocorrelation).

Abstract

Sophisticated critical computer applications need to run on top of operating system (OS) software. Given the intrinsic dependency of user applications on the OS software, OS failures can severely impact even the most reliable applications. Thus, it is essential to understand how OS failures occur in order to improve software reliability. In this paper, we present an exploratory and observational study on OS failure patterns. We analyze 7007 real OS failures collected from 566 computers used in different workplaces. We start with a general characterization of the failure dataset examined in this study, where interesting findings are presented, e.g., the most frequent failure types per period of the day and per workplace. Next, we investigate the existence of failure patterns. For this purpose, we introduce an OS failure pattern discovery protocol that identifies failure patterns exhibiting consistency across different computers used in the same as well as different workplaces. In total, we discovered 45 failure patterns with 153,511 occurrences. Based on these patterns, we found that the most prevalent failures were related to the software updates of the OS components. The main causes of these failures involved infrastructural and environmental factors such as disk-space unavailability and concurrent execution of OS services. Empirical evidence of time-correlated failures of these OS components is also discussed in this paper. Other findings include the OS components that contributed most to the discovered failure patterns and the most prevalent combinations of failure events and their temporal order. This study aims to contribute to a better understanding of the mechanisms behind OS failures.

Introduction

An extensive review of the software reliability literature shows that most of the studies in this area have concentrated on the reliability of user applications. Note that sophisticated applications need to run on top of operating system (OS) software; thus, OS failures can severely impact even the most reliable applications.

Some previous studies have investigated failures in OS software (Chou et al., 2001, Ganapathi and Patterson, 2005, Ganapathi et al., 2006, Kalyanakrishnam et al., 1999, Swift et al., 2003, Xu et al., 1999); however, none of them have yet approached the problem of discovering and characterizing OS failure patterns. In the context of this paper, OS failure patterns are defined as combinations of systematically repeated OS failure events in several computers analyzed at different workplaces. It is important to highlight that discovering patterns in OS failures is relevant from different perspectives.
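To make the definition above concrete, the following is a minimal Python sketch, not the authors' protocol (which is described in Section 4 and not reproduced here). It assumes a hypothetical record layout of `(workplace, computer, timestamp, failure_type)` tuples and treats a "pattern" simply as a run of consecutive failure types that recurs on several computers in several workplaces; the function name and the thresholds `min_computers` and `min_workplaces` are illustrative assumptions.

```python
from collections import defaultdict


def find_recurring_sequences(events, length=2, min_computers=2, min_workplaces=2):
    """Return failure-type sequences that recur across computers and workplaces.

    events: iterable of (workplace, computer, timestamp, failure_type) tuples.
    This record layout is an assumption made for illustration only.
    """
    # Group each computer's failures so they can be ordered in time.
    per_computer = defaultdict(list)
    for workplace, computer, timestamp, failure_type in events:
        per_computer[(workplace, computer)].append((timestamp, failure_type))

    # For every run of `length` consecutive failures, remember where it was seen.
    seen = defaultdict(lambda: (set(), set()))
    for (workplace, computer), records in per_computer.items():
        ordered = sorted(records, key=lambda record: record[0])
        types = [failure_type for _, failure_type in ordered]
        for i in range(len(types) - length + 1):
            sequence = tuple(types[i:i + length])
            computers, workplaces = seen[sequence]
            computers.add((workplace, computer))
            workplaces.add(workplace)

    # Keep only sequences observed on enough computers and workplaces.
    return {
        sequence: (len(computers), len(workplaces))
        for sequence, (computers, workplaces) in seen.items()
        if len(computers) >= min_computers and len(workplaces) >= min_workplaces
    }
```

Requiring a sequence to appear in more than one workplace is one simple way to express the "consistency across workplaces" idea; the actual criteria used in the study may differ.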

For instance, studies on software reliability modeling regularly assume that OS failures occur independently (Smith et al., 2008, Trivedi et al., 2008); this assumption may produce inaccurate results for scenarios wherein failures are correlated. In these cases, failure patterns help to discover whether a failure is independent; even the existence of causal relationships can be assessed, which is of major importance for analytical and experimental research in this field. From the OS manufacturers’ perspective, analysis of OS failure patterns is valuable in identifying the most recurrent failures in the product. Similarly, for system administrators, OS failure patterns help in planning preventive actions, such as turning off a given service that is known to be strongly correlated with a critical failure; this is possible through temporal analyses of recurrent failure combinations found in the discovered patterns.

In this paper, we investigated the existence of patterns in OS failures. First, we answered the question: Do OS failures follow patterns? We found robust evidence of the existence of different patterns in the OS failure dataset analyzed, so subsequently we addressed questions such as “Are these patterns consistent across computers from different workplaces?” “Which OS components are dominant contributors to these patterns?” “Which failures are recurrent in the discovered patterns?” and “Are these failures time-correlated?”

To answer these and other research questions we analyzed 7007 OS failure records collected from different desktop computers. We searched for individual and combined failure events that occurred consistently in multiple computers from different workplaces. To achieve this goal, we introduced and used a protocol to discover OS failure patterns. This new protocol required the definition of original concepts that are described in the taxonomy (Section 4.1) created for this specific purpose. Our method for OS failure pattern discovery (Section 4.2) can be applied to failure data of any operating system of interest, although the empirical findings (Section 5) discussed in this paper are specific to the OS platform investigated as the case study. Our main findings include:

  • Discovered 45 OS failure patterns with 153,511 occurrences;

  • Failures in OS services were the most prevalent;

  • Software updates applied to OS components are among the dominant contributors to failure patterns;

  • Found empirical evidence of OS failure correlation, particularly autocorrelation.
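As a rough illustration of what autocorrelation means for failure data, the sketch below computes the standard lag-k sample autocorrelation of a series of failure counts. It assumes the failures have already been binned into evenly spaced intervals (for example, counts per day); this is a simplifying assumption for illustration and not necessarily the estimator used in the study, whose failure records may be unevenly spaced in time. The function name is hypothetical.

```python
def autocorrelation(series, lag):
    """Lag-k sample autocorrelation of an evenly spaced series of counts."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    # Total squared deviation from the mean (the lag-0 term).
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Co-movement between each observation and the one `lag` steps later.
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator
```

Values near zero at every lag are consistent with the independence assumption, whereas clearly positive or negative values at some lag suggest that a failure's occurrence is related to its own earlier occurrences.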

To the best of our knowledge, this study is the first to propose a protocol to discover failure patterns in operating systems. Understanding software failure patterns is valuable from the following different theoretical and practical perspectives:

  • Software dependability can be improved by knowing the patterns of failures and their correlations, which is key for early failure detection and thus for creating more effective fault-tolerance mechanisms.

  • Software testing can benefit by optimizing resources to identify software components with lower reliability and by focusing on the most prevalent and impactful failure causes.

  • System modeling can offer more realistic and precise analytical models if it takes into consideration the possible failure interactions uncovered by the patterns.

The remainder of the paper is organized as follows. Section 2 presents the semantics of OS failures as considered in this study. Section 3 describes the OS failure dataset used in our empirical study. Section 4 explains the proposed protocol applied to discover the OS failure patterns presented here. Section 5 discusses the most relevant results of the study. Section 6 presents the threats to internal and external validity. Section 7 discusses related works. Finally, Section 8 closes with our conclusion and plan for extending this research.


OS failures

In this study, we followed the same definition of failure presented in Avižienis et al. (2004), in which a failure is characterized as an event that occurs when the delivered service of a system/subsystem deviates from the correct service. The correct service is delivered when the service implements the system/subsystem function, in other words, the delivered service is exactly what the system/subsystem is intended to do. Therefore, a failure occurs because the service's output does not comply

Material

In this section, we present the dataset of OS failures analyzed in this study. We first describe how we collected and categorized the failure records (Section 3.1), and then we present a general characterization of the failure dataset (Section 3.2). Next, we explain the dataset-stratification approach we adopted to identify the most frequently observed types and subtypes of OS failures (Section 3.3); these types and subtypes are used in this study.

Method

In this section, we present the protocol created and used in this study to discover and characterize the OS failure patterns. This protocol discovers patterns using the temporal order of the OS failure occurrences recorded as time series. According to Volna et al. (2016), time series data occur naturally in different application areas, such as economics, environmental modeling, and demographics, among others. Therefore, identifying patterns in time series is considered an important problem

Results and discussion

In this section, we present the results obtained by applying the failure pattern discovery protocol introduced in Section 4 to the OS failure dataset described in Section 3.

Threats to validity

Like any empirical research work, this study has limitations that must be considered when interpreting its results. In this subsection, we highlight these validity threats (Siegmund et al., 2015) and the strategies adopted to mitigate them.

Related works

Ganapathi et al. (2006) found that the Windows OS was not responsible for the majority of the system crashes. They analyzed 2528 occurrences of Windows XP kernel crashes, collected from 617 computers, and concluded that most of the crashes were caused by poorly written third-party device driver code. Similarly, Swift et al. (2003) reported that OS failures were mainly caused by malfunctions in OS kernel extensions (e.g., device drivers), and concluded that third-party kernel extensions

Conclusion

First, we highlight that OS Kernel failures represented only 17.34% of all OS failures observed in our dataset, and the most prevalent OS failures were related to OS services (60.63%). This led us to conclude that software reliability studies focused on operating systems should not restrict their analyses to OS Kernel failures, as has been the norm in the literature.

Another finding observed in this work and that can influence empirical studies in this area is related to the sampling

Caio Augusto Rodrigues dos Santos received his B.S. (2013) and M.S. (2016) degrees in computer science from the Federal University of Uberlandia (UFU). He is a Ph.D. candidate in computer science at UFU, under the supervision of Dr. Rivalino Matias Jr. His research interests include empirical software engineering and operating systems.

References (49)

  • M. Scholes et al.

    Estimating betas from nonsynchronous data

    J. Financ. Econ.

    (1977)
  • A. Avižienis et al.

    Basic concepts and taxonomy of dependable and secure computing

    IEEE Trans. Dependable Secure Comput.

    (2004)
  • T. Beauvisage

    Computer usage in daily life

  • C. Bird et al.

    Extrinsic influence factors in software reliability: a study of 200,000 Windows machines

  • J. Boyce

    Windows 7 Bible

    (2009)
  • A. Chou et al.

    An empirical study of operating systems errors

  • Y.S. Dai et al.

    Modeling and analysis of correlated software failures of multiple types

    IEEE Trans. Reliab.

    (2005)
  • C.A.R. Dos Santos et al.

    Exploratory analysis on failure causes in a mass-market operating system

    ACM SIGOPS Oper. Syst. Rev.

    (2016)
  • A. Eckner

    A framework for the analysis of unevenly-spaced time series data

  • M. Endler

    Windows 7 dominates desktop, XP share slips

  • R. Feldt et al.

    Validity threats in empirical software engineering research - an initial survey

  • A. Ganapathi et al.

    Crash data collection: a Windows case study

  • A. Ganapathi et al.

    Windows XP kernel crash analysis

  • M. Golemati et al.

    Evaluating the significance of the Windows Explorer visualization in personal information management browsing tasks

  • K. Goseva-Popstojanova et al.

    The effects of failure correlation on software reliability and performability

  • K. Goseva-Popstojanova et al.

    Effects of failure correlation on software in operation

  • K. Goseva-Popstojanova et al.

    Failure correlation in software reliability models

    IEEE Trans. Reliab.

    (2000)
  • T. Hayashi et al.

    On covariance estimation of non-synchronously observed diffusion processes

    Bernoulli

    (2005)
  • M. Kalyanakrishnam et al.

    Failure data analysis of a LAN of Windows NT based computers

  • M.C. Lundin et al.

    Correlation of high frequency financial time series

  • F.H. Marriott

    A Dictionary of Statistical Terms

    (1990)
  • R. Matias et al.

    Operating system reliability from the quality of experience viewpoint: an exploratory study

  • R. Matias et al.

    An empirical exploratory study on operating system reliability

  • R. Matias et al.

    Web survey on operating system failures

    (2010)

Rivalino Matias Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in computer science, and industrial and systems engineering from the Federal University of Santa Catarina, Brazil, respectively. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate in NASA/JPL and IBM research projects, under the supervision of Dr. Kishor Trivedi. He is currently an associate professor in the School of Computer Science at the Federal University of Uberlandia, Brazil. Dr. Matias has served as a reviewer for several scientific international journals and international conferences, as well as an ad hoc expert for government projects in Brazil, the USA, and the European Union. His research interests include computing dependability, software reliability, software aging and rejuvenation, and operating systems.
