Failure patterns in operating systems: An exploratory and observational study
Introduction
An extensive review of the software reliability literature shows that most of the studies in this area have concentrated on the reliability of user applications. Note that sophisticated applications need to run on top of operating system (OS) software; thus, OS failures can severely impact even the most reliable applications.
Some previous studies have investigated failures in OS software (Chou et al., 2001, Ganapathi and Patterson, 2005, Ganapathi et al., 2006, Kalyanakrishnam et al., 1999, Swift et al., 2003, Xu et al., 1999); however, none of them have yet approached the problem of discovering and characterizing OS failure patterns. In the context of this paper, OS failure patterns are defined as combinations of systematically repeated OS failure events in several computers analyzed at different workplaces. It is important to highlight that discovering patterns in OS failures is relevant from different perspectives.
For instance, studies on software reliability modeling regularly assume that OS failures occur independently (Smith et al., 2008, Trivedi et al., 2008); this assumption may produce inaccurate results in scenarios where failures are correlated. In these cases, failure patterns help to determine whether a failure is independent; even the existence of causal relationships can be assessed, which is of major importance for analytical and experimental research in this field. From the OS manufacturers’ perspective, analysis of OS failure patterns is valuable in identifying the most recurrent failures in the product. Similarly, for system administrators, OS failure patterns help in planning preventive actions, such as turning off a given service that is known to be strongly correlated with a critical failure; this is possible through temporal analyses of the recurrent failure combinations found in the discovered patterns.
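To make the definition of an OS failure pattern given above more concrete, the following minimal Python sketch counts combinations of consecutive failure events that recur on several computers across several workplaces. It is an illustration only and not the protocol of Section 4.2: the record layout `(workplace, computer, timestamp, failure_type)`, the function name `candidate_patterns`, and the thresholds `min_computers` and `min_workplaces` are all assumptions introduced for this example.

```python
from collections import defaultdict


def candidate_patterns(records, length=2, min_computers=2, min_workplaces=2):
    """Return {combination: occurrences} for recurring failure combinations.

    records: iterable of (workplace, computer, timestamp, failure_type).
    The layout and the thresholds are hypothetical, chosen for illustration.
    """
    # Group the failure events of each computer, keeping its workplace.
    by_computer = defaultdict(list)
    for workplace, computer, ts, ftype in records:
        by_computer[(workplace, computer)].append((ts, ftype))

    occurrences = defaultdict(int)
    computers = defaultdict(set)
    workplaces = defaultdict(set)
    for (workplace, computer), events in by_computer.items():
        events.sort()  # temporal order of the failure occurrences
        types = [ftype for _, ftype in events]
        # Slide a window over the ordered events to form combinations.
        for i in range(len(types) - length + 1):
            combo = tuple(types[i:i + length])
            occurrences[combo] += 1
            computers[combo].add((workplace, computer))
            workplaces[combo].add(workplace)

    # Keep only combinations repeated across computers and workplaces.
    return {
        combo: count
        for combo, count in occurrences.items()
        if len(computers[combo]) >= min_computers
        and len(workplaces[combo]) >= min_workplaces
    }
```

With the default thresholds, a combination is reported only if it was observed on at least two computers located in at least two workplaces; combinations seen on a single machine are treated as local events rather than candidate patterns.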
In this paper, we investigated the existence of patterns in OS failures. First, we answered the question: Do OS failures follow patterns? We found robust evidence of the existence of different patterns in the OS failure dataset analyzed, so subsequently we addressed questions such as “Are these patterns consistent across computers from different workplaces?” “Which OS components are dominant contributors to these patterns?” “Which failures are recurrent in the discovered patterns?” “Are these failures time-correlated?”
To answer these and other research questions we analyzed 7007 OS failure records collected from different desktop computers. We searched for individual and combined failure events that occurred consistently in multiple computers from different workplaces. To achieve this goal, we introduced and used a protocol to discover OS failure patterns. This new protocol required the definition of original concepts that are described in the taxonomy (Section 4.1) created for this specific purpose. Our method for OS failure pattern discovery (Section 4.2) can be applied to failure data of any operating system of interest, although the empirical findings (Section 5) discussed in this paper are specific to the OS platform investigated as the case study. Our main findings include:
- Discovered 45 OS failure patterns with 153,511 occurrences;
- Failures in OS services were the most prevalent;
- Software updates applied to OS components are one of the dominant contributors to failure patterns;
- Found empirical evidence of OS failure correlation, particularly autocorrelation.
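As a rough illustration of what autocorrelation means for failure data, the sketch below computes the lag-k sample autocorrelation of a series of failure counts. It assumes the failures have already been binned into equally spaced intervals (for example, hypothetical daily counts for one computer), which may differ from how the study handles event timing; the function name `autocorrelation` is introduced here for the example only.

```python
def autocorrelation(series, lag=1):
    """Lag-k sample autocorrelation of an evenly spaced series.

    series: list of numbers, e.g. hypothetical failure counts per day.
    """
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Sum of squared deviations over the whole series (the denominator).
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant series: no variation to correlate
    # Sum of products of deviations that are `lag` steps apart.
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var
```

Values near zero are what one would expect if failures in one interval carried no information about later intervals; values clearly away from zero suggest that the independence assumption discussed earlier does not hold for that series.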
To the best of our knowledge, this study is the first to propose a protocol to discover failure patterns in operating systems. Understanding software failure patterns is valuable from the following different theoretical and practical perspectives:
- Software dependability can be improved by knowing the patterns of failures and their correlations, which is key for early failure detection and thus for creating more effective fault-tolerant mechanisms.
- Software testing can benefit by directing resources toward software components with lower reliability and by focusing on the most prevalent and impactful failure causes.
- System modeling can offer more realistic and precise analytical models if it takes into consideration the possible failure interactions uncovered by the patterns.
The remainder of the paper is organized as follows. Section 2 presents the semantics of OS failures as considered in this study. Section 3 describes the OS failure dataset used in our empirical study. Section 4 explains the proposed protocol applied to discover the OS failure patterns presented here. Section 5 discusses the most relevant results of the study. Section 6 presents the threats to internal and external validity. Section 7 discusses related works. Finally, Section 8 closes with our conclusion and plan for extending this research.
Section snippets
OS failures
In this study, we followed the same definition of failure presented in Avižienis et al. (2004), in which a failure is characterized as an event that occurs when the delivered service of a system/subsystem deviates from the correct service. The correct service is delivered when the service implements the system/subsystem function, in other words, the delivered service is exactly what the system/subsystem is intended to do. Therefore, a failure occurs because the service's output does not comply
Material
In this section, we present the dataset of OS failures analyzed in this study. We first describe how we collected and categorized the failure records (Section 3.1), and then we present a general characterization of the failure dataset (Section 3.2). Next, we explain the dataset-stratification approach we adopted to identify the most frequently observed types and subtypes of OS failures (Section 3.3); these types and subtypes are used throughout this study.
Method
In this section, we present the protocol created and used in this study to discover and characterize the OS failure patterns. This protocol discovers patterns using the temporal order of the OS failure occurrences recorded as time series. According to Volna et al. (2016), time series data occur naturally in different application areas, such as economics, environmental modeling, and demographics, among others. Therefore, identifying patterns in time series is considered an important problem
Results and discussion
In this section, we present the results obtained by applying the failure pattern discovery protocol introduced in Section 4 to the OS failure dataset described in Section 3.
Threats to validity
Like any empirical research work, this study has limitations that must be considered when interpreting its results. In this subsection, we highlight these validity threats (Siegmund et al., 2015) and the strategies adopted to mitigate them.
Related works
Ganapathi et al. (2006) found that the Windows OS was not responsible for the majority of the system crashes. They analyzed 2528 occurrences of Windows XP kernel crashes, collected from 617 computers, and concluded that most of the crashes were caused by poorly written third-party device driver code. Similarly, Swift et al. (2003) reported that OS failures were mainly caused by malfunctions in OS kernel extensions (e.g., device drivers), and concluded that third-party kernel extensions
Conclusion
First, we highlight that OS Kernel failures represented only 17.34% of all OS failures observed in our dataset, and the most prevalent OS failures were related to OS services (60.63%). This led us to conclude that software reliability studies focused on operating systems should not restrict their analyses to OS Kernel failures, as has been the norm in the literature.
Another finding observed in this work and that can influence empirical studies in this area is related to the sampling
Caio Augusto Rodrigues dos Santos received his B.S. (2013) and M.S. (2016) degrees in computer science from the Federal University of Uberlandia (UFU). He is a Ph.D. candidate in computer science at UFU, under the supervision of Dr. Rivalino Matias Jr. His research interests include empirical software engineering and operating systems.
References (49)
- et al. Estimating betas from nonsynchronous data. J. Financ. Econ. (1977)
- et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. (2004)
- Computer usage in daily life
- et al. Extrinsic influence factors in software reliability: a study of 200,000 Windows machines
- Windows 7 Bible (2009)
- et al. An empirical study of operating systems errors
- et al. Modeling and analysis of correlated software failures of multiple types. IEEE Trans. Reliab. (2005)
- et al. Exploratory analysis on failure causes in a mass-market operating system. ACM SIGOPS Oper. Syst. Rev. (2016)
- A framework for the analysis of unevenly-spaced time series data
- Windows 7 dominates desktop, XP share slips
- Validity threats in empirical software engineering research - an initial survey
- Crash data collection: a Windows case study
- Windows XP kernel crash analysis
- Evaluating the significance of the Windows Explorer visualization in personal information management browsing tasks
- The effects of failure correlation on software reliability and performability
- Effects of failure correlation on software in operation
- Failure correlation in software reliability models. IEEE Trans. Reliab.
- On covariance estimation of non-synchronously observed diffusion processes. Bernoulli
- Failure data analysis of a LAN of Windows NT based computers
- Correlation of high frequency financial time series
- A Dictionary of Statistical Terms
- Operating system reliability from the quality of experience viewpoint: an exploratory study
- An empirical exploratory study on operating system reliability
- Web survey on operating system failures
Cited by (6)
- Software fault localization: An overview of research, techniques, and tools. 2023, Handbook of Software Fault Localization: Foundations and Advances
- A Statistical Approach to Predict Operating System Failures Based on Multiple Failures Association. 2020, Brazilian Symposium on Computing System Engineering, SBESC
- The effectiveness of sharing blended project based learning (SBPBL) model implementation in operating system course. 2020, International Journal of Emerging Technologies in Learning
- An Empirical Exploratory Analysis of Failure Sequences in a Commodity Operating System. 2019, Brazilian Symposium on Computing System Engineering, SBESC
- Influence factors on the quality of user experience in OS reliability: A qualitative experimental study. 2018, ACM International Conference Proceeding Series
- Reliability assessment of commercial off-the-shelf operating system software: An empirical study. 2018, Brazilian Symposium on Computing System Engineering, SBESC
Rivalino Matias Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in computer science, and industrial and systems engineering from the Federal University of Santa Catarina, Brazil, respectively. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate in NASA/JPL and IBM research projects, under the supervision of Dr. Kishor Trivedi. He is currently an associate professor in the School of Computer Science at the Federal University of Uberlandia, Brazil. Dr. Matias has served as a reviewer for several international scientific journals and conferences, as well as an ad hoc expert for government projects in Brazil, the USA, and the European Union. His research interests include computing dependability, software reliability, software aging and rejuvenation, and operating systems.