Reliability evaluation of hard disk drive failures based on counting processes
Highlights
► Develop a reliability evaluation framework for HDDs based on NHPPs.
► Use the framework to explain why traditional ALT methods fail to analyze the data.
► Develop statistical inference and hypothesis testing approaches.
► Apply the framework to a real HDD problem.
► Conduct a simulation to examine the normal approximation used for hypothesis testing.
Introduction
Hard disk drives (HDDs) are probably the most important data storage devices in modern computing systems. Failure of an HDD often imposes high costs on users through the loss of important data. Therefore, HDDs are usually well designed to achieve extremely high reliability. According to Schroeder and Gibson [12], the mean time to failure of HDDs ranges from 1,000,000 to 1,500,000 h. A report by Samsung® [11] also states that the mean time between failures of their HDDs is as high as 1,200,000 h, which corresponds to an annual failure rate of about 0.7% if an exponential distribution is assumed. On the other hand, the rapid evolution of HDD manufacturing techniques imposes a stringent time constraint on the development of HDDs of the latest vintage. Typically, the time allowed for the production process ranges from 3 to 9 months and covers concept development, design, pilot manufacturing, and life testing [15]. Therefore, the time allowed for the life test of a new design is often very short, despite the high reliability of the new vintage. How to carry out the life test within a reasonable time frame, and how to analyze the resulting test data, have become important issues for HDD reliability engineers.
Accelerated life tests (ALTs) provide reliability engineers with a quick way to obtain lifetime information. During an ALT experiment, all test samples are subjected to elevated stresses in order to hasten the failure process [9]. Failure data from the ALT are collected and analyzed, and the results are then extrapolated to normal use conditions with a view to predicting lifetime characteristics of interest. Success of an ALT depends largely on correct specification of the stresses, also known as accelerating factors, for the experiment. Choosing the right accelerating factors requires a good understanding of the failure mechanisms of the HDDs. It has been found that particle-induced failures are the most prevalent failure mechanism during the service life of an HDD. Particles accumulate in the disks over time and induce a variety of failure modes, e.g., circumferential scratches, head media crashes and scratches along different head-disk interface directions (e.g., [7], [13], [20]). This motivated Tang et al. [16] and Tang et al. [15] to use injected particles as the accelerating factor and to test samples under different particle injection rates. However, they found that conventional methods for ALT data analysis, e.g., the log-location-scale regression model [5], cannot fit the time-to-failure data well. Therefore, instead of using the traditional chronological time scale, they proposed the cumulative particle count (CPC) as the time scale. A failure occurs when the CPC exceeds a random threshold. This is analogous to using cumulative mileage as the time scale for a tire tread (e.g., [19]). Based on this accelerating factor and the new time scale, Tang et al. [15] developed a statistical modeling framework to analyze the test data. In their framework, a generalized non-linear regression model was used to fit the CPC paths and a best-fit distribution was chosen for the CPC-to-failure data, based on which the failure time distribution can be computed.
This framework is basically data-driven, and thus has several deficiencies. For example, it treats the CPC paths as deterministic and ignores the fact that the particle accumulation is indeed a random process, which can be readily seen from Fig. 3 in Section 5. In addition, it cannot explain why traditional methods do not work well for the ALT data of HDDs.
In this paper, we propose a process-based approach for analyzing the HDD testing data. This approach is based on the particle accumulation mechanism of the HDDs. Instead of the generalized non-linear model, we propose to use a counting process to model the CPC paths. This framework is depicted in Fig. 1.
To apply this framework, we shall analyze the physics of particle accumulation to specify which counting process to use. The CPC paths indicate that the rate of particle accumulation appears to decrease over time. Further examination of the HDD physics reveals that there are two sources of particles, i.e., an internal and an external source. The internal source is primarily due to loose particles trapped inside the drive during manufacture. These loose particles release gradually, accumulate in the disk, and dominate the particle accumulation process at first, which accounts for the high but decreasing accumulation rate observed in the initial stage of the CPC paths. The external source is due to particles from the ambient environment. Some dust particles from outside escape the air filter, land on the HDD platter, and accumulate in the disk. After almost all the loose particles from the internal source have released, the external source dominates the particle accumulation process. Under this circumstance, if the accumulation process of external particles is relatively stationary, the particle accumulation process becomes approximately linear, which explains the second phase of the CPC paths shown in Fig. 3 in Section 5. This physical phenomenon motivates us to use a counting process with internal and external sources. We assume the internal source has a finite but unknown number of loose particles, whose release times to the HDD disk are i.i.d., while particle arrivals from the external source follow a non-homogeneous Poisson process (NHPP). The NHPP assumption is reasonable, as will be justified in Section 2.1. In this study, we investigate properties of this counting process and use it to evaluate the reliability of the HDDs.
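As a quick sanity check on this two-source structure, one can simulate such a path directly. The Python sketch below (with illustrative parameter values only, not the paper's data) draws a Poisson number of internal loose particles with i.i.d. Weibull release times and superposes a homogeneous external Poisson stream:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cpc_path(mu, lam, release_dist, t_end):
    """Simulate one CPC path over (0, t_end].

    Internal source: M ~ Poisson(mu) loose particles with i.i.d.
    release times drawn from release_dist (releases after t_end are unseen).
    External source: homogeneous Poisson process with rate lam.
    Returns the sorted particle arrival times.
    """
    # Internal releases: keep only those occurring before t_end.
    m = rng.poisson(mu)
    internal = release_dist(m)
    internal = internal[internal <= t_end]

    # External arrivals: HPP on (0, t_end].
    n_ext = rng.poisson(lam * t_end)
    external = rng.uniform(0.0, t_end, size=n_ext)

    return np.sort(np.concatenate([internal, external]))

# Illustrative values: mu = 100 loose particles on average, Weibull
# release times with scale 100 and shape 1.3, external rate lam = 1.
path = simulate_cpc_path(
    mu=100, lam=1.0,
    release_dist=lambda m: 100.0 * rng.weibull(1.3, size=m),
    t_end=500.0,
)
print(len(path))  # total particle count N(500)
```

Plotting `path` against its index reproduces the qualitative two-phase shape described above: a steep initial rise from the internal releases, followed by a near-linear external phase.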
The rest of this paper is organized as follows. Properties of the counting process with two arrival sources are investigated in Section 2. It is found that when the number of loose particles in the internal source follows a Poisson distribution, the resulting internal particle release process is an NHPP, regardless of the release time distribution of these loose particles. Section 3 presents maximum likelihood (ML) estimation for this counting process. Statistical test methods are discussed in Section 4. The proposed model is used to fit an experimental dataset in Section 5. In Section 6, a simulation study is conducted to examine the appropriateness of the normal approximation when the sample size is moderate. Section 7 concludes the paper.
Section snippets
A counting process with two arrival sources
Let {N(t),t≥0} with N(0)=0 denote the particle accumulation process associated with an HDD. N(t) is the total number of particles that have accumulated in the disk by time t. Suppose that the arrivals are from two independent sources, i.e., the internal and external sources. Generally speaking, it is impossible to distinguish the source of an incoming particle. In this section, we study the properties of these two sources.
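The key distributional fact behind the two-source model (stated in the Introduction and established in this section) follows from a standard marked-Poisson thinning argument. A sketch, writing μ for the mean number of loose particles and G (with density g) for their common release-time distribution:

```latex
\begin{align*}
&\text{Let } M \sim \mathrm{Poisson}(\mu),\quad T_1, T_2, \ldots \overset{\text{i.i.d.}}{\sim} G,\quad
N_{\mathrm{int}}(t) = \sum_{j=1}^{M} \mathbf{1}\{T_j \le t\}. \\
&\text{Conditional on } M = m,\ N_{\mathrm{int}}(t) \sim \mathrm{Binomial}\bigl(m, G(t)\bigr),\ \text{so} \\
&\Pr\{N_{\mathrm{int}}(t) = n\}
  = \sum_{m \ge n} \frac{e^{-\mu}\mu^{m}}{m!}\binom{m}{n}G(t)^{n}\bigl(1-G(t)\bigr)^{m-n}
  = \frac{e^{-\mu G(t)}\bigl(\mu G(t)\bigr)^{n}}{n!}. \\
&\text{Hence } N_{\mathrm{int}}(t) \sim \mathrm{Poisson}\bigl(\mu G(t)\bigr); \text{ independence of increments follows similarly,} \\
&\text{so } \{N_{\mathrm{int}}(t)\} \text{ is an NHPP with rate } \mu g(t), \text{ regardless of the release-time distribution } G.
\end{align*}
```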
Inter-arrival data
Assume K HDDs are subject to test. Consider the particle arrival process {Ni(t),t≥0} for the ith HDD sample over the time interval (0,ti), where ti is a predetermined observation time. Let Si,1<Si,2<⋯<Si,ni be the observed arrival times. Because {Ni(t),t≥0} is an NHPP with arrival rate function μg(t)+λ(t), it is readily verified that Ni(ti)∼Poisson(μG(ti)+Λ(ti)) and P(Xi,n>x∣Si,n−1=s)=exp{−μ[G(s+x)−G(s)]−[Λ(s+x)−Λ(s)]}, where Xi,n=Si,n−Si,n−1, n≥1, is the nth inter-arrival time, G(t)=∫0^t g(u)du and Λ(t)=∫0^t λ(u)du.
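The ML estimation developed in Section 3 can be sketched as follows: the NHPP log-likelihood for path i sums the log-rate at each observed arrival and subtracts the integrated rate over (0, ti]. The Python sketch below assumes a Weibull release-time density g and a constant external rate λ; the parameterization, starting values, and optimizer are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

rng = np.random.default_rng(1)

def neg_loglik(params, arrivals, obs_times):
    """Negative log-likelihood of K NHPP paths with rate mu*g(t) + lam."""
    mu, lam, shape, scale = params
    if min(mu, lam, shape, scale) <= 0:
        return np.inf
    nll = 0.0
    for s, t in zip(arrivals, obs_times):
        rate = mu * weibull_min.pdf(s, shape, scale=scale) + lam
        nll -= np.sum(np.log(rate))                                   # event term
        nll += mu * weibull_min.cdf(t, shape, scale=scale) + lam * t  # compensator
    return nll

# Simulate K = 4 paths under mu = 100, lam = 1, Weibull(shape 1.3, scale 100).
def one_path(t_end):
    internal = 100.0 * rng.weibull(1.3, size=rng.poisson(100))
    external = rng.uniform(0, t_end, size=rng.poisson(1.0 * t_end))
    return np.sort(np.concatenate([internal[internal <= t_end], external]))

obs_times = [500.0] * 4
arrivals = [one_path(t) for t in obs_times]

fit = minimize(neg_loglik, x0=[80, 0.8, 1.0, 80],
               args=(arrivals, obs_times), method="Nelder-Mead")
mu_hat, lam_hat, shape_hat, scale_hat = fit.x
print(mu_hat, lam_hat)
```

With only K = 4 paths the release-time parameters are estimated less precisely than μ and λ, which is consistent with the moderate-sample concerns examined by simulation in Section 6.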
Some statistical tests
Based on experience from lab testing as well as field use, it is reasonable to assume that the external environment is stable and the external arrival process is stationary. In the following, we confine ourselves to the case where the external arrival process is an HPP with a constant rate λ.
It is of interest to test the existence of the internal source and of the external source. Because we have assumed the external process to be an HPP, testing existence of the internal source is
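For concreteness, the test of the internal source can be sketched as a one-sided Wald test of H0: μ = 0 (no internal source) against H1: μ > 0, given the MLE of μ and its standard error. Note that μ = 0 lies on the boundary of the parameter space, which is one reason the normal approximation is later checked by simulation. The numbers below are purely illustrative, not the paper's estimates:

```python
from scipy.stats import norm

def wald_test_internal(mu_hat, se_mu, alpha=0.05):
    """One-sided Wald test of H0: mu = 0 (no internal source)
    against H1: mu > 0, given the MLE mu_hat and its standard error."""
    z = mu_hat / se_mu
    p_value = norm.sf(z)  # P(Z > z) under H0
    return z, p_value, p_value < alpha

# Illustrative inputs: mu_hat = 95 with standard error 12.
z, p, reject = wald_test_internal(mu_hat=95.0, se_mu=12.0)
print(round(z, 2), bool(reject))  # prints "7.92 True"
```

A large z indicates the internal source is present; an analogous Wald statistic for λ tests the external source.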
An illustrative example
The proposed framework is applied to evaluate the reliability of HDDs for an HDD company. Two experiments are carried out to collect the particle accumulation data and the CPC-to-failure data, respectively. In the first experiment, four prime disks without any particle adsorption apparatus, i.e., the air filter, are operated under the same normal use conditions. An ultra-fine particle counter is installed in each disk to count the number of particles accumulated. Without the air filter, the particle
Simulation study for checking the normal approximation
In the previous sections, Wald statistics are used under the assumption that the MLE of θ is approximately normally distributed. However, the adequacy of this approximation should be proved theoretically or checked by simulation, especially when the sample size is small or moderate. For illustrative purposes, we consider the scenario where μ=100, λ=1 and the release time distribution of the loose particles is Weibull(100,1.3). If the asymptotic normal property holds, we can expect that each of the following four
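The paper's simulation refits the full two-source model in each replicate. The procedure itself can be illustrated more cheaply on a simplified sub-case with a closed-form MLE: for a plain HPP observed on (0, T], the MLE is λ̂ = N(T)/T, so many replicates of the Wald z-score can be generated and compared with a standard normal. This Python sketch is a simplified stand-in for the paper's study, not a reproduction of it:

```python
import numpy as np

rng = np.random.default_rng(2)

# HPP sub-case: closed-form MLE lam_hat = N(T)/T, so the Monte Carlo
# check of the normal approximation is cheap to run.
lam_true, T, B = 1.0, 500.0, 2000
z = np.empty(B)
for b in range(B):
    lam_hat = rng.poisson(lam_true * T) / T
    z[b] = (lam_hat - lam_true) / np.sqrt(lam_hat / T)  # Wald z-score

coverage = np.mean(np.abs(z) < 1.96)  # should be close to 0.95
print(round(z.mean(), 3), round(z.std(), 3), round(coverage, 3))
```

If the approximation is adequate, the z-scores should have mean near 0, standard deviation near 1, and about 95% of them should fall inside (−1.96, 1.96); systematic departures would signal that the Wald tests need a larger sample or a different calibration.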
Conclusions
In this paper, a reliability evaluation framework for HDD test data has been developed and applied to a real problem. The counting process in the framework consists of two arrival sources. When the number of loose particles in the internal source follows a Poisson distribution, the resulting internal arrival process is shown to be an NHPP. This framework is simpler and easier to use compared with the data-driven method proposed by Tang et al. [15] and the general NHPP models such
Acknowledgment
The authors thank the editor and two reviewers for their constructive comments, which have considerably helped in the revision of an earlier version of the paper. The research by Prof. Xie is partially supported by a grant from the City University of Hong Kong (Project no. 9380058).
References (20)
- et al., A mixed-Weibull regression model for the analysis of automotive warranty data, Reliability Engineering & System Safety (2005)
- et al., Reliability modeling involving two Weibull distributions, Reliability Engineering & System Safety (1995)
- et al., Design stage confirmation of lifetime improvement for newly modified products through accelerated life testing, Reliability Engineering & System Safety (2010)
- et al., Hazard rate estimation from incomplete and unclean warranty data, Reliability Engineering & System Safety (2003)
- et al., Trend analysis of the power law process using expectation-maximization algorithm for data censored by inspection intervals, Reliability Engineering & System Safety (2011)
- et al., Degradation-based burn-in with preventive maintenance, European Journal of Operational Research (2012)
- et al., Effect of particulate concentration, materials and size on the friction and wear of a negative-pressure picoslider flying on a laser-textured disk, Wear (2001)
- et al., Statistical models based on counting processes (1993)
- et al., Prediction of remaining life of power transformers based on left truncated and right censored lifetime data, The Annals of Applied Statistics (2009)
- Statistical models and methods for lifetime data (2003)