Reliability evaluation of hard disk drive failures based on counting processes

https://doi.org/10.1016/j.ress.2012.07.003

Abstract

Reliability assessment for hard disk drives (HDDs) is important yet difficult for manufacturers. Motivated by the fact that particle accumulation in HDDs, which accounts for most HDD catastrophic failures, comes from both internal and external sources, a counting process with two arrival sources is proposed to model the particle accumulation process in HDDs. This model explains why traditional approaches to accelerated life test (ALT) data analysis break down for such data. Parameter estimation and hypothesis tests for the model are developed and illustrated with real data from an HDD test. A simulation study is conducted to examine the accuracy of the large-sample normal approximations that are used to test the existence of the internal and external sources.

Highlights

► Develop a reliability evaluation framework for HDDs based on NHPPs.
► Use the framework to explain why traditional ALT methods fail to analyze the data.
► Develop statistical inference and hypothesis testing approaches.
► Apply the framework to a real HDD problem.
► Conduct a simulation to examine the normal approximation used for hypothesis testing.

Introduction

Hard disk drives (HDDs) are probably the most important data storage devices in modern computing systems. Failure of an HDD often incurs high costs for the user due to the loss of important data. Therefore, HDDs are carefully designed to achieve extremely high reliability. According to Schroeder and Gibson [12], the mean time to failure of HDDs ranges from 1,000,000 to 1,500,000 h. A report by Samsung® [11] also states that the mean time between failures of their HDDs is as high as 1,200,000 h, which corresponds to an annual failure rate of about 0.7% if an exponential distribution is assumed. On the other hand, the rapid evolution of HDD manufacturing techniques imposes a stringent time constraint on the development of HDDs of the latest vintages. Typically, the time allowed for the production process ranges from 3 to 9 months. A typical production process includes concept development, design, pilot manufacturing and life testing [15]. Therefore, the time allowed for the life test of a new design is often very short, despite the high reliability of the new vintage. How to carry out the life test within a reasonable time frame and how to analyze the test data have become important issues for HDD reliability engineers.

Accelerated life tests (ALTs) provide reliability engineers with a quick way to obtain lifetime information. During an ALT, all test samples are subjected to high stresses in order to hasten the failure process [9]. Failure data from the ALT are collected and analyzed, and the results are then extrapolated to normal use conditions to predict the lifetime characteristics of interest. Success of an ALT depends largely on correct specification of the stresses, also known as accelerating factors, for the experiment. Choosing the right accelerating factors requires a good understanding of the failure mechanisms of the HDDs. Particle-induced failures have been found to be the most prevalent failure mechanism during the service life of an HDD. Particles accumulate in the disks over time and induce a variety of failure modes, e.g., circumferential scratches, head-media crashes and scratches along different head-disk interface directions (e.g., [7], [13], [20]). This motivated Tang et al. [16] and Tang et al. [15] to use injected particles as the accelerating factor and to test samples under different particle injection rates. However, they found that conventional methods for ALT data analysis, e.g., the log-location-scale regression model [5], cannot fit the time-to-failure data well. Therefore, instead of the traditional chronological time scale, they proposed the cumulative particle count (CPC) as the time scale: a failure occurs when the CPC exceeds a random threshold. This is analogous to using cumulative mileage as the time scale for tire treads (e.g., [19]). Based on this accelerating factor and the new time scale, Tang et al. [15] developed a statistical modeling framework to analyze the test data. In their framework, a generalized non-linear regression model is used to fit the CPC paths and a best-fit distribution is chosen for the CPC-to-failure data, from which the failure time distribution can be computed. This framework is essentially data-driven and thus has several deficiencies. For example, it treats the CPC paths as deterministic, ignoring the fact that particle accumulation is actually a random process, as can readily be seen from Fig. 3 in Section 5. In addition, it cannot explain why traditional methods do not work well for ALT data of HDDs.

In this paper, we propose a process-based approach for analyzing the HDD testing data. This approach is based on the particle accumulation mechanism of the HDDs. Instead of the generalized non-linear regression model, we propose to use a counting process to model the CPC paths. This framework is depicted in Fig. 1.

To apply this framework, we analyze the physics of particle accumulation to specify which counting process to use. The CPC paths indicate that the rate of particle accumulation appears to decrease over time. Further examination of the HDD physics reveals that there are two sources of particles, i.e., an internal source and an external source. The internal source consists primarily of loose particles trapped inside the drive during manufacture. These loose particles are released gradually, accumulate in the disk, and dominate the particle accumulation process at the beginning, which accounts for the high but decreasing accumulation rate observed in the initial stage of the CPC paths. The external source consists of particles from the ambient environment: some dust particles escape the air filter, land on the HDD platter, and accumulate in the disk. After almost all the loose particles from the internal source have been released, the external source dominates the particle accumulation process. Under this circumstance, if the accumulation of external particles is relatively stationary, the particle accumulation process becomes approximately linear, which explains the second phase of the CPC paths shown in Fig. 3 in Section 5. This physical behavior motivates us to use a counting process with internal and external sources. We assume the internal source has a finite but unknown number of loose particles with i.i.d. release times to the HDD disk, while particle arrivals from the external source follow a non-homogeneous Poisson process (NHPP). The NHPP assumption is reasonable, as will be justified in Section 2.1. In this study, we investigate the properties of this counting process and use it to evaluate the reliability of HDDs.
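As a rough illustration of this two-source structure, the following sketch simulates one CPC path. It assumes a Poisson number of internal loose particles with i.i.d. Weibull release times and a homogeneous external arrival stream; the observation window tau and the specific parameter values (borrowed from the simulation scenario in Section 6) are illustrative assumptions, not settings prescribed by the paper.

```python
# Minimal simulation sketch of the two-source particle accumulation process.
# Assumptions (the paper gives no code): the internal source has a
# Poisson(mu) number of loose particles with i.i.d. Weibull release times,
# and the external source is a homogeneous Poisson process with rate lam,
# observed over (0, tau].
import numpy as np

rng = np.random.default_rng(1)

def simulate_cpc_path(mu, lam, weib_scale, weib_shape, tau):
    """Return sorted particle arrival times in (0, tau] for one HDD."""
    # Internal source: Poisson(mu) loose particles, i.i.d. Weibull releases;
    # only releases falling inside the observation window are observed.
    m = rng.poisson(mu)
    internal = weib_scale * rng.weibull(weib_shape, size=m)
    internal = internal[internal <= tau]
    # External source: homogeneous Poisson process with rate lam on (0, tau].
    n_ext = rng.poisson(lam * tau)
    external = rng.uniform(0.0, tau, size=n_ext)
    return np.sort(np.concatenate([internal, external]))

# One example path: early arrivals are dominated by internal releases,
# later arrivals by the roughly linear external stream.
arrivals = simulate_cpc_path(mu=100, lam=1.0, weib_scale=100.0, weib_shape=1.3, tau=500.0)
print(len(arrivals), arrivals[:5])
```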

The rest of this paper is organized as follows. Properties of the counting process with two arrival sources are investigated in Section 2. It is found that when the number of loose particles in the internal source follows a Poisson distribution, the resulting internal particle release process is an NHPP, regardless of the release time distribution of these loose particles. Section 3 presents maximum likelihood (ML) estimation for this counting process. Statistical test methods are discussed in Section 4. The proposed model is used to fit an experimental dataset in Section 5. In Section 6, a simulation study is conducted to examine the appropriateness of the normal approximation when the sample size is moderate. Section 7 concludes the paper.

Section snippets

A counting process with two arrival sources

Let {N(t),t≥0} with N(0)=0 denote the particle accumulation process associated with an HDD. N(t) is the total number of particles that have accumulated in the disk by time t. Suppose that the arrivals are from two independent sources, i.e., the internal and external sources. Generally speaking, it is impossible to distinguish the source of an incoming particle. In this section, we study the properties of these two sources.

Inter-arrival data

Assume K HDDs are subject to test. Consider the particle arrival process {N_i(t), t ≥ 0} for the ith HDD sample over the time interval (0, t_i), where t_i is a predetermined observation time. Let S_{i,0} = 0, S_{i,1}, ..., S_{i,N_i(t_i)} be the observed arrival times. Because {N_i(t), t ≥ 0} is an NHPP with arrival rate function μg(t) + λ(t), it is readily verified that
$$\Pr\{X_{i,n}>t\mid S_{i,n-1}\}=\exp\!\left(\mu\bar{G}(t)-\mu\bar{G}(S_{i,n-1})-\int_{S_{i,n-1}}^{t}\lambda(u)\,\mathrm{d}u\right),$$
and
$$p\{X_{i,n}=t\mid S_{i,n-1}\}=\left[\mu g(t)+\lambda(t)\right]\exp\!\left(\mu\bar{G}(t)-\mu\bar{G}(S_{i,n-1})-\int_{S_{i,n-1}}^{t}\lambda(u)\,\mathrm{d}u\right),$$
where X_{i,n} = S_{i,n} − S_{i,n−1}, n ≥ 1, is the nth inter-arrival time.
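For an NHPP with intensity μg(t) + λ(t) observed on (0, t_i], multiplying these conditional densities over the observed arrivals (together with the survival term beyond the last arrival) gives the standard likelihood ∏_j [μg(s_{i,j}) + λ(s_{i,j})] · exp(−μG(t_i) − ∫_0^{t_i} λ(u)du). A minimal sketch of the corresponding negative log-likelihood is given below, assuming a Weibull release-time distribution G with scale η and shape β and a constant external rate λ; the parameterization, optimizer and starting values are illustrative choices, not the authors' implementation.

```python
# Sketch of the joint negative log-likelihood over K HDDs, under the
# assumed Weibull(eta, beta) release-time distribution and constant
# external rate lam, i.e. NHPP intensity mu*g(t) + lam.
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, arrival_times, obs_times):
    """theta = (mu, lam, eta, beta); arrival_times is a list of arrays of
    arrival times, obs_times the matching observation windows t_i."""
    mu, lam, eta, beta = theta
    if min(mu, lam, eta, beta) <= 0:
        return np.inf
    ll = 0.0
    for s, t in zip(arrival_times, obs_times):
        g = (beta / eta) * (s / eta) ** (beta - 1) * np.exp(-(s / eta) ** beta)  # Weibull pdf
        G_t = 1.0 - np.exp(-(t / eta) ** beta)                                   # Weibull cdf at t_i
        ll += np.sum(np.log(mu * g + lam)) - mu * G_t - lam * t
    return -ll

# Hypothetical usage with paths from the earlier simulation sketch:
# paths = [simulate_cpc_path(100, 1.0, 100.0, 1.3, 500.0) for _ in range(4)]
# fit = minimize(neg_log_lik, x0=[50, 0.5, 80, 1.0],
#                args=(paths, [500.0] * 4), method="Nelder-Mead",
#                options={"maxiter": 5000})
# mu_hat, lam_hat, eta_hat, beta_hat = fit.x
```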

Some statistical tests

Based on experience from both lab testing and field use, it is reasonable to assume that the external environment is stable and that the external arrival process is stationary. In the following, we confine ourselves to the case where the external arrival process is a homogeneous Poisson process (HPP) with a constant rate λ.

It is of interest to test the existence of the internal source and of the external source. Because we have assumed the external process to be an HPP, testing the existence of the internal source is
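As one way to illustrate such a test, the sketch below forms Wald z-statistics for the null hypotheses μ = 0 (no internal source) and λ = 0 (no external source), using the large-sample normality of the MLE on which the paper's tests rely. The finite-difference observed information, step sizes and one-sided rejection rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of Wald tests for the presence of each source, reusing the
# neg_log_lik sketch above. Standard errors come from a finite-difference
# observed information matrix (Hessian of the negative log-likelihood).
import numpy as np
from scipy.stats import norm

def numerical_hessian(f, x, args=(), rel_step=1e-4):
    """Central finite-difference Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    p = len(x)
    h = rel_step * np.maximum(1.0, np.abs(x))
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = h[i]
            ej = np.zeros(p); ej[j] = h[j]
            H[i, j] = (f(x + ei + ej, *args) - f(x + ei - ej, *args)
                       - f(x - ei + ej, *args) + f(x - ei - ej, *args)) / (4 * h[i] * h[j])
    return H

def wald_z(theta_hat, index, f, args=()):
    """z-statistic for H0: theta[index] = 0 via asymptotic normality of the MLE."""
    info = numerical_hessian(f, theta_hat, args)   # observed information of -loglik
    cov = np.linalg.inv(info)                      # asymptotic covariance of the MLE
    se = np.sqrt(cov[index, index])
    return theta_hat[index] / se

# Hypothetical usage, continuing the earlier sketches (theta = (mu, lam, eta, beta)):
# z_internal = wald_z(fit.x, 0, neg_log_lik, (paths, [500.0] * 4))
# z_external = wald_z(fit.x, 1, neg_log_lik, (paths, [500.0] * 4))
# p_internal = 1.0 - norm.cdf(z_internal)  # small p-value -> evidence of an internal source
```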

An illustrative example

The proposed framework is applied to evaluate the reliability of HDDs for an HDD company. Two experiments are carried out to collect the particle accumulation data and the CPC-to-failure data, respectively. In the first experiment, four prime disks without any particle-adsorption apparatus (i.e., the air filter) are operated under the same normal use conditions. An ultra-fine particle counter is installed in each disk to count the number of accumulated particles. Without the air filter, the particle

Simulation study for checking the normal approximation

In the previous sections, Wald statistics are used under the assumption that the MLEs of θ are approximately normally distributed. However, the adequacy of such an approximation should be established theoretically or checked by simulation, especially when the data size is small or moderate. For illustrative purposes, we consider the scenario where μ=100, λ=1 and the release time distribution of the loose particles is Weibull(100, 1.3). If the asymptotic normality holds, we can expect that each of the following four
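A crude version of such a simulation check is sketched below: the two-source process with μ = 100, λ = 1 and Weibull(100, 1.3) release times is simulated repeatedly, the parameters are re-estimated each time, and the sampling distribution of each estimator is inspected for normality. The observation window, number of test units, replication count and the use of a Shapiro–Wilk test are assumptions for illustration only, reusing the simulate_cpc_path and neg_log_lik sketches from earlier sections.

```python
# Rough sketch of a Monte Carlo check of the normal approximation for the
# MLEs, under the Section 6 scenario mu=100, lam=1, Weibull(100, 1.3).
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def one_replication(tau=500.0, n_hdd=4):
    """Simulate n_hdd CPC paths and return the MLE of (mu, lam, eta, beta)."""
    paths = [simulate_cpc_path(100, 1.0, 100.0, 1.3, tau) for _ in range(n_hdd)]
    fit = minimize(neg_log_lik, x0=[80, 0.8, 80, 1.0],
                   args=(paths, [tau] * n_hdd), method="Nelder-Mead",
                   options={"maxiter": 5000})
    return fit.x

# estimates = np.array([one_replication() for _ in range(500)])
# for k, name in enumerate(["mu", "lambda", "eta", "beta"]):
#     print(name, stats.shapiro(estimates[:, k]).pvalue)  # crude normality check
```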

Conclusions

In this paper, a reliability evaluation framework for HDD test data has been developed and applied to a real problem. The counting process in the framework consists of two arrival sources. When the number of loose particles in the internal source follows a Poisson distribution, the resulting internal arrival process was shown to be an NHPP. This framework is simpler and easier to use than the data-driven method proposed by Tang et al. [15] and general NHPP models such

Acknowledgment

The authors thank the editor and the two reviewers for their constructive comments, which have considerably helped in the revision of an earlier version of the paper. The research by Prof. Xie is partially supported by a grant from the City University of Hong Kong (Project no. 9380058).

