
Expert Systems with Applications

Volume 57, 15 September 2016, Pages 324-336

Self-adaptive statistical process control for anomaly detection in time series

https://doi.org/10.1016/j.eswa.2016.03.029

Highlights

  • We model anomaly detection as statistical hypothesis testing based on fuzzy set theory.

  • Detection rate and false alarm rate are almost unaffected by different values of K.

  • K optimization is necessary for AUC performance improvement.

  • Fuzzification can effectively reduce the false alarm rate.

  • This approach results in high AUC performance and reduces the detection time.

Abstract

Anomaly detection in time series has become a widespread problem in areas such as intrusion detection and industrial process monitoring. Major challenges for anomaly detection systems include unknown data distributions, control limit determination, multiple parameters, the need for training data, and the fuzziness of ‘anomaly’. Motivated by these considerations, a novel model is developed whose salient feature is a synergistic combination of statistical and fuzzy set-based techniques. We view the anomaly detection problem as a form of statistical hypothesis testing. At the same time, ‘anomaly’ is itself a fuzzy concept and can therefore be described with fuzzy sets, which bring a facet of robustness to the overall scheme. Intensive fuzzification is engaged and plays an important role in the subsequent step of hypothesis testing. Because of this fuzzification, the proposed algorithm is distribution-free and self-adaptive, which overcomes the limitations of control limit determination and multiple parameters. The framework is realized in an unsupervised mode, leading to great portability and scalability. Performance is assessed in terms of ROC curves on the University of California, Riverside (UCR) repository. A series of experiments shows that the proposed approach significantly increases the AUC while reducing the false alarm rate. In particular, it is capable of detecting anomalies at the earliest possible time.

Introduction

Anomaly detection in time series provides significant information for numerous applications. For example, it can be used to detect intrusions in network data (Abadeh, Mohamadi, & Habibi, 2011), fraud (Ahmed, Mahmood, & Islam, 2016), and faults in industrial processes (Brighenti & Sanz-Bobi, 2011). Anomalies in time series can manifest as changes in the amplitude of the data or as changes in the shape of temporal waveforms. In light of this, we categorize anomalies into two types: anomalies in amplitude and anomalies in shape. For example, a premature ventricular contraction in electrocardiogram (ECG) signals (Fig. 1) is an anomaly in amplitude, and a premature poppet withdrawal in a Space Shuttle Marotta Valve time series (Fig. 2) is an anomaly in shape. These anomalous parts are highlighted in red in both figures.

Anomalies are time series that are the least similar to all other time series and that depart from the bounds of the state of statistical control, a state which exists when certain critical process variables remain close to their target values and do not change perceptibly. Time series that stay in a state of statistical control are called in-control data (normal data); otherwise, they are called out-of-control data (anomalies). In statistical process control, control charts are used to determine whether a process is in a state of statistical control. As shown in Fig. 3, a control chart consists of the following components (a brief illustrative sketch of these limits is given after the list):

  • (1)

    Points representing a statistic of measurements of a quality characteristic, computed from samples taken from the process at different times or from different data.

  • (2)

    The mean of this statistic computed over all the samples, at which the center line is drawn.

  • (3)

    An upper control limit (UCL) and a lower control limit (LCL) that indicate the thresholds beyond which the process output is considered statistically ‘unlikely’.
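
To make these components concrete, the following minimal sketch computes the center line and k-sigma limits of a classical Shewhart-style chart. The 3-sigma default and the normality it implicitly assumes are illustrative assumptions only; the approach developed later in this paper replaces such crisp, distribution-dependent limits with a fuzzified, self-adaptive threshold.

import numpy as np

def control_chart_limits(samples, k=3.0):
    # Center line and k-sigma control limits for a 1-D statistic.
    # Classical Shewhart construction, shown only for illustration;
    # the paper's method does not rely on fixed k-sigma limits.
    samples = np.asarray(samples, dtype=float)
    center = samples.mean()          # center line (CL)
    sigma = samples.std(ddof=1)      # sample standard deviation
    return center, center + k * sigma, center - k * sigma  # CL, UCL, LCL

# Flag points that fall outside the crisp control limits
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.0, size=200)
cl, ucl, lcl = control_chart_limits(data)
out_of_control = np.where((data > ucl) | (data < lcl))[0]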

Anomaly detection in time series is challenging for several reasons. First, control limits are very important decision aids: they provide information about the process behavior but have no intrinsic relationship to any specification targets. In practice, the process mean (the center line) may not coincide with the specified value of the quality characteristic, because the process design simply cannot deliver the characteristic at the desired level. Selecting a threshold instead of the process mean is therefore a key challenge. Second, ‘anomaly’ is a complex concept. For example, if a sample’s characteristic is equal to UCL - ε (where ε is an infinitesimally small positive number), the sample is normal; but if the characteristic is equal to UCL + ε, it becomes difficult to determine whether it is normal or abnormal. Third, many existing algorithms require several parameters whose values must be determined. This requires acquiring large amounts of training data, and most such algorithms are therefore realized in a supervised mode.

To address these major challenges in anomaly detection systems, namely unknown data distributions, control limit determination, multiple parameters, the need for training data, and the fuzziness of ‘anomaly’, this paper proposes a synergistic combination of statistical and fuzzy set-based techniques. We view anomaly detection as statistical hypothesis testing and introduce a definition based on the control chart used in statistical process control. Because the process mean may not coincide with the specified value of the quality characteristic, we do not adopt the mean of the samples’ characteristic but a threshold instead. Since ‘anomaly’ is a complex concept, this threshold should be fuzzy; fuzzy set theory is employed to provide a better characterization of the boundary between normal and abnormal. Moreover, the inequalities (>, ≤) in the statistical hypothesis test are treated as fuzzy predicates (degrees of inclusion). An intensive fuzzification process determines the related parameters in a self-adaptive manner, so their values do not have to be specified by the user. Owing to the use of fuzzy set theory, the statistical hypothesis testing in this paper is distribution-free and entirely unsupervised, and the overall scheme is self-adaptive. The utility is demonstrated using synthetic and real data sets, and a number of studies show the effectiveness of our algorithm in detecting anomalies in time series data.
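
As a rough illustration of the fuzzy predicate idea, the sketch below replaces the crisp comparison ‘score > UCL’ with a degree of truth in [0, 1]. The sigmoid membership function and its softness parameter are assumptions made here for illustration and are not the paper’s specific membership functions.

import numpy as np

def fuzzy_exceeds(score, ucl, softness=0.5):
    # Degree to which 'score > ucl' holds, as a fuzzy predicate.
    # A crisp test flips abruptly between UCL - eps and UCL + eps;
    # the sigmoid below instead yields a gradual degree of anomaly.
    # 'softness' is an assumed illustrative parameter, not from the paper.
    return 1.0 / (1.0 + np.exp(-(np.asarray(score, dtype=float) - ucl) / softness))

# Scores just below and just above the UCL receive similar, intermediate degrees
print(fuzzy_exceeds([4.9, 5.0, 5.1], ucl=5.0).round(3))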

The paper is structured as follows. Section 2 reviews some previous work on anomaly detection. Section 3 illustrates how anomaly detection can be viewed as statistical hypothesis testing. A fuzzy-statistical algorithm for detecting anomalies is described in Section 4. We present some applications and perform an extensive evaluation in Section 5 to demonstrate both the utility of the approach and its ability to detect anomalies. Finally, the paper is summarized and concluded in Section 6.

Section snippets

Related works

The broad categories of anomaly detection techniques are: classification-based techniques (Koc, Mazzuchi, & Sarkani, 2012; Dangelo, Palmieri, Ficco, & Rampone, 2015), nearest neighbor-based techniques (Ceclio, Ottewill, Pretlove, & Thornhill, 2014; Sajjad, Bouk, & Yousaf, 2015; Lin, Ke, & Tsai, 2015), clustering-based techniques (Ahmed, Mahmood, & Maher, 2015; Lee, Kim, & Kim, 2011), statistical techniques (Zhang, Lu, Zhang, & Ruan, 2016; Pierazzi, Casolari, Colajanni, Marchetti,

Anomaly detection based on statistical process control

According to statistical process control, a statistic of the quality characteristic should be defined; it represents the anomaly score of the data. The larger the anomaly score, the more likely the data is to be anomalous. Hence, we only consider how large the anomaly score is and do not need to care how small it is; that is, only the upper control limit needs to be determined, and the lower control limit can be disregarded. Now we illustrate how any anomaly detection scheme can be viewed as a
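
For concreteness, the sketch below computes one possible anomaly score for sliding windows of a series (a discord-style nearest-neighbour distance, an assumed example rather than the statistic used in the paper) and applies only an upper control limit, in line with the one-sided view described above; the crisp 3-sigma limit is likewise only a placeholder.

import numpy as np

def window_anomaly_scores(series, w):
    # Anomaly score per window: distance to its nearest non-overlapping window.
    # An illustrative, assumed statistic; any score where larger means
    # 'more anomalous' fits the one-sided (UCL-only) test described above.
    x = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(x, w)
    n = len(windows)
    scores = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(windows - windows[i], axis=1)
        d[max(0, i - w + 1):i + w] = np.inf   # ignore trivially overlapping neighbours
        scores[i] = d.min()
    return scores

series = np.sin(np.linspace(0, 20 * np.pi, 1000))
series[500:520] += 2.0                        # injected amplitude/shape change
scores = window_anomaly_scores(series, w=20)
ucl = scores.mean() + 3 * scores.std()        # placeholder crisp upper control limit
anomalous_windows = np.where(scores > ucl)[0] # only 'score > UCL' is tested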

Self-adaptive detection model

As mentioned earlier, ‘anomaly’ is a complex concept, and therefore the threshold determination can be realized by engaging fuzzy sets. In this section, we show how to determine the threshold using fuzzy set theory and a fuzzification process that leads to the treatment of the inequalities (>, ≤) as fuzzy predicates. The aim of fuzzification is to achieve optimized results. Let us also note that, because of fuzzification, the algorithm is a distribution-free statistical test.
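
To give a flavour of what ‘self-adaptive’ and ‘distribution-free’ can mean in practice, the following sketch derives both the reference point and the spread of the fuzzy boundary from the data themselves (via the median and the median absolute deviation) and then applies a soft comparison analogous to the earlier sketch. These particular choices are assumptions made for illustration and stand in for the paper’s own fuzzification, which is defined in this section.

import numpy as np

def anomaly_degrees(scores):
    # Distribution-free, parameter-free fuzzy anomaly degrees in [0, 1].
    # Reference (median) and spread (median absolute deviation) are
    # derived from the data, so no user-specified control limit is needed.
    # This is an illustrative stand-in, not the paper's exact scheme.
    s = np.asarray(scores, dtype=float)
    ref = np.median(s)
    spread = np.median(np.abs(s - ref)) + 1e-12   # avoid division by zero
    return 1.0 / (1.0 + np.exp(-(s - ref) / spread))

print(anomaly_degrees([0.9, 1.1, 1.0, 0.95, 6.3, 1.05]).round(3))
# the outlying score receives a degree close to 1 without any tuned threshold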

Experiments and discussions

We begin the experiments by showing the usefulness of the proposed algorithm on synthetic and real-life data, covering anomalies in both shape and amplitude. Then, we perform several experiments to evaluate its performance. Finally, we contrast our algorithm with several baseline algorithms to show that it is able to find anomalies efficiently. In our experiments, the algorithm works without a training process, and K can assume any value.
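
Since performance is reported in terms of ROC curves and AUC, the brief sketch below shows how such an evaluation can be computed from point-wise ground-truth labels and the detector’s anomaly degrees. The labels and degrees used here are made-up illustrative values, and scikit-learn is assumed as the evaluation library rather than being prescribed by the paper.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up illustrative values: 1 marks an anomalous point, 0 a normal one.
labels  = np.array([0, 0, 0, 1, 1, 0, 0, 0, 1, 0])
degrees = np.array([0.1, 0.2, 0.15, 0.9, 0.8, 0.3, 0.2, 0.1, 0.7, 0.25])

# AUC summarizes detection rate vs. false alarm rate over all thresholds.
auc = roc_auc_score(labels, degrees)
fpr, tpr, thresholds = roc_curve(labels, degrees)
print(f"AUC = {auc:.3f}")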

Conclusion

In this work, we have developed a self-adaptive algorithm for finding anomalies in time series. A key feature of the algorithm is a synergistic combination of statistical and fuzzy set-based theories. By exploiting fuzzy set theory within statistical process control, the detection is a distribution-free and unsupervised model. K optimization is necessary for AUC improvement; in this case, the detection rate and false alarm rate are almost unaffected by different values of K. False alarm rate has been

Acknowledgments

This work was funded by the National Natural Science Foundation of China (No. 91520204) and the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015405).

References (39)

Cited by (31)

  • Distributed SFA-CA monitoring approach for nonstationary plant-wide process and its application on a vinyl acetate monomer process

    2022, Process Safety and Environmental Protection
    Citation Excerpt:

    Owing to the increasing demand in plant safety and product quality, process monitoring and fault diagnosis play an increasingly important role. With the application of distributed control system in modern process industry and the development of computing technology, large amounts of process data are restored, leading the development of data-driven methods (Amin et al., 2021; Arunthavanathan et al., 2021; Dominic et al., 2015; Nan et al., 2007; Zheng et al., 2016). Multivariate statistical process monitoring (MSPM) methods, such as principal component analysis (PCA) and partial least square (PLS), can extract the main features from high-dimensional data.

  • Object oriented time series exploration: Applied to power consumption analysis of embedded systems

    2021, Expert Systems with Applications
    Citation Excerpt:

    Finally, the contribution is discussed and summarized in Sections 7 and 8, respectively. Anomalies in time series (Zheng, Li, & Zhao, 2016) usually relate to the amplitude, shape, or time features of temporal waveforms. They can be defined as anomaly scores, e.g., minimal, maximal (extreme) values, some statistical features, etc.

  • Efficient on-line anomaly detection for ship systems in operation

    2019, Expert Systems with Applications
    Citation Excerpt:

    An extensive number of anomaly detection methods are described in the literature and used extensively in a wide variety of applications in various industries. The available techniques comprise (Chandola et al., 2009; Kanarachos, Christopoulos, Chroneos, & Fitzpatrick, 2017; Olson, Judd, & Nichols, 2018; Zheng, Li, & Zhao, 2016): classification methods that are rule-based, or based on Neural Networks, Bayesian Networks or Support Vector Machines; nearest neighbour based methods, including k nearest neighbour and relative density; clustering based methods; and statistical and fuzzy set-based techniques, including parametric and non-parametric methods based on histograms or kernel functions. The fundamental approaches to the problem of anomaly detection can be divided into three categories (Chandola et al., 2009; Hodge & Austin, 2004):

  • Evaluating the benefits of using proactive transformed-domain-based techniques in fraud detection tasks

    2019, Future Generation Computer Systems
    Citation Excerpt:

    This happens when it is important to characterize the involved elements on the basis of the time factor [22]. The information extracted from the time series can be exploited in order to perform different tasks, such as those related to the risk analysis (e.g., Credit Scoring [23] and Stock Forecasting [24]) and Information Security (e.g., Fraud Detection [25] and Intrusion Detection [26]) ones. In other words, the relationship between time series and our fraud detection approach must be sought in the analysis, performed in the frequency domain, of patterns given by the feature values of a transaction.

  • Fuzzified Cuckoo based Clustering Technique for Network Anomaly Detection

    2018, Computers and Electrical Engineering
    Citation Excerpt:

    Results are reported by applying these metrics to aforementioned datasets. Table 11 shows the corresponding results using K-means, Decision Tree, PSO, CSO (MSE), CSO (MSE, SI), FCOAC [22], SADA [23], SSAD [24], TVCPSO [25] and F-CBCT. Further, it can be noticed from Fig. 5 that TVCPSO gives comparable performance in most of the cases but FPR of the proposed F-CBCT is quite less (in the considered datasets) as compared to the proposed one.

  • Multivariate time series anomaly detection: A framework of Hidden Markov Models

    2017, Applied Soft Computing Journal
    Citation Excerpt:

    The first difficulty arises because of the lack of a concise and operational anomaly definition [12]. Unusual points (exhibiting too high or too low values) and unexpected subsequences (e.g., shape changes) [13] appearing in univariate time series can be considered as anomaly. Unlike these definitions, multivariate techniques do not only deal with the abnormal values or subsequences in each time series but also investigate the relationships among these variables.
