Elsevier

Pattern Recognition Letters

Volume 153, January 2022, Pages 254-260
Anomaly detection in streaming data with Gaussian process based stochastic differential equations

https://doi.org/10.1016/j.patrec.2021.12.017

Highlights

  • Streaming data are characterised as evolution of Stochastic Differential Equations.

  • Gaussian process regression used for nonparametric estimation of SDE coefficients.

  • Stochastic process theory applied to automatically compare SDE models.

  • Bootstrapping methods allow for control of the false alarm rate of detector.

  • Evaluation on datasets demonstrates detector’s high discriminative power.

Abstract

This paper characterises streaming data as the evolution of a stochastic differential equation, with the aim of extracting information that can be used to detect anomalies in the stream. Gaussian process regression provides a flexible approach to approximating components of the stochastic differential equation, allowing for complex modelling of underlying data generation dynamics. The proposed algorithm displays superior discriminative power over different time-series anomaly detection methods for both synthetic and NYC taxi datasets, whilst the introduced bootstrapping method for setting the detection threshold provides control over the false alarm rate of the anomaly detector.

Introduction

Anomaly detection is a key application of streaming data analysis, with examples ranging from quality control and system health monitoring to fraud and intrusion detection. Anomaly detection systems automatically detect anomalous data and alert human analysts to the problem so that it can be resolved. Improving anomaly detection systems can prevent unnecessary human interventions by reducing false positive rates [1], and allows analysts to react more quickly to developing problems by reducing detection times.

In [2], a general definition of an anomaly is given as “the measurable consequences of an unexpected change in state of a system which is outside of its local or global norm”. Types of anomalies include Global (a.k.a. point), Collective and Contextual anomalies, detailed in [3]. Detrending and deseasonalizing the data as a preprocessing step, leaving the residual process, provides a way to detect all of these types of anomalies. Anomalies appear as deviations from normal behaviour; because this decomposition treats seasonality and trend as the normal behaviour, anomalous data would be contained entirely within the residual process.

The main purpose of this paper is to combine nonparametric estimation techniques for stochastic differential equations (SDEs) with statistical testing methodology to provide a general algorithm for streaming anomaly detection. SDE estimation is used to extract both the deterministic and stochastic dynamics of the normal data generation process, and statistical testing is used to compare incoming data against this learned model to test for the presence of anomalies.
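For orientation, the deterministic and stochastic dynamics referred to above are conventionally written as the drift and diffusion coefficients of a scalar Itô SDE. The notation below (f, g, W_t, Δt) is standard textbook notation assumed here for illustration, not taken from the paper. The two conditional-moment identities are the usual regression targets for nonparametric coefficient estimation.

```latex
% Scalar Ito SDE: f is the drift, g the diffusion, W_t a Wiener process.
dX_t = f(X_t)\,dt + g(X_t)\,dW_t

% Drift as the first conditional moment of the increments.
f(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\,
       \mathbb{E}\left[ X_{t+\Delta t} - X_t \mid X_t = x \right]

% Squared diffusion as the second conditional moment of the increments.
g^2(x) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\,
         \mathbb{E}\left[ (X_{t+\Delta t} - X_t)^2 \mid X_t = x \right]
```

With a finite sampling interval, the increments divided by Δt (and their squares divided by Δt) serve as noisy observations of f and g² at the current state, which any regression method can then smooth.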

Section snippets

Related works

The nonparametric estimation of SDEs has been used to characterise and distinguish between different data generation processes in many applications. Initial work on SDE estimation focused on learning the dynamics of a system based on noisy state observations [4], [5]. In [4], histogram based learning of the deterministic and stochastic parts of the SDE was applied to differentiating between classes of patient tremors, using neurophysiological time-series data. In [5], human heartbeat dynamics

Methodology

In this section, we outline the relevant theory and extensions that allow us to perform anomaly detection using a Gaussian process regression (GPR) based estimate of an SDE. Firstly, SDE estimation using GPR is outlined in Section 3.1, which is followed by the validation method used to set hyperparameters. We apply stochastic process theory to the GP based SDEs to determine a test statistic used for the anomaly detection. We use bootstrapping methodology for setting the decision threshold of the statistical test to control

Experiments

The function approximation capability of GPR for SDE coefficient estimation is demonstrated against polynomial regression. Sample paths are simulated from a known SDE, the diffusion coefficient is estimated using both techniques, and the resulting approximating functions are compared against each other and against the ground truth.
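The simulate-then-estimate procedure described above can be sketched as follows for the polynomial regression side. The drift x - x^3, the diffusion sqrt(1 + x^2), the step size and the path length are assumptions made for this illustration; the paper's own bistable SDE and settings are not given in this excerpt. The path is generated with the Euler-Maruyama scheme, and the squared increments divided by the time step are regressed on the state with an order-6 polynomial.

```python
import numpy as np


def drift(x):
    # Assumed bistable drift with stable points near x = -1 and x = +1.
    return x - x ** 3


def diffusion(x):
    # Assumed state-dependent diffusion, so that g^2(x) = 1 + x^2.
    return np.sqrt(1.0 + x ** 2)


def simulate_em(x0, n_steps, dt, rng):
    """Euler-Maruyama simulation of dX = drift(X) dt + diffusion(X) dW."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    dw = rng.normal(0.0, np.sqrt(dt), size=n_steps)  # Wiener increments
    for k in range(n_steps):
        x[k + 1] = x[k] + drift(x[k]) * dt + diffusion(x[k]) * dw[k]
    return x


rng = np.random.default_rng(0)
dt = 1e-3
path = simulate_em(0.0, 200_000, dt, rng)

# Noisy observations of g^2 at each visited state: (X_{k+1} - X_k)^2 / dt.
states = path[:-1]
targets = np.diff(path) ** 2 / dt

# Order-6 polynomial regression of the squared-diffusion targets on the state.
poly_g2 = np.polynomial.Polynomial.fit(states, targets, deg=6)

grid = np.array([-1.0, 0.0, 1.0])
print("estimated g^2:", poly_g2(grid))
print("true g^2:     ", diffusion(grid) ** 2)
```

Comparing the two printed rows gives a quick check of the fit in the well-visited region; outside the range of observed states a polynomial of this order is unconstrained and can diverge quickly.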

Our SDE-GPR anomaly detection algorithm (Algorithm 1) is evaluated with two examples. Firstly, testing is performed on a synthetic dataset generated to demonstrate the effectiveness

Function approximation

For the function approximation, an example realisation from the bistable SDE process is shown in Fig. 1, along with the polynomial and Gaussian process regression results for the diffusion function.
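A Gaussian process regression estimate of a diffusion-type function can be sketched in a few lines of NumPy. This is a generic zero-mean GPR posterior mean with a squared-exponential (SE) kernel, not the authors' implementation. The synthetic training data, the target function 1 + x^2, the noise level, and the interpretation of s as an amplitude and λ as an additive diagonal term are all assumptions made for illustration; the values s = 1, l = 1 and λ = 1 follow the settings quoted in this section.

```python
import numpy as np


def se_kernel(a, b, s=1.0, l=1.0):
    """Squared-exponential kernel: s^2 * exp(-(a - b)^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return s ** 2 * np.exp(-0.5 * (d / l) ** 2)


def gpr_mean(x_train, y_train, x_test, s=1.0, l=1.0, lam=1.0):
    """Posterior mean of a zero-mean GP; lam is added to the kernel diagonal."""
    K = se_kernel(x_train, x_train, s, l)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    return se_kernel(x_test, x_train, s, l) @ alpha


rng = np.random.default_rng(1)
# Assumed synthetic data: noisy observations of g^2(x) = 1 + x^2 on [-2, 2].
x_train = rng.uniform(-2.0, 2.0, size=200)
y_train = 1.0 + x_train ** 2 + 0.5 * rng.normal(size=200)

# One point inside the data range and one far outside it.
x_test = np.array([0.0, 10.0])
mean = gpr_mean(x_train, y_train, x_test)
print(mean)
```

Inside the data range the posterior mean tracks the underlying function, while far from any training point the SE kernel decays and a zero-mean GP returns to its prior mean of zero rather than extrapolating.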

Fig. 2 shows the resulting mean functions learned using a GPR with SE kernel, s = 1, l = 1 and regularization λ = 1, and a polynomial regression of order n = 6. The regularisation used for the Gaussian process regression provides a better behaved estimated function, with the function at the tails

Conclusion

Our proposed method, SDE-GPR, is shown to distinguish between normal data generation and anomalous behaviour through automatic comparison of the underlying dynamics. SDE-GPR outperformed a baseline detection method on both the synthetic and NYC taxi datasets in terms of discriminative power, whilst still providing control over false alarm rates. SDE-GPR also discriminates contextual anomalies better than the ARIMA model, based on AUC. Possible future work could include

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been supported by Splunk Inc. [G106483] PhD scholarship funding. Mark Girolami was supported by Engineering and Physical Sciences Research Council Grants [EP/R034710/1, EP/R018413/1, EP/R004889/1, EP/P020720/1] and a Royal Academy of Engineering Research Chair.
