
1 Introduction

Anomalies can be defined in many ways; usually, observations that do not follow the expected patterns are considered anomalies. Whether a pattern is anomalous depends heavily on the specific scene, which makes anomalies difficult to define: the same pattern may be considered normal in one scene but anomalous in another. For example, a riding person is normal within a riding group, but may be considered an anomaly when riding through a group of walking people. Depending on their spatial extent, anomalies can be categorized into individual behavior anomalies and group behavior anomalies. In this paper, we focus on individual behavior anomaly detection in video, and patterns that do not conform with their surroundings are considered anomalies. To be precise, we learn a model only for the regular patterns, and patterns that do not follow this model are classified as anomalies.

Due to the complexity of video data, direct detection of anomalies in video is nearly impossible. Most existing work tries to solve this problem by establishing a reference model on the training videos; test patterns that do not resemble this model are considered anomalies. Reconstruction based methods [4, 6] classify test video with large reconstruction error as anomalous, but in complicated scenes they usually perform unsatisfactorily, because large scale reconstruction is impossible there. Feature based methods try to detect anomalies in feature space [8, 21]: video dynamics are encoded into features in a selective way by the feature extraction module, and features are believed to be easier to classify than raw data. These methods need a well chosen representation module and a suitable classifier, but none of the existing feature based methods learn features that are targeted at anomaly detection. To handle this problem, in this paper we propose an anomaly detection framework based on key-region representation learning.

Our anomaly detection framework consists of three parts: the video preprocessing module, the representation learning module, and the normal/abnormal classifier. We give a brief discussion below and a detailed discussion in Sect. 3.

In this paper, a key-region selection module is proposed as the video preprocessing module. The goal of this module is to find the video patches in which anomalies may appear, removing the influence of background patches. Subsequent modules operate only on these patches, which enables the representation learning module to focus on the patterns of key regions. For the representation learning module, we adopt a variant of the auto-encoder architecture in [18], which was proposed as a robust feature extractor. Finally, we use a Mahalanobis distance based normal/abnormal classifier, proposed in [13] as a probabilistic model for image classification; we demonstrate that this classifier captures the data distribution better and can be adapted to an online system.

The main contributions of this paper are: (1) we propose a key-region extraction module used both for training the auto-encoder and for detecting anomalies; (2) we utilize a Mahalanobis distance based classifier to decide whether a test sample is anomalous; (3) to allow the framework to be implemented online, we introduce an online updating method for the classifier.

The rest of this paper is organized as follows: In Sect. 2, we introduce related work in the anomaly detection area. In Sect. 3, we describe the details of our method. In Sect. 4, we show the experimental results. In Sect. 5, we conclude with a brief summary.

2 Related Work

Work in this area can be broadly categorized into feature based and non-feature based methods. Among the latter, reconstruction based methods were the most popular [5, 11, 20]. A reconstruction model was trained over regular patterns, minimizing the reconstruction error on training samples; at test time, samples with large reconstruction error were classified as anomalies. However, some researchers pointed out that reconstruction based methods tended simply to memorize the input [12], whereas in video anomaly detection the goal is to learn the temporal and spatial structure of video events. Methods based on predicting future frames were therefore proposed to address this problem, following the intuition that prediction requires more information about motion and appearance than simple reconstruction. In fact, it was also pointed out that prediction based methods tended merely to remember the last few frames in order to predict the next one. Besides, due to the complicated dynamics of video data, reconstructing whole video frames was impractical, and a sophisticated preprocessing procedure was needed in practice.

Feature based methods [16, 21] usually consist of two parts, a representation learning model and a normal/abnormal classifier, as reviewed in [14]. The key is representation learning: how well an anomaly detection framework works mainly depends on the representative power of the learned features. Hand-crafted features were the main choice before neural networks became popular. In recent years, neural networks have shown powerful representation learning ability [17]. In the anomaly detection area, however, most existing powerful neural networks cannot be used directly, since they need to be trained in a supervised way, whereas in anomaly detection we usually have no labelled data but only positive samples. The auto-encoder, in contrast, has proved to be an effective unsupervised representation learning method and is thus widely used in anomaly detection [4, 6, 15]. Auto-encoder based methods are essentially reconstruction methods, but we emphasize their representation learning ability here. To the best of our knowledge, they are still the most popular representation learning method in the anomaly detection area. In this paper, we also adopt an auto-encoder as the representation learning module.

Once representation learning is done, a normal/abnormal classifier is needed. Many classifiers have been proposed over the years; among them, the one-class SVM [9], the Gaussian classifier, and distance based (clustering, nearest neighbor, etc.) classifiers [7] are the most frequently used. Each of them has a preference for a specific data distribution; for example, the linear one-class SVM is suitable for data that mainly lies on one side of a hyperplane. In order to capture the data distribution better, we adopt a Mahalanobis distance based classifier.

Fig. 1. An overview of our proposed anomaly detection framework.

3 The Proposed Method

Figure 1 gives an overview of our proposed anomaly detection framework. It consists of three parts: the key-region selection module, the representation learning module, and the normal/abnormal classifier. Below we describe the details of the three modules.

3.1 Key-Region Selection Module

Establishing a model over raw video frames is difficult, because the model complexity would have to be extremely high to capture complicated video dynamics. One widely used way to handle this problem is to divide the video into small spatial-temporal patches [3, 8, 10, 15] and to conduct subsequent operations and anomaly detection over these patches. But this approach suffers from several problems. First, most of the patches contain only background information, which does not help individual behavior modeling. Second, to the best of our knowledge, all anomaly detection frameworks that use this approach divide the video into uniform patches straightforwardly, without considering moving targets as a whole. We design our key-region selection module with the following considerations in mind.

The key to our anomaly detection framework is the representation learning module: only if features with enough representative power are learned can the subsequent classifier classify the samples correctly. Reconsidering our goal in anomaly detection, we want to model the patterns of moving targets; however, as shown in Fig. 1, most of the patches contain only background information. If we trained our representation learning model on all these patches, the learned features might lack representative power for what we are concerned with, since most of the training patches have nothing to do with our concern. If instead we train the model only on patches that contain moving targets, the learned features will be better targeted at anomaly detection.

The goal of this module is to find the patches that contain moving targets. In this paper, we view this as a foreground/background classification problem, since the two problems share many similarities. The module is designed based on the following two observations: (1) the intensity distribution of a foreground patch differs from that of its surrounding patches; (2) the intensity within a foreground patch is diverse, while it is relatively uniform within a background patch.

Fig. 2. An illustration of the similarity evaluation between a patch and its surrounding patches.

We propose a measure for each of the two points mentioned above. Based on the first point, we use the cosine distance [22] to measure the similarity between a patch and its surrounding patches.

$$\begin{aligned} SIM=\frac{P_{1}^T*P_{2}}{\parallel P_{1}\parallel *\parallel P_{2}\parallel } \end{aligned}$$
(1)

where \(P_{1}\) is a column vector representing the intensity distribution of the patch under consideration, and \(P_{2}\) represents the intensity distribution of its surrounding patches. Figure 2 shows explicitly how a patch is evaluated. For the second point, we use the entropy from information theory to measure the intensity diversity within a patch.

$$\begin{aligned} E=-\sum _{i=0}^{255}P(i)logP(i) \end{aligned}$$
(2)

We then give every patch a score based on these two measures; the higher the score, the more likely the patch is a foreground patch.

$$\begin{aligned} Score=E-\lambda *SIM \end{aligned}$$
(3)

where \(\lambda \) is a parameter that balances the relative importance of the two measures.
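To make the scoring concrete, the following is a minimal NumPy sketch of Eqs. (1)–(3). The patch representation (a normalized 256-bin intensity histogram), the way the surrounding patches are pooled, and the default \(\lambda \) value are our own illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def intensity_histogram(patch, bins=256):
    """Normalized intensity histogram of a gray-scale patch (values in [0, 255])."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist.astype(np.float64) / max(hist.sum(), 1)

def cosine_similarity(p1, p2):
    """Eq. (1): cosine similarity between two intensity distributions."""
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2) / denom if denom > 0 else 0.0

def entropy(hist):
    """Eq. (2): entropy of the intensity distribution within a patch."""
    nz = hist[hist > 0]
    return float(-np.sum(nz * np.log(nz)))

def patch_score(patch, surrounding, lam=2.0):
    """Eq. (3): a high score suggests a foreground (key-region) patch."""
    p1 = intensity_histogram(patch)
    # Pool the surrounding patches into one histogram (one possible choice).
    p2 = intensity_histogram(np.concatenate([s.ravel() for s in surrounding]))
    return entropy(p1) - lam * cosine_similarity(p1, p2)
```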

Fig. 3. Architecture of the representation learning module.

3.2 Representation Learning Module

We utilize a variant of the sparse auto-encoder [18] as the representation learning module. The model architecture is shown in Fig. 3. The network transforms the input x \(\in R^{D}\) into a hidden representation h \(\in R^{d}\) through the encoder.

$$\begin{aligned} h = f_{\theta }(x)=\sigma (W_{1}x+b_{1}) \end{aligned}$$
(4)

where \(\theta = (W_{1},b_{1})\), \(W_{1}\in R^{D*d}\) is the weight matrix and \(b_{1} \in R^{d}\) is the bias term. \(\sigma \) is the sigmoid activation function.

$$\begin{aligned} \sigma (x) = \frac{1}{1+e^{-x}} \end{aligned}$$
(5)

The decoder then maps the hidden representation to the output, a reconstruction of the input.

$$\begin{aligned} y = g_{\theta ^{'}}(h) = \sigma (W_{2}h+b_{2}) \end{aligned}$$
(6)

where \(\theta ^{'} = (W_{2},b_{2})\), \(W_{2} \in R^{d*D}\) is the weight matrix and \(b_{2} \in R^{D}\) is the bias term. \(\sigma \) is the same sigmoid as above. For an auto-encoder, we obtain the optimal parameters \(\theta \) and \(\theta ^{'}\) by minimizing the reconstruction error between the input x and the reconstruction y.

$$\begin{aligned} \theta ^{*}, \theta ^{'*} = \mathop {\arg \min }_{\theta ,\theta ^{'}}E_{p(x)}L(x,y) \end{aligned}$$
(7)

L(x,y) is the reconstruction error between x and the corresponding output y. In this paper, we adopt the Euclidean distance as the reconstruction error.

$$\begin{aligned} L(x,y)=||x-y || \end{aligned}$$
(8)

With a training set of size m, we obtain \(\theta ^{*}, \theta ^{'*}\) by minimizing the objective function.

$$\begin{aligned} \theta ^{*}, \theta ^{'*} = \mathop {\arg \min }_{\theta ,\theta ^{'}}\frac{1}{m}\sum _{i=1}^{m}L(x^{i},y^{i})+\sum _{j=1}^{d}KL(\rho ||\rho _{j}) \end{aligned}$$
(9)

The loss function consists of two parts: the first is the reconstruction error introduced above; the second serves as a sparsity constraint, which can be understood as a prior over the parameter distribution, as explained in [18]. \(\rho \) is a preset activation value, usually much smaller than 1, and \(\rho _{j}\) is the average activation of hidden unit j over the m training samples, given in Eq. (10). Equation (9) is optimized through gradient descent.

$$\begin{aligned} \rho _{j} = \frac{1}{m}\sum _{i=1}^{m}a_{j}(x^{i}) \end{aligned}$$
(10)
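As a reference, a minimal NumPy sketch of the forward pass and the objective in Eqs. (4)–(10) is given below. We assume the standard Bernoulli form of the KL sparsity penalty commonly used with sparse auto-encoders, store \(W_{1}\) as a d×D matrix and \(W_{2}\) as a D×d matrix, and omit the gradient descent step itself; in practice the gradients would be obtained by back-propagation or an autodiff framework.

```python
import numpy as np

def sigmoid(z):
    """Eq. (5): logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W1, b1):
    """Eq. (4): h = sigma(W1 x + b1)."""
    return sigmoid(W1 @ x + b1)

def decode(h, W2, b2):
    """Eq. (6): y = sigma(W2 h + b2)."""
    return sigmoid(W2 @ h + b2)

def sparse_ae_objective(X, W1, b1, W2, b2, rho=0.1):
    """Eqs. (8)-(10): mean reconstruction error plus KL sparsity penalty.

    X has shape (m, D); the hidden dimension is d = len(b1).
    """
    H = sigmoid(X @ W1.T + b1)                        # (m, d) hidden activations
    Y = sigmoid(H @ W2.T + b2)                        # (m, D) reconstructions
    recon = np.mean(np.linalg.norm(X - Y, axis=1))    # Eq. (8) averaged over m
    rho_j = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)   # Eq. (10), clipped for log
    kl = np.sum(rho * np.log(rho / rho_j)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return recon + kl
```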
Fig. 4. Each circle represents the points that have the same distance to the center point. As shown, the Mahalanobis distance can capture dependence between different dimensions.

3.3 Normal/Abnormal Classifier

Most distance based methods use the Euclidean distance as the measure; these methods rely on the assumption that the different dimensions of the feature are independent. In practice, however, this is usually not the case. Mahalanobis distance based methods do not need this isotropy assumption. We give a comparison between the two in Fig. 4, considering only the two-dimensional case for visualization.

$$\begin{aligned} MahalanobisDistance = (x-\mu )^{T}\varSigma ^{-1}(x-\mu ) \end{aligned}$$
(11)

\(\mu \) is the mean vector of the samples and \(\varSigma \) is the covariance matrix. The Mahalanobis distance can be interpreted intuitively as the Euclidean distance in a transformed, non-isotropic coordinate system. Decomposing \(\varSigma ^{-1}\) as \(W^{T}diag(\varLambda )W\), we get Eq. (12).

$$\begin{aligned} MahalanobisDistance = (x-\mu )^{T}W^{T}diag(\varLambda )W(x-\mu )=\parallel diag(\varLambda )^{\frac{1}{2}}W(x-\mu )\parallel ^{2} \end{aligned}$$
(12)

The usual way of computing \(\mu \) and \(\varSigma \) is

$$\begin{aligned} \mu =\frac{1}{n}\sum _{i=1}^{n}x_{i} \end{aligned}$$
(13)
$$\begin{aligned} \varSigma =\frac{1}{n}\sum _{i=1}^{n}(x_{i}-\mu )(x_{i}-\mu )^{T} \end{aligned}$$
(14)

But this computation cannot be used online, since \(\mu \) and \(\varSigma \) would have to be recomputed over all samples every time new samples arrive, which is time-consuming and has a huge storage requirement. We therefore adopt the method in [19] to compute them efficiently, allowing the classifier to be used in an online system. Denote the mean vector and covariance matrix of the previous n−1 samples as \(\mu _{n-1}\) and \(\varSigma _{n-1}\), respectively. When a new sample \(x_{n}\) arrives, the new mean vector and covariance matrix, denoted \(\mu _{n}\) and \(\varSigma _{n}\), are given by the following update rules.

$$\begin{aligned} \mu _{n} = \mu _{n-1}+\frac{x_{n}-\mu _{n-1}}{n} \end{aligned}$$
(15)
$$\begin{aligned} \varSigma _{n} = \frac{n-1}{n}\varSigma _{n-1}+\frac{1}{n}(x_{n}-\mu _{n-1})(x_{n}-\mu _{n})^{T} \end{aligned}$$
(16)
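The sketch below shows how Eqs. (11), (15) and (16) fit together in an online classifier. The small regularizer added to \(\varSigma \) before inversion and the decision threshold are illustrative additions of ours, not part of the original formulation.

```python
import numpy as np

class OnlineMahalanobisClassifier:
    """Running estimate of (mu, Sigma) via Eqs. (15)-(16), used to score
    samples by their Mahalanobis distance to the normal model (Eq. (11))."""

    def __init__(self, dim, eps=1e-6):
        self.n = 0
        self.mu = np.zeros(dim)
        self.sigma = np.zeros((dim, dim))
        self.eps = eps                        # keeps Sigma invertible

    def update(self, x):
        """Incorporate one new feature vector x."""
        self.n += 1
        mu_prev = self.mu.copy()
        self.mu = mu_prev + (x - mu_prev) / self.n                      # Eq. (15)
        self.sigma = ((self.n - 1) / self.n) * self.sigma \
            + np.outer(x - mu_prev, x - self.mu) / self.n               # Eq. (16)

    def score(self, x):
        """Eq. (11): squared Mahalanobis distance of x to the normal model."""
        cov = self.sigma + self.eps * np.eye(len(self.mu))
        diff = x - self.mu
        return float(diff @ np.linalg.solve(cov, diff))

    def is_anomalous(self, x, threshold):
        """Flag a sample as anomalous if its distance exceeds a threshold."""
        return self.score(x) > threshold
```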
Fig. 5. Right: the pseudo-color score map, where yellow represents a high score and blue a low score. Left: the corresponding regions with relatively high scores, cropped from the original image. (Color figure online)

4 Experiment Validation

4.1 Parameter Setting

The size of each video patch is set to 10 * 10 * 5, following [1, 15]. The weight parameter \(\lambda \) in Eq. (3) is set to 2. For the representation learning module, the number of hidden units is 100 and the sparsity parameter \(\rho \) is 0.1. At the training stage, we first initialize the parameters from a normal distribution, and then use stochastic gradient descent [2] to train the representation learning module introduced in Sect. 3.2, with the learning rate set to 0.5. The network takes about 4 hours to converge on an i7-5960X CPU.
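For convenience, the settings above can be collected into a single configuration object; the field names below are hypothetical and chosen only for illustration.

```python
# Hypothetical summary of the hyper-parameters listed above (names are ours).
CONFIG = {
    "patch_size": (10, 10, 5),   # spatial-temporal patch (height, width, frames)
    "lambda_score": 2.0,         # weight in Eq. (3)
    "hidden_units": 100,         # d, size of the hidden representation
    "sparsity_rho": 0.1,         # target activation rho in Eq. (9)
    "learning_rate": 0.5,        # SGD learning rate
    "init": "normal",            # parameter initialization
}
```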

4.2 Experiment Result

The experiments are conducted on the UCSD Ped2 dataset, which is the most widely used dataset in anomaly detection. This dataset is composed of 16 training sequences and 12 test sequences. The videos are gray scale with a resolution of 240 * 360.

Here we show the experimental results of the proposed anomaly detection framework. We validate the effectiveness of the proposed patch selection module for key-region extraction in Fig. 5. As shown in the figure, the patch selection module assigns a high score to patches that contain our targets and a relatively low score to patches that contain only background information. For better visualization, we normalize the scores into [0, 1] using a linear mapping.

For the performance assessment of the whole algorithm, we compare our work with [15], since the two share a similar structure. We compare them using the most important evaluation indicators in anomaly detection, the ROC curve and the EER, and we also compare their running speed on our machine. The ROC curves are shown in Fig. 6; the experiments show that the AUC of our work surpasses Sabokrou's by 8%. The results for other assessment criteria are shown in Table 1. The average processing time per frame of our work is slightly longer than theirs, since our work must go through a patch selection procedure, but compared to the gain in AUC, this sacrifice is worthwhile. Moreover, to emphasize the importance of the key-region selection module, we discarded this module and the experimental results showed a decline in performance.

Fig. 6. ROC of our work compared to Sabokrou's work in [15]

Table 1. Comparison of our work with Sabokrou’s work

5 Conclusion

In this paper, we propose an anomaly detection framework, and the experiments show that it achieves better performance than work with a similar structure. We attribute this improvement to our patch selection module and the Mahalanobis distance based classifier. We also introduce an online implementation method for this algorithm.