
1 Introduction

Anomalies can be defined in many ways; usually, observations that do not follow the expected patterns are considered anomalies. Whether a pattern is anomalous depends heavily on the specific scene, which makes anomalies difficult to define: the same pattern may be considered normal in one scene but anomalous in another. For example, a riding person is normal within a riding group, but may be considered an anomaly when riding through a group of walking people. Depending on their spatial extent, anomalies can be categorized into individual behavior anomalies and group behavior anomalies. In this paper, we focus on individual behavior anomaly detection in video, and patterns that do not conform with their surroundings are considered anomalies. To be precise, we learn a model only for the regular patterns, and patterns that do not follow this model are classified as anomalies.

Due to the complexity of video data, direct detection of anomalies in video is nearly impossible. Most existing work tries to solve this problem by establishing a reference model on the training videos; test patterns that do not resemble this model are considered anomalies. Reconstruction based methods [4, 6] classify test video with large reconstruction error as anomalous, but in complicated scenes they usually perform unsatisfactorily, because large scale reconstruction is impossible there. Feature based methods try to detect anomalies in feature space [8, 21]: video dynamics are encoded into features in a selective way by the feature extraction module, and features are believed to be easier to classify than raw data. These methods need a well chosen representation module and a suitable classifier, but none of the existing feature based methods learn features that are targeted at anomaly detection. To handle this problem, in this paper we propose an anomaly detection framework based on key-region representation learning.

Our anomaly detection framework consists of three parts: the video preprocessing module, the representation learning module, and the normal/abnormal classifier. We give a brief discussion below and a detailed discussion in Sect. 3.

In this paper, a key-region selection module is proposed as the video preprocessing module. The goal of this module is to find the video patches in which anomalies may appear, removing the influence of background patches. Subsequent modules operate only on these patches, which enables the representation learning module to focus on the patterns of key regions. For the representation learning module, we adopt a variant of the auto-encoder architecture in [18], which was proposed as a robust feature extractor. Finally, we use a Mahalanobis distance based normal/abnormal classifier, proposed in [13] as a probabilistic model for image classification; we demonstrate that this classifier captures the data distribution better and can be adapted to an online system.

The main contributions of this paper are: (1) we propose a key-region extraction module used both for training the auto-encoder and for detecting anomalies; (2) we utilize a Mahalanobis distance based classifier to decide whether a test sample is anomalous; (3) to allow the framework to be implemented online, we introduce an online updating method for the classifier.

The rest of this paper is organized as follows: In Sect. 2, we introduce related work in the anomaly detection area. In Sect. 3, we describe the details of our method. In Sect. 4, we show the experimental results. In Sect. 5, we conclude with a brief summary.

2 Related Work

Work in this area can be broadly categorized into feature based and non-feature based methods. Among the latter, reconstruction based methods were the most popular [5, 11, 20]. A reconstruction model was trained over regular patterns, minimizing the reconstruction error on training samples; at test time, samples with large reconstruction error were classified as anomalies. However, some researchers pointed out that reconstruction based methods tended simply to memorize the input [12], whereas in video anomaly detection the goal is to learn the temporal and spatial structure of video events. Methods based on predicting future frames were therefore proposed to address this problem, following the intuition that prediction requires more information about motion and appearance than simple reconstruction. In fact, it was also pointed out that prediction based methods tended merely to remember the last few frames in order to predict the next one. Besides, due to the complicated dynamics of video data, reconstructing whole video frames was impractical, and a sophisticated preprocessing procedure was needed in practice.

Feature based methods [16, 21] usually consist of two parts, a representation learning model and a normal/abnormal classifier, as reviewed in [14]. The key is representation learning: how well an anomaly detection framework works mainly depends on the representative power of the learned features. Hand-crafted features were the main choice before neural networks became popular. In recent years, neural networks have shown powerful representation learning ability [17]. In the anomaly detection area, however, most existing powerful neural networks cannot be used directly, since they need to be trained in a supervised way, whereas in anomaly detection we usually have no labelled data but only positive samples. The auto-encoder, in contrast, has proved to be an effective unsupervised representation learning method and is thus widely used in anomaly detection [4, 6, 15]. Auto-encoder based methods are essentially reconstruction methods, but we emphasize their representation learning ability here. To the best of our knowledge, they are still the most popular representation learning method in the anomaly detection area. In this paper, we also adopt an auto-encoder as the representation learning module.

Once representation learning is done, a normal/abnormal classifier is needed. Many classifiers have been proposed over the years; among them, the one-class SVM [9], the Gaussian classifier, and distance based (clustering, nearest neighbor, etc.) classifiers [7] are the most frequently used. Each of them has a preference for a specific data distribution; for example, the linear one-class SVM is suitable for data that mainly lies on one side of a hyperplane. In order to capture the data distribution better, we adopt a Mahalanobis distance based classifier.

Fig. 1. An overview of our proposed anomaly detection framework.

3 The Proposed Method

Figure 1 gives an overview of our proposed anomaly detection framework. It consists of three parts: the key-region selection module, the representation learning module, and the normal/abnormal classifier. Below we describe the details of the three modules.

3.1 Key-Region Selection Module

Establishing a model over raw video frames is difficult, because the model complexity would have to be extremely high to capture complicated video dynamics. One widely used way to handle this problem is to divide the video into small spatial-temporal patches [3, 8, 10, 15] and to conduct subsequent operations and anomaly detection over these patches. But this approach suffers from several problems. First, most of the patches contain only background information, which does not help individual behavior modeling. Second, to the best of our knowledge, all anomaly detection frameworks that use this approach divide the video into uniform patches straightforwardly, without considering moving targets as a whole. We design our key-region selection module with the following considerations in mind.

The key to our anomaly detection framework is the representation learning module: only if features with enough representative power are learned can the subsequent classifier classify the samples correctly. Reconsidering our goal in anomaly detection, we want to model the patterns of moving targets; however, as shown in Fig. 1, most of the patches contain only background information. If we trained our representation learning model on all these patches, the learned features might lack representative power for what we are concerned with, since most of the training patches have nothing to do with our concern. If instead we train the model only on patches that contain moving targets, the learned features will be better targeted at anomaly detection.

The goal of this module is to find the patches that contain moving targets. In this paper, we view this as a foreground/background classification problem, since the two problems share many similarities. The module is designed based on the following two observations: (1) the intensity distribution of a foreground patch differs from that of its surrounding patches; (2) the intensity within a foreground patch is diverse, while it is relatively uniform within a background patch.

Fig. 2. An illustration of the similarity evaluation between a patch and its surrounding patches.

We propose a measure for each of the two points mentioned above. Based on the first point, we use the cosine distance [22] to measure the similarity between a patch and its surrounding patches.

$$\begin{aligned} SIM=\frac{P_{1}^T*P_{2}}{\parallel P_{1}\parallel *\parallel P_{2}\parallel } \end{aligned}$$
(1)

where \(P_{1}\) is a column vector representing the intensity distribution of the patch under consideration, and \(P_{2}\) represents the intensity distribution of its surrounding patches. Figure 2 shows explicitly how a patch is evaluated. For the second point, we use the entropy from information theory to measure the intensity diversity within a patch.

$$\begin{aligned} E=-\sum _{i=0}^{255}P(i)logP(i) \end{aligned}$$
(2)

We then give every patch a score based on these two measures; the higher the score, the more likely the patch is a foreground patch.

$$\begin{aligned} Score=E-\lambda *SIM \end{aligned}$$
(3)

where \(\lambda \) is a parameter that balances the relative importance of the two measures.
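To make the scoring concrete, the following is a minimal NumPy sketch of Eqs. (1)–(3). The patch representation (a normalized 256-bin intensity histogram), the way the surrounding patches are pooled, and the default \(\lambda \) value are our own illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def intensity_histogram(patch, bins=256):
    """Normalized intensity histogram of a gray-scale patch (values in [0, 255])."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist.astype(np.float64) / max(hist.sum(), 1)

def cosine_similarity(p1, p2):
    """Eq. (1): cosine similarity between two intensity distributions."""
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2) / denom if denom > 0 else 0.0

def entropy(hist):
    """Eq. (2): entropy of the intensity distribution within a patch."""
    nz = hist[hist > 0]
    return float(-np.sum(nz * np.log(nz)))

def patch_score(patch, surrounding, lam=2.0):
    """Eq. (3): a high score suggests a foreground (key-region) patch."""
    p1 = intensity_histogram(patch)
    # Pool the surrounding patches into one histogram (one possible choice).
    p2 = intensity_histogram(np.concatenate([s.ravel() for s in surrounding]))
    return entropy(p1) - lam * cosine_similarity(p1, p2)
```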

Fig. 3. Architecture of the representation learning module.

3.2 Representation Learning Module

We utilize a variant of the sparse auto-encoder [18] as the representation learning module. The model architecture is shown in Fig. 3. The network transforms the input x \(\in R^{D}\) into a hidden representation h \(\in R^{d}\) through the encoder.

$$\begin{aligned} h = f_{\theta }(x)=\sigma (W_{1}x+b_{1}) \end{aligned}$$
(4)

where \(\theta = (W_{1},b_{1})\), \(W_{1}\in R^{D*d}\) is the weight matrix and \(b_{1} \in R^{d}\) is the bias term. \(\sigma \) is the sigmoid activation function.

$$\begin{aligned} \sigma (x) = \frac{1}{1+e^{-x}} \end{aligned}$$
(5)

The decoder then maps the hidden representation to the output, a reconstruction of the input.

$$\begin{aligned} y = g_{\theta ^{'}}(h) = \sigma (W_{2}h+b_{2}) \end{aligned}$$
(6)

where \(\theta ^{'} = (W_{2},b_{2})\), \(W_{2} \in R^{d*D}\) is the weight matrix and \(b_{2} \in R^{D}\) is the bias term. \(\sigma \) is the same sigmoid as above. For an auto-encoder, we obtain the optimal parameters \(\theta \) and \(\theta ^{'}\) by minimizing the reconstruction error between the input x and the reconstruction y.

$$\begin{aligned} \theta ^{*}, \theta ^{'*} = \mathop {\arg \min }_{\theta ,\theta ^{'}}E_{p(x)}L(x,y) \end{aligned}$$
(7)

L(x,y) is the reconstruction error between x and the corresponding output y. In this paper, we adopt the Euclidean distance as the reconstruction error.

$$\begin{aligned} L(x,y)=||x-y || \end{aligned}$$
(8)

With a training set of size m, we obtain \(\theta ^{*}, \theta ^{'*}\) by minimizing the objective function.

$$\begin{aligned} \theta ^{*}, \theta ^{'*} = \mathop {\arg \min }_{\theta ,\theta ^{'}}\frac{1}{m}\sum _{i=1}^{m}L(x^{i},y^{i})+\sum _{j=1}^{d}KL(\rho ||\rho _{j}) \end{aligned}$$
(9)

The loss function consists of two parts: the first is the reconstruction error introduced above; the second serves as a sparsity constraint, which can be understood as a prior over the parameter distribution, as explained in [18]. \(\rho \) is a preset activation value, usually much smaller than 1, and \(\rho _{j}\) is the average activation of hidden unit j over the m training samples, given in Eq. (10). Equation (9) is optimized through gradient descent.

$$\begin{aligned} \rho _{j} = \frac{1}{m}\sum _{i=1}^{m}a_{j}(x^{i}) \end{aligned}$$
(10)
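As a reference, a minimal NumPy sketch of the forward pass and the objective in Eqs. (4)–(10) is given below. We assume the standard Bernoulli form of the KL sparsity penalty commonly used with sparse auto-encoders, store \(W_{1}\) as a d×D matrix and \(W_{2}\) as a D×d matrix, and omit the gradient descent step itself; in practice the gradients would be obtained by back-propagation or an autodiff framework.

```python
import numpy as np

def sigmoid(z):
    """Eq. (5): logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W1, b1):
    """Eq. (4): h = sigma(W1 x + b1)."""
    return sigmoid(W1 @ x + b1)

def decode(h, W2, b2):
    """Eq. (6): y = sigma(W2 h + b2)."""
    return sigmoid(W2 @ h + b2)

def sparse_ae_objective(X, W1, b1, W2, b2, rho=0.1):
    """Eqs. (8)-(10): mean reconstruction error plus KL sparsity penalty.

    X has shape (m, D); the hidden dimension is d = len(b1).
    """
    H = sigmoid(X @ W1.T + b1)                        # (m, d) hidden activations
    Y = sigmoid(H @ W2.T + b2)                        # (m, D) reconstructions
    recon = np.mean(np.linalg.norm(X - Y, axis=1))    # Eq. (8) averaged over m
    rho_j = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)   # Eq. (10), clipped for log
    kl = np.sum(rho * np.log(rho / rho_j)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return recon + kl
```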
Fig. 4. Each circle represents the points that have the same distance to the center point. As shown, the Mahalanobis distance can capture dependence between different dimensions.

3.3 Normal/Abnormal Classifier

Most distance based methods use the Euclidean distance as the measure; these methods rely on the assumption that the different dimensions of the feature are independent. In practice, however, this is usually not the case. Mahalanobis distance based methods do not need this isotropy assumption. We give a comparison between the two in Fig. 4, considering only the two-dimensional case for visualization.

$$\begin{aligned} MahalanobisDistance = (x-\mu )^{T}\varSigma ^{-1}(x-\mu ) \end{aligned}$$
(11)

\(\mu \) is the mean vector of the samples and \(\varSigma \) is the covariance matrix. The Mahalanobis distance can be interpreted intuitively as the Euclidean distance in a transformed, non-isotropic coordinate system. Decomposing \(\varSigma ^{-1}\) as \(W^{T}diag(\varLambda )W\), we get Eq. (12).

$$\begin{aligned} MahalanobisDistance = (x-\mu )^{T}W^{T}diag(\varLambda )W(x-\mu )=\parallel diag(\varLambda )^{\frac{1}{2}}W(x-\mu )\parallel ^{2} \end{aligned}$$
(12)

The usual way of computing \(\mu \) and \(\varSigma \) is

$$\begin{aligned} \mu =\frac{1}{n}\sum _{i=1}^{n}x_{i} \end{aligned}$$
(13)
$$\begin{aligned} \varSigma =\frac{1}{n}\sum _{i=1}^{n}(x_{i}-\mu )(x_{i}-\mu )^{T} \end{aligned}$$
(14)

But this computation cannot be used online, since \(\mu \) and \(\varSigma \) would have to be recomputed over all samples every time new samples arrive, which is time-consuming and has a huge storage requirement. We therefore adopt the method in [19] to compute them efficiently, allowing the classifier to be used in an online system. Denote the mean vector and covariance matrix of the previous n−1 samples as \(\mu _{n-1}\) and \(\varSigma _{n-1}\), respectively. When a new sample \(x_{n}\) arrives, the new mean vector and covariance matrix, denoted \(\mu _{n}\) and \(\varSigma _{n}\), are given by the following update rules.

$$\begin{aligned} \mu _{n} = \mu _{n-1}+\frac{x_{n}-\mu _{n-1}}{n} \end{aligned}$$
(15)
$$\begin{aligned} \varSigma _{n} = \frac{n-1}{n}\varSigma _{n-1}+\frac{1}{n}(x_{n}-\mu _{n-1})(x_{n}-\mu _{n})^{T} \end{aligned}$$
(16)
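The sketch below shows how Eqs. (11), (15) and (16) fit together in an online classifier. The small regularizer added to \(\varSigma \) before inversion and the decision threshold are illustrative additions of ours, not part of the original formulation.

```python
import numpy as np

class OnlineMahalanobisClassifier:
    """Running estimate of (mu, Sigma) via Eqs. (15)-(16), used to score
    samples by their Mahalanobis distance to the normal model (Eq. (11))."""

    def __init__(self, dim, eps=1e-6):
        self.n = 0
        self.mu = np.zeros(dim)
        self.sigma = np.zeros((dim, dim))
        self.eps = eps                        # keeps Sigma invertible

    def update(self, x):
        """Incorporate one new feature vector x."""
        self.n += 1
        mu_prev = self.mu.copy()
        self.mu = mu_prev + (x - mu_prev) / self.n                      # Eq. (15)
        self.sigma = ((self.n - 1) / self.n) * self.sigma \
            + np.outer(x - mu_prev, x - self.mu) / self.n               # Eq. (16)

    def score(self, x):
        """Eq. (11): squared Mahalanobis distance of x to the normal model."""
        cov = self.sigma + self.eps * np.eye(len(self.mu))
        diff = x - self.mu
        return float(diff @ np.linalg.solve(cov, diff))

    def is_anomalous(self, x, threshold):
        """Flag a sample as anomalous if its distance exceeds a threshold."""
        return self.score(x) > threshold
```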
Fig. 5. Right: the pseudo-color score map, where yellow represents a high score and blue a low score. Left: the corresponding regions with relatively high scores, cropped from the original image. (Color figure online)

4 Experiment Validation

4.1 Parameter Setting

The size of each video patch is set to 10 * 10 * 5, following [1, 15]. The weight parameter \(\lambda \) in Eq. (3) is set to 2. For the representation learning module, the number of hidden units is 100 and the sparsity parameter \(\rho \) is 0.1. At the training stage, we first initialize the parameters from a normal distribution, and then use stochastic gradient descent [2] to train the representation learning module introduced in Sect. 3.2, with the learning rate set to 0.5. The network takes about 4 hours to converge on an i7-5960X CPU.
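For convenience, the settings above can be collected into a single configuration object; the field names below are hypothetical and chosen only for illustration.

```python
# Hypothetical summary of the hyper-parameters listed above (names are ours).
CONFIG = {
    "patch_size": (10, 10, 5),   # spatial-temporal patch (height, width, frames)
    "lambda_score": 2.0,         # weight in Eq. (3)
    "hidden_units": 100,         # d, size of the hidden representation
    "sparsity_rho": 0.1,         # target activation rho in Eq. (9)
    "learning_rate": 0.5,        # SGD learning rate
    "init": "normal",            # parameter initialization
}
```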

4.2 Experiment Result

The experiments are conducted on the UCSD Ped2 dataset, which is the most widely used dataset in anomaly detection. This dataset is composed of 16 training sequences and 12 test sequences. The videos are gray scale with a resolution of 240 * 360.

Here we show the experimental results of the proposed anomaly detection framework. We validate the effectiveness of the proposed patch selection module for key-region extraction in Fig. 5. As shown in the figure, the patch selection module assigns a high score to patches that contain our targets and a relatively low score to patches that contain only background information. For better visualization, we normalize the scores into [0, 1] using a linear mapping.

For the performance assessment of the whole algorithm, we compare our work with [15], since the two share a similar structure. We compare them using the most important evaluation indicators in anomaly detection, the ROC curve and the EER, and we also compare their running speed on our machine. The ROC curves are shown in Fig. 6; the experiments show that the AUC of our work surpasses Sabokrou's by 8%. The results for other assessment criteria are shown in Table 1. The average processing time per frame of our work is slightly longer than theirs, since our work must go through a patch selection procedure, but compared to the gain in AUC, this sacrifice is worthwhile. Moreover, to emphasize the importance of the key-region selection module, we discarded this module and the experimental results showed a decline in performance.

Fig. 6. ROC of our work compared to Sabokrou's work in [15]

Table 1. Comparison of our work with Sabokrou’s work

5 Conclusion

In this paper, we propose an anomaly detection framework, and the experiments show that it achieves better performance than work with a similar structure. We attribute this improvement to our patch selection module and the Mahalanobis distance based classifier. We also introduce an online implementation method for this algorithm.