
1 Introduction

Abnormal event detection in video is a challenging task in computer vision, as the definition of what an abnormal event looks like depends very much on the context. For instance, a car driving by on the street is regarded as a normal event, but if the car enters a pedestrian area, this is regarded as an abnormal event. A person jogging on the beach (normal event) versus a person running out of a bank (abnormal event) is another example. Although what is considered abnormal depends on the context, we can generally agree that abnormal behavior should be represented by unexpected events that occur less often than familiar (normal) events. As it is generally impossible to find a sufficiently representative set of anomalies, the use of traditional supervised learning methods is usually ruled out. Hence, most abnormal event detection approaches [1, 4, 11, 13, 14, 15, 16, 25, 27] learn a model of familiarity from a given training video and label events as abnormal if they deviate from the model. We approach abnormal behavior detection in a similar manner, and propose to build a model of normality by using a one-class Support Vector Machine (SVM) [20] classifier. The outliers detected by our approach are labeled as abnormal events at test time. Although it seems straightforward to apply one-class SVM, related works have adopted different approaches, for example dictionary learning [4, 5, 7, 14, 17] or locality sensitive hashing filters [26]. Nevertheless, we show in this paper that we can achieve state-of-the-art results by using one-class SVM.

Before training our normality model, we extract deep features by using convolutional neural networks (CNN) pre-trained on the ILSVRC benchmark [18]. Deep learning models reach impressive performance levels on object recognition from images [3, 9, 21, 23]. Although the features learned by CNN models are not specifically designed for tasks outside their original purpose, the knowledge embedded in the CNN features is quite general and can easily be transferred to various tasks, for example to the task of predicting the difficulty of an image [10]. To the best of our knowledge, we are the first to transfer pre-trained CNN features to the task of abnormal behavior detection in video.

We perform abnormal event detection experiments on the Avenue [14] and the UMN [16] data sets in order to compare our approach with several state-of-the-art methods [5, 6, 14, 16, 19, 22, 26]. The empirical results on the Avenue data set indicate that our model is able to surpass the state-of-the-art methods [6, 14] for this data set. As for the UMN data set, we are able to reach state-of-the-art performance on two of the three video scenes. Although we show that we can obtain better performance by employing deeper models [9, 21, 23] for feature extraction, we choose to use the VGG-f model [3], which allows us to process the video in real-time at 20 frames per second on a standard CPU.

We organize the paper as follows. We present related work on abnormal event detection in Sect. 2. We describe our learning framework in Sect. 3. We present the abnormal event detection experiments in Sect. 4. Finally, we draw our conclusions in Sect. 5.

2 Related Work

Abnormal event detection is usually formalized as an outlier detection task [1, 4, 5, 7, 11, 13, 14, 15, 16, 17, 22, 25, 26, 27], in which the general approach is to learn a model of normality from training data and consider the detected outliers as abnormal events. Some abnormal event detection approaches [4, 5, 7, 14, 17] are based on learning a dictionary of normal events, and label the events not represented by the dictionary as abnormal. Other approaches have employed deep features [25] or locality sensitive hashing filters [26] to achieve better results.

Interestingly, there have been some approaches that employ unsupervised steps for abnormal event detection [7, 17, 22, 25]. The approach presented in [7] is to build a model of familiar events from training data and incrementally update the model in an unsupervised manner as new patterns are observed in the test data. In a similar fashion, Sun et al. [22] train a Growing Neural Gas model starting from training videos and continue the training process as they analyze the test videos for anomaly detection. Ren et al. [17] use an unsupervised approach, spectral clustering, to build a dictionary of atoms, each representing one type of normal behavior. Their approach requires training videos of normal events to construct the dictionary. Xu et al. [25] use Stacked Denoising Auto-Encoders to learn deep feature representations in an unsupervised way. However, they still employ multiple one-class SVM models to predict the anomaly scores. The approach proposed in [6] is to detect changes in a sequence of data extracted from the video, in order to determine which frames are distinguishable from all the previous frames. As the authors want to build an approach independent of temporal ordering, they create shuffles of the data by permuting the frames before running each instance of the change detection. As we employ pre-trained CNN features, our feature extraction step is also unsupervised with respect to the approached task.

3 Method

3.1 Feature Extraction

In many computer vision tasks, higher level features, such as the ones learned with convolutional neural networks (CNN) [12], are the most effective. To build our appearance features, we consider a pre-trained CNN architecture able to process the frames as fast as possible, namely VGG-f [3]. Considering that we want our detection framework to work in real-time on a standard desktop computer, not equipped with an expensive GPU, VGG-f [3] is an excellent choice, as it can process about 20 frames per second on the CPU. We note, however, that better anomaly detection performance can be achieved by employing deeper CNN architectures, such as VGG-verydeep [21], GoogLeNet [23] or ResNet [9].

The VGG-f model is trained on the ILSVRC benchmark [18]. We use the pre-trained CNN model to extract deep features as follows. Given the input video, we resize the frames to \(224 \times 224\) pixels. We then subtract the mean image from each frame and provide it as input to the VGG-f model. We remove the fully-connected layers (identified as fc6, fc7 and softmax) and consider the activation maps of the last convolutional layer (conv5) as appearance features. While the fully-connected layers are adapted for object recognition, the last convolutional layer contains valuable appearance and pose information which is more useful for our anomaly detection task. Ideally, we would like to have at least slightly different representations for a person walking versus a person running. Interestingly, Feichtenhofer et al. [8] have also found that the conv5 features are more suitable for action recognition in video, a task closely related to ours.

Finally, we reshape each activation map into a 169-dimensional vector and concatenate the vectors corresponding to the 256 filters of the conv5 layer into a single feature vector of 43264 (\(13 \times 13 \times 256\)) components. The resulting feature vectors are normalized using the \(L_2\)-norm. It is important to note that, unlike other approaches [4, 25], we apply the same steps in order to extract features from video, irrespective of the data set.
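To make the extraction pipeline concrete, we include below a minimal Python sketch of the steps described above. It is not the exact MatConvNet/VGG-f pipeline used in our experiments: we use torchvision's pre-trained AlexNet (an architecture of similar depth and speed) as a stand-in, so the layer names and the dimensions of the final activation maps differ from the \(13 \times 13 \times 256\) conv5 maps mentioned above.

```python
# Illustrative sketch of the feature extraction step described above.
# NOTE: this is NOT the MatConvNet/VGG-f pipeline used in the paper; we use
# torchvision's pre-trained AlexNet as a stand-in, so the layer names and the
# size of the final activation maps differ from the 13 x 13 x 256 conv5 maps.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Keep only the convolutional trunk (the fully-connected layers are dropped).
cnn = models.alexnet(pretrained=True).features.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                    # resize each frame to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # mean subtraction (ImageNet stats)
                         std=[0.229, 0.224, 0.225]),
])

def extract_frame_feature(frame: Image.Image) -> torch.Tensor:
    """Return an L2-normalized vector built from the last conv layer's activation maps."""
    x = preprocess(frame).unsqueeze(0)                # shape: 1 x 3 x 224 x 224
    with torch.no_grad():
        maps = cnn(x)                                 # shape: 1 x C x H x W
    feature = maps.flatten()                          # concatenate all filter maps
    return F.normalize(feature, p=2, dim=0)           # L2-norm, as in the paper
```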

3.2 Learning Model

We use the one-class SVM approach of Schölkopf et al. [20] to detect abnormal events in video. The training data in our case is composed of a few videos representing only normal events. We consider each video frame as an individual and independent sample, disregarding the temporal relations between video frames. Let \(\mathcal {X} = \{ x_1, x_2, \ldots ,x_n \,|\, x_i \in \mathbb {R}^m\}\) denote the set of training frames. In this formulation, our one-class SVM model will learn to separate a small region capturing most of the normal frames from the rest of the feature space, by maximizing the distance from the separating hyperplane to the origin. This results in a binary classification function g which captures the regions of the input space where the probability density of the normal events lives:

$$\begin{aligned} g(x) = sign \left( \sum _{i=1}^{n} \alpha _i k(x, x_i) - \rho \right) , \end{aligned}$$
(1)

where x is a test frame that needs to be classified either as normal or abnormal, \(x_i \in \mathcal {X}\) is a training frame, k is a kernel function, \(\alpha _i\) are the weights assigned to the support vectors \(x_i\), and \(\rho \) is the distance from the hyperplane to the origin. If we desire a score reflecting the abnormality level of a frame, we can simply remove the sign transfer function from Eq. (1). The coefficients \(\alpha _i\) are found as the solution of the dual problem:

$$\begin{aligned} {\begin{matrix} \min _{\alpha } \frac{1}{2} \sum _{i=1}^n \sum _{j=1}^n \alpha _i \alpha _j k(x_i, x_j) \text { subject to } 0 \le \alpha _i \le \frac{1}{\nu n}, \sum _{i=1}^n \alpha _i = 1, \end{matrix}} \end{aligned}$$
(2)

where \(\nu \in [0,1]\) is a regularization parameter that controls the percentage of outliers to be excluded by the learned model. As noted by Schölkopf et al. [20], the offset \(\rho \) can be recovered by exploiting the fact that, for any \(\alpha _i\) that is not at the lower or upper bound, the corresponding sample \(x_i\) satisfies:

$$\begin{aligned} \rho = \sum _{j=1}^{n} \alpha _j k(x_j, x_i). \end{aligned}$$
(3)

Since we already represent the frames in a high dimensional space (\(m = 43264\)) by extracting CNN features, we no longer have to embed the samples into a higher dimensional space. Hence, we decide to use the linear kernel function in our one-class SVM model, which corresponds to the feature map \(\phi (x) = x\):

$$\begin{aligned} k(x,z) = \left\langle x, z \right\rangle . \end{aligned}$$
(4)
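For clarity, we also sketch below how the decision function in Eq. (1) reduces to a single dot product when the linear kernel in Eq. (4) is used. The sketch is only illustrative; the variable names are ours, and the dual coefficients, support vectors and offset are assumed to come from an already trained one-class SVM.

```python
import numpy as np

def decision_scores(X_test, support_vectors, alphas, rho):
    """Evaluate g(x) from Eq. (1) with the linear kernel k(x, z) = <x, z>.

    X_test          : (num_test, m) matrix of L2-normalized test frame features
    support_vectors : (num_sv, m) matrix of training frames x_i with non-zero alpha_i
    alphas          : (num_sv,) vector of dual coefficients
    rho             : offset of the separating hyperplane
    """
    # With a linear kernel, sum_i alpha_i <x, x_i> = <x, w>, where w = sum_i alpha_i x_i.
    w = alphas @ support_vectors          # (m,)
    scores = X_test @ w - rho             # un-thresholded values: lower means more abnormal
    labels = np.sign(scores)              # +1 -> normal region, -1 -> outlier (abnormal)
    return scores, labels
```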

4 Experiments

4.1 Data Sets

We show abnormal event detection results on two benchmark data sets.

Avenue. We first consider the Avenue data set [14], which contains 16 training and 21 test videos. In total, there are 15328 frames in the training set and 15324 frames in the test set. Each frame is \(640 \times 360\) pixels. Locations of anomalies are annotated in ground truth pixel-level masks for each frame in the testing videos.

UMN. The UMN Unusual Crowd Activity data set [16] consists of three different crowded scenes, with 1453, 4144, and 2144 frames, respectively. The resolution of each frame is \(320 \times 240\) pixels. In the normal setting, people walk around in the scene, and the abnormal behavior is defined as people running in all directions. As in [5], we use the first 400 frames in each scene for training.

4.2 Evaluation

We employ ROC curves and the corresponding area under the curve (AUC) as the evaluation metric, computed with respect to ground truth frame-level annotations, and, when available (only for the Avenue data set), pixel-level annotations. We define the frame-level and pixel-level AUC as in previous works [5, 6, 14, 15]. At the frame-level, a frame is considered a correct detection if it contains at least one abnormal pixel. At the pixel-level, the corresponding frame is considered as being correctly detected if more than \(40\%\) of the truly anomalous pixels are detected. We use the same approach as [6, 14] to compute the pixel-level AUC. The approach consists of resizing each frame to \(160 \times 120\) pixels, and uniformly partitioning each frame into a set of non-overlapping \(10 \times 10\) patches. Corresponding patches in 5 consecutive frames are stacked together to form a spatio-temporal cube, each with resolution \(10 \times 10 \times 5\). We remove the cubes with less than 5 non-zero values, as in [6, 14]. The frame-level scores produced by our framework are assigned to the remaining spatio-temporal cubes. The results are smoothed with the same filter used by [6, 14] in order to obtain our final pixel-level detections.
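The following sketch illustrates how the frame-level scores could be mapped onto spatio-temporal cubes for the pixel-level evaluation, following the description above. It is only an approximation of the protocol of [6, 14]: the smoothing filter is not reproduced, and assigning the score of the first frame of each cube is our own assumption, as the exact mapping is not specified here.

```python
import numpy as np

def build_cube_scores(frames, frame_scores, cube=10, depth=5, min_nonzero=5):
    """Assign frame-level scores to 10 x 10 x 5 spatio-temporal cubes.

    frames       : (T, 120, 160) array of frames already resized to 160 x 120
    frame_scores : (T,) array of frame-level anomaly scores
    Returns a dict {(t, row, col): score} for the retained cubes.
    """
    T, H, W = frames.shape
    cube_scores = {}
    for t in range(0, T - depth + 1, depth):
        for r in range(0, H, cube):
            for c in range(0, W, cube):
                block = frames[t:t + depth, r:r + cube, c:c + cube]
                if np.count_nonzero(block) < min_nonzero:
                    continue                  # discard nearly empty cubes, as in [6, 14]
                # Assumption: each retained cube receives the score of its first frame.
                cube_scores[(t, r // cube, c // cube)] = frame_scores[t]
    return cube_scores
```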

Although many works [5, 7, 14, 15, 25, 26] include the Equal Error Rate (EER) as an evaluation metric, we agree with [6] that metrics such as the EER can be misleading in a realistic anomaly detection setting, in which abnormal events are expected to be very rare. Thus, we do not use the EER in our evaluation.

4.3 Implementation Details

We extract deep appearance features from the training and the test video sequences. We consider the pre-trained VGG-f [3] and VGG-verydeep [21] models provided in MatConvNet [24]. To learn a model of normality, we employ the one-class SVM implementation from LibSVM [2]. In all the experiments, we set the regularization parameter \(\nu\) of the one-class SVM to 0.2, which means that at most \(20\%\) of the training frames can be singled out as outliers, while the remaining \(80\%\) must be treated as normal. Setting such a high value for the regularization parameter ensures that our model will not overfit the training data.
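A minimal sketch of this training step through LibSVM's Python bindings is shown below, assuming the conv5 feature vectors have already been extracted into the hypothetical lists train_feats and test_feats; the exact import path depends on how the LibSVM bindings are installed. The options -s 2, -t 0 and -n 0.2 select the one-class SVM formulation, the linear kernel and the \(\nu\) parameter, respectively.

```python
# Minimal sketch of the one-class SVM training step through LibSVM's Python
# bindings (the import path below assumes the 'libsvm-official' package).
# train_feats / test_feats are hypothetical lists of 43264-dimensional,
# L2-normalized conv5 feature vectors (one list of floats per frame).
from libsvm.svmutil import svm_train, svm_predict

train_labels = [1] * len(train_feats)        # labels are ignored by the one-class SVM

# -s 2: one-class SVM, -t 0: linear kernel, -n 0.2: the nu regularization parameter
model = svm_train(train_labels, train_feats, '-s 2 -t 0 -n 0.2')

# At test time, the un-thresholded decision values serve as anomaly scores.
_, _, decision_values = svm_predict([0] * len(test_feats), test_feats, model)
```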

In Table 1, we present preliminary results on the Avenue data set to provide empirical evidence in favor of the CNN features that we choose for the subsequent experiments. The results indicate that better performance can be obtained with the conv5 features rather than the fc6 or fc7 features. For the speed evaluation, we measure the time required to extract features and to predict the anomaly scores on a computer with an Intel Core i7 2.3 GHz processor and 8 GB of RAM, using a single core. We present the number of frames per second (FPS) in Table 1. Although we are able to report better results with the VGG-verydeep [21] architecture, we choose the shallower VGG-f [3] architecture for the rest of the experiments, as its processing time is about 10 times shorter. Using a single core, our final model is able to process the test videos in real-time at nearly 20 FPS.

Table 1. Abnormal event detection results in terms of frame-level and pixel-level AUC on the Avenue data set. We show results with different CNN architectures and features from different layers. The number of frames per second (FPS) is computed by running the models on a computer with Intel Core i7 2.3 GHz processor and 8 GB of RAM using a single core.
Table 2. Abnormal event detection results in terms of frame-level and pixel-level AUC on the Avenue data set. Our framework is compared with two state-of-the-art approaches [6, 14].

4.4 Results on the Avenue Data Set

We first compare our abnormal behavior detection framework based on deep features with two state-of-the-art approaches [6, 14]. The frame-level and pixel-level AUC metrics computed on the Avenue data set are presented in Table 2. Compared to the method of Del Giorno et al. [6], our framework yields an improvement of \(6.3\%\) in terms of frame-level AUC and an improvement of \(2.5\%\) in terms of pixel-level AUC. We also obtain better results than Lu et al. [14], as our framework gains \(3.7\%\) in terms of frame-level AUC and \(0.6\%\) in terms of pixel-level AUC. Overall, our method is able to surpass the performance of both state-of-the-art methods.

Fig. 1. Frame-level anomaly detection scores (between 0 and 1) provided by our framework for test video 4 in the Avenue data set. The video has 947 frames. Ground-truth abnormal events are represented in cyan, and our scores are illustrated in red. Best viewed in color.

Figure 1 illustrates the frame-level anomaly scores produced by our framework based on VGG-f features and one-class SVM, for test video 4 in the Avenue data set. According to the ground-truth anomaly labels, there are two abnormal events in this video. In Fig. 1, we notice that our scores correlate well with the ground-truth labels, and we can easily identify both abnormal events by setting a threshold of around 0.4, without including any false positive detections.
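As a small illustration of this thresholding step (the 0.4 value matches the discussion above, while the function and variable names are ours, and the scores are assumed to be already normalized to [0, 1]), consecutive frames whose score exceeds the threshold can be grouped into detected event intervals:

```python
import numpy as np

def scores_to_events(scores, threshold=0.4):
    """Group consecutive above-threshold frames into (start, end) event intervals."""
    abnormal = np.asarray(scores) > threshold
    events, start = [], None
    for i, flag in enumerate(abnormal):
        if flag and start is None:
            start = i                          # an abnormal event begins
        elif not flag and start is not None:
            events.append((start, i - 1))      # the event ends at the previous frame
            start = None
    if start is not None:
        events.append((start, len(abnormal) - 1))
    return events
```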

Fig. 2. True positive (top row) versus false positive (bottom row) detections of our framework based on VGG-f features and one-class SVM. Examples are selected from the Avenue data set. Best viewed in color.

We show some examples of true positive and false positive detections in Fig. 2. The true positive abnormal events are a person throwing an object and a person running, while false positive detections are two persons walking synchronously and a person carrying a backpack.

Table 3. Abnormal event detection results in terms of frame-level AUC on the UMN data set. Our framework is compared with several state-of-the-art supervised methods [5, 16, 19, 22, 26].
Fig. 3. Frame-level anomaly detection scores (between 0 and 1) provided by our framework for the second scene in the UMN data set. The test sequence has 3854 frames. Ground-truth abnormal events are represented in cyan, and our scores are illustrated in red. Best viewed in color.

4.5 Results on the UMN Data Set

On the UMN data set, we compare our framework with several supervised methods [5, 16, 19, 22, 26]. In Table 3, we present the frame-level AUC score for each individual scene, as well as the average score over all three scenes. On the first and the last scene, the performance of our one-class SVM framework based on deep features is on par with the state-of-the-art approaches. For the third scene, we are able to surpass the performance reported in [5, 26]. However, we obtain a much lower performance for the second scene, perhaps due to the significant illumination changes in this scene. Our approach yields an overall frame-level AUC of \(97.1\%\), which represents an improvement of \(1.1\%\) over the approach of Mehran et al. [16]. However, the best approach [22] on the UMN data set is nearly \(2.6\%\) better than our approach.

Fig. 4. True positive (top row) versus false positive (bottom row) detections of our framework based on VGG-f features and one-class SVM. Examples are selected from the second scene of the UMN data set. Best viewed in color.

As illustrated in Fig. 3, our approach is able to correctly identify the abnormal events in the second scene without any false positives, by applying a threshold of around 0.4. However, our approach does not detect the abnormal events right from the beginning. We believe that the changes in illumination when people enter the room have a negative impact on our approach. These observations are also applicable when we analyze the false positive detections presented in Fig. 4. Indeed, the example in the bottom right corner of Fig. 4 illustrates that our method triggers a false detection when a significant amount of light enters the room as the door opens. The true positive examples in Fig. 4 represent people running around in all directions.

5 Conclusion and Future Work

In this work, we have proposed a novel framework for abnormal event detection in video that is based on extracting deep features from pre-trained CNN models, and on using one-class SVM to learn a model of normality. We have conducted abnormal event detection experiments on two data sets in order to compare our approach with several state-of-the-art approaches [5, 6, 14, 16, 19, 22, 26]. The empirical results indicate that our approach gives better performance than some of these approaches [6, 14, 16], while processing the video online at 20 FPS.

Although our model can reach very good results, it completely disregards motion information and the temporal structure in video. In future work, we aim to improve our performance by including motion features into our framework. One possible approach would be to employ convolutional two-stream networks [8] to extract both motion and appearance features. We also aim to evaluate our framework on other data sets.