Elsevier

Pattern Recognition

Volume 105, September 2020, 107394

Video anomaly detection and localization using motion-field shape description and homogeneity testing

https://doi.org/10.1016/j.patcog.2020.107394

Highlights

  • We introduce a histogram-based shape descriptor for the motion field in each local patch.

  • The motion descriptor captures the motion trend and details in local patches.

  • We propose a similarity-based statistical model to detect spatio-temporal anomalies.

  • The statistical model relies on unsupervised learning without any prior assumption.

  • The method can adapt to the whole scene with tolerance to perspective distortion.

Abstract

Detecting and localizing abnormal behaviors in surveillance videos of crowded scenes is challenging: dense crowds and various objects performing highly unpredictable motions cause severe occlusions, which make object segmentation and tracking extremely difficult. We associate the optical flows across multiple frames to capture short-term trajectories and introduce a histogram-based shape descriptor for these trajectories, which faithfully reflects the motion trend and details in local patches. Furthermore, we propose a method to detect anomalies over time and space by judging whether the similarities between the testing sample and its retrieved K-NN samples follow the distribution of homogeneous intra-class similarities. This is unsupervised one-class learning that requires neither clustering nor prior assumptions. The scheme can adapt to the whole scene because the decision is based on probability, and the probability calculation is not affected by the motion distortions arising from perspective distortion, which is an advantage over existing solutions. We conduct experiments on real-world surveillance videos, and the results demonstrate that the proposed method can reliably detect and locate abnormal events in video sequences, outperforming state-of-the-art approaches.

Introduction

Owing to the rising demand for public security and the widespread deployment of surveillance cameras in public places, there is an urgent need for automated systems that can monitor and perceive human activities and raise alarms on abnormal events. In surveillance videos, the dominant, frequently occurring activities are referred to as normal behaviors, which are in general not of concern. Beyond these, the most important and challenging task of an intelligent video surveillance system is to detect and localize anomalous events, which are defined as those that occur with low probability [1]. In general, an abnormal event appears rarely and disappears within a short time. The goal of anomaly detection and localization is to automatically identify the short time span and the spatial region that cover the anomalous activities [2], [3].

In surveillance videos of public spaces, dense crowds and various objects performing highly random motions [2] make anomaly detection especially challenging. Traditional object-based approaches treat a crowd as a collection of individuals. Because such methods detect anomalies from objects’ appearances and trajectories, their performance depends directly on the accuracy of object extraction [4] and object tracking [5]. Unfortunately, isolating single individuals is nearly impossible in crowded scenes, because the high density of people and the irregular motions of various objects cause frequent and severe occlusions [2]. In addition, tracking multiple objects is quite time-consuming [6].

To avoid the difficulty of segmenting individuals in crowded scenes, recent work on anomaly detection has shifted to partitioning surveillance videos into spatio-temporal volumes of a fixed size, so as to focus on local scenes of short duration [7]. A volume-based detection model in temporal and spatial contexts is then established to decide whether a local scene corresponds to an abnormal event. Here, anomalies are patterns that have never appeared at a given site according to the historical records, or that deviate remarkably from those of their neighborhoods at the same time [3]. In the literature, unsupervised frameworks that use only normal volumes for training have drawn considerable attention, since anomalies are rare and differ from one another with unpredictable variations, which makes it almost impossible to model every type of abnormality [8].
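The partitioning step described above can be illustrated with a minimal sketch. It assumes a grayscale clip stored as a NumPy array of shape (T, H, W) and non-overlapping volumes; the function name `split_into_volumes` and the volume size of 8 x 16 x 16 are illustrative choices, not values taken from the paper.

```python
import numpy as np

def split_into_volumes(video, t=8, h=16, w=16):
    """Partition a (T, H, W) clip into non-overlapping t x h x w volumes.

    Trailing frames or pixels that do not fill a whole volume are dropped.
    Returns an array of shape (nT, nH, nW, t, h, w).
    """
    T, H, W = video.shape
    nT, nH, nW = T // t, H // h, W // w
    cropped = video[:nT * t, :nH * h, :nW * w]
    # Split each axis into (block index, offset inside block) ...
    blocks = cropped.reshape(nT, t, nH, h, nW, w)
    # ... then move the three block indices to the front.
    return blocks.transpose(0, 2, 4, 1, 3, 5)

# Hypothetical 16-frame clip of 32 x 48 pixels.
video = np.arange(16 * 32 * 48, dtype=np.float32).reshape(16, 32, 48)
volumes = split_into_volumes(video)
```

Indexing `volumes[i, j, k]` then yields the local scene at temporal block `i` and spatial cell `(j, k)`, which is the unit that a volume-based detector would score.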

We review two major categories of unsupervised approaches applied in anomaly detection in the following:

(1) A straightforward way to detect anomalous events is to apply clustering and treat the outliers as anomalies [3]. Such a scheme has been widely used in existing works [2], [7]. However, how to determine the number of clusters remains an open problem, which prevents the scheme from being extended to a broad spectrum of practical applications. Classical clustering algorithms such as k-means and the Gaussian mixture model (GMM) [3], [7] require the number of prototypical patterns to be known a priori [2]. In crowded scenes, however, motion patterns change continuously and randomly, so some of them cannot be foreseen and the number of prototypical patterns is uncertain. It is therefore impractical to fix the number of prototypical patterns in advance.

An alternative is to cluster on the basis of a distance threshold, which decides whether a sample belongs to an existing prototypical pattern or calls for a new prototype [9], [10], and whether two clusters should be merged [8]. Methods of this kind do not require the number of prototypical patterns to be known in advance. However, no single distance threshold is applicable to the whole scene, because the apparent size of an object of interest varies with its distance to the camera. This perspective distortion in turn distorts the observed motion, and it gives rise to the same problem as defining the number of prototypes. For example, in Fig. 1(a) the skater marked in red appears much smaller than the one marked in blue; Fig. 1(c) shows enlarged views of both to give an intuitive sense of the perspective distortion. In such a case, it is impossible to define a uniform threshold that groups the motion trajectories, represented by any descriptor, into reasonable clusters, because the object sizes vary with perspective.

Because of this object and motion distortion in surveillance scenarios, where the target size and motion step become larger as the target approaches the camera, some efforts have been made to tackle the issue. Chen and Lai [11] use thermal diffusion processing and perspective transformation to construct a coherent motion flow field, and then build a physical characteristic descriptor of crowd motion to model the crowd motion state of the flow field. However, the correction coefficient for the perspective transform requires manually selecting two parallel lines in each scene, which makes the method difficult to deploy in practice. Leyva et al. [12] divide the scene into cells of varying size to adapt to the change in target size caused by the scene’s perspective, and then extract foreground occupancy and optical flow features from these cells to detect abnormal events. However, the extent of distortion differs from scene to scene, so it is difficult to set the rate at which the cell size changes within a scene.

(2) The other category comprises reconstruction-based approaches, for example, the sparse reconstruction cost [13]. Yang et al. [13] reconstruct testing samples from the normal samples of previous or surrounding volumes, which act as the dictionary, and identify as anomalies the samples whose reconstruction errors exceed a predefined threshold. However, once even a small number of abnormal samples are mixed into the dictionary, the method fails to detect abnormal behaviors of the same kinds because the dictionary is corrupted. Besides, no threshold is applicable to the whole scene, because perspective distortion makes the reconstruction errors inhomogeneous across local regions at different positions.

In view of the weaknesses of the aforementioned approaches, we propose a motion-field shape descriptor along with a K-NN (K-nearest neighbors) similarity-based statistical model to detect anomalies over time and space, where neither clustering nor prior assumptions are needed. First, we associate the optical flows across multiple frames to capture the short-term trajectories in a video clip. The short-term trajectories characterize the motions over several consecutive frames and thus enhance the description of motion patterns. We then introduce the histogram-based shape descriptor known as shape contexts [14] to describe the short-term trajectories within each patch in a statistical sense, which faithfully reflects the motion trend and details in every local patch. To the best of our knowledge, this is the first attempt to apply shape description to quantize trajectories as motion features for anomaly detection in crowded scenes. Next, we compute the K-NN similarity-based statistical model for anomaly detection as follows. We retrieve from the training set the K-NN samples of the testing sample, and use the similarities between every pair of these K-NN samples to construct a Gaussian model. The probabilities, under the Gaussian model, of the similarities between the testing sample and the K-NN samples are then combined into a joint probability to check whether those similarities are compatible with the model. Abnormal events are detected by judging whether the joint probability falls below predefined thresholds in the temporal and spatial contexts, separately. The major advantage is that anomaly detection through probability can adapt to the whole scene, since the probability computed in this way is not affected by perspective distortion.

We carried out extensive experiments on anomaly detection and localization using three benchmarks with real-world scenes, namely the UMN dataset [15], the Subway dataset [16], and the UCSDped1 dataset [3], and the results validate the effectiveness and robustness of the proposed method.
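The decision rule outlined above can be sketched as follows, under several stated assumptions: a single univariate Gaussian is fitted to the pairwise similarities of the K retrieved samples, the joint probability is taken as the product of Gaussian densities (computed as a sum of logs, i.e. treating the K similarities as independent), and the threshold is a hypothetical value. The names `joint_log_prob` and `is_anomalous` are illustrative and do not come from the paper.

```python
import numpy as np

def joint_log_prob(intra_sims, test_sims, eps=1e-6):
    """Log of the joint Gaussian density of the test-to-neighbor similarities.

    intra_sims: similarities between every pair of the K retrieved samples.
    test_sims:  similarities between the testing sample and each of the K samples.
    """
    intra_sims = np.asarray(intra_sims, dtype=float)
    test_sims = np.asarray(test_sims, dtype=float)
    mu = intra_sims.mean()
    sigma = max(intra_sims.std(), eps)  # guard against zero variance
    z = (test_sims - mu) / sigma
    log_pdf = -0.5 * z ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    # Independence assumed: the product of densities becomes a sum of logs.
    return float(log_pdf.sum())

def is_anomalous(intra_sims, test_sims, log_threshold):
    """Flag the testing sample when its joint log-density is below the threshold."""
    return joint_log_prob(intra_sims, test_sims) < log_threshold

# Hypothetical similarity values for illustration only.
intra = [0.90, 1.00, 1.10, 1.00, 0.95, 1.05]
normal_test = [1.00, 0.95, 1.05, 1.00]
abnormal_test = [3.0, 3.2, 2.9, 3.1]
```

Because the Gaussian is re-estimated from the neighbors retrieved for each testing sample, the score is relative to the local spread of similarities rather than to one global distance scale, which is the intuition behind applying a single probability threshold across the scene.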

The remainder of the paper is organized as follows: Section 2 reviews related work on anomaly detection. In Section 3, we introduce the histogram-based shape description method to characterize the short-term trajectories. Then, we propose the K-NN similarity-based model to detect anomalies in Section 4. In Section 5, we introduce the spatio-temporal anomaly detection scheme. We evaluate the performance of the proposed method in detecting and locating abnormal behaviors in Section 6. In Section 7, we draw conclusions.

Section snippets

Related work

Many methods detect anomalies by judging individual behaviors. For example, Hinami et al. [17] train a generic convolutional neural network (CNN) model on large datasets to learn individual objects’ attributes and action features and then detect and recount abnormal events based on these features. For methods of this kind, the major challenge for abnormal event detection in crowded scenes is that the high density of objects makes detecting and tracking individual objects

Short-term trajectory feature

Most existing motion-based approaches employ optical flow features [16], [20], e.g., HOF, which capture motion between two successive frames only and fail to associate motions over multiple frames. In view of this limitation, we associate the optical flows across multiple frames to capture short-term trajectories and employ a histogram-based shape descriptor, namely shape contexts [14], to characterize such short-term trajectories.
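A rough sketch of these two steps is given below. It assumes dense optical flow is already available as an array of per-pixel (dx, dy) displacements, chains the flows by nearest-pixel lookup, and bins trajectory points into a log-polar histogram in the spirit of shape contexts. The function names, the bin counts, the radius `r_max`, and the choice of the trajectory start as the histogram origin are all illustrative assumptions; the paper's exact descriptor settings are not given in this excerpt.

```python
import numpy as np

def link_flows(flows, start_points):
    """Chain per-frame optical flow into short-term trajectories.

    flows: array (F, H, W, 2) holding the (dx, dy) displacement of each pixel.
    start_points: array (N, 2) of (x, y) positions in the first frame.
    Returns an array (N, F + 1, 2) of trajectory points.
    """
    flows = np.asarray(flows, dtype=float)
    F, H, W, _ = flows.shape
    pts = np.asarray(start_points, dtype=float)
    traj = [pts.copy()]
    for f in range(F):
        # Nearest-pixel lookup, clamped to the image bounds.
        xi = np.clip(np.rint(pts[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.rint(pts[:, 1]).astype(int), 0, H - 1)
        pts = pts + flows[f, yi, xi]
        traj.append(pts.copy())
    return np.stack(traj, axis=1)

def log_polar_histogram(points, origin, n_r=3, n_theta=8, r_max=32.0):
    """Normalized log-polar histogram of points relative to an origin."""
    d = np.asarray(points, dtype=float).reshape(-1, 2) - np.asarray(origin, dtype=float)
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    keep = (r > 0) & (r <= r_max)  # drop the origin itself and far points
    r_edges = np.logspace(0, np.log10(r_max), n_r + 1)
    r_edges[0] = 0.0  # innermost ring starts at the origin
    r_bin = np.clip(np.searchsorted(r_edges, r[keep], side="right") - 1, 0, n_r - 1)
    t_bin = np.minimum((theta[keep] / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    total = hist.sum()
    return (hist / total if total > 0 else hist).ravel()

# Hypothetical uniform flow: every pixel moves 2 px to the right per frame.
flows = np.zeros((4, 16, 16, 2))
flows[..., 0] = 2.0
traj = link_flows(flows, [[2, 8], [3, 5]])
hist = log_polar_histogram(traj[0], origin=traj[0, 0])
```

A trajectory that covers several frames spreads its points over more than one radial bin, so the histogram encodes both direction and travelled distance, which a two-frame flow histogram cannot do.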

Statistical modeling of K-NN similarities

First, we use the χ² test to measure the similarity between a testing sample and the normal training data, and retrieve from the given training set the K-NN samples of the testing sample. Then, we establish a Gaussian model for the K retrieved samples to characterize the similarities between them in a statistical sense.
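The retrieval and modeling steps can be sketched as below, assuming each sample is a non-negative histogram and that the χ² test is realized as the common χ² histogram distance (smaller means more similar). This reading, the function names, and the toy histograms are assumptions made for illustration.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two non-negative histograms."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def retrieve_knn(test_hist, train_hists, k):
    """Indices and distances of the k training histograms closest to test_hist."""
    dists = np.array([chi2_distance(test_hist, h) for h in train_hists])
    order = np.argsort(dists)[:k]
    return order, dists[order]

def pairwise_gaussian(train_hists, idx):
    """Mean and standard deviation of the distances between every pair of retrieved samples."""
    pairs = [chi2_distance(train_hists[i], train_hists[j])
             for a, i in enumerate(idx) for j in idx[a + 1:]]
    return float(np.mean(pairs)), float(np.std(pairs))

# Hypothetical training histograms: three similar patterns and one outlier.
train = np.array([[0.5, 0.5, 0.0],
                  [0.4, 0.6, 0.0],
                  [0.6, 0.4, 0.0],
                  [0.0, 0.0, 1.0]])
test = np.array([0.5, 0.5, 0.0])
idx, dists = retrieve_knn(test, train, k=3)
mu, sigma = pairwise_gaussian(train, idx)
```

The pair `(mu, sigma)` summarizes how similar the retrieved neighbors are to one another, and `dists` holds the quantities whose compatibility with that summary is then tested.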

Note that there are two seemingly natural but in practice error-prone solutions in decision making: (1) Training the detection model such as the probabilistic model of

Spatio-temporal anomaly detection

As stated previously, we are interested in detecting abnormal temporal and spatial activities. For temporal anomaly detection at a given location, the training data consist of the patches observed at the same location over a long history. For spatial anomaly detection, the training samples come from the surrounding patches. The training data sets for temporal and spatial anomaly detection are used in the K-NN similarity-based statistical model to infer the occurrence probabilities of the testing

Experiments

The proposed method is tested on public real-world datasets with varying densities of people: the UMN dataset [15], the Subway dataset [16], and the UCSDped1 dataset [3]. The challenge is that the scenes in these datasets are not only crowded but also exhibit some degree of perspective distortion.

Conclusions

The contribution of this paper is two-fold: (1) The representation of video content is a key issue for anomaly detection. We recast the problem as shape description of short-term motion trajectories, obtained by associating the optical flows across multiple frames. The advantage is that this low-level short-term trajectory feature does not rely on object segmentation and tracking, which are unreliable in crowded scenes, while it preserves the motion information of object parts, which promotes robustness.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by NSFC (grant No. 61801417), Natural Science Research of Jiangsu Higher Education Institutions of China (No. 18KJB520051), Shanghai Science and Technology Commission (grant No. 17511104203), National Key R&D Program (No. 2018YFE0116700), and the Shandong Provincial Natural Science Foundation (No. ZR2019MF049).

Xinfeng Zhang is a lecturer in the College of Information Engineering at Yangzhou University. He received a Bachelor's degree in electronic and information engineering from Hebei University, a Master's degree in signal and information processing from Shantou University, and a Ph.D. in computer science from Fudan University. His research interests are computer vision and multi-perception information processing.

References (34)

  • K.W. Cheng et al.

    Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression

    IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • M.J. Roshtkhari et al.

    An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions

    Comput. Vis. Image Underst.

    (2013)
  • S. Wu et al.

    Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes

    IEEE Conference on Computer Vision and Pattern Recognition

    (2010)
  • R. Leyva et al.

    Video anomaly detection with compact feature sets for online performance

    IEEE Trans. Image Process.

    (2017)
  • C. Yang et al.

    Sparse reconstruction cost for abnormal event detection

    IEEE Conference on Computer Vision and Pattern Recognition

    (2011)
  • S.J. Belongie et al.

    Shape matching and object recognition using shape contexts

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • R. Mehran et al.

    Abnormal crowd behavior detection using social force model

    IEEE Conference on Computer Vision and Pattern Recognition

    (2009)

    Su Yang is a full professor in the School of Computer Science at Fudan University. His main research interest is pattern recognition and its applications in media processing and smart cities. His works on symbol recognition and feature selection have been widely cited. He received the best paper award from CPSCom 2010 and chaired the 7th SocialCom in Beijing, 2014.

    Jiulong Zhang is an associate professor in the School of Computer Science at Xi’an University of Technology. His current research interests are computer vision, image processing, affective computing, and human-computer interaction. He has published over 40 papers.

    Weishan Zhang is a full professor and deputy head for research of the Department of Software Engineering, China University of Petroleum. His current research interests are big data processing and pervasive and service-oriented computing. He has published over 100 papers. According to Google Scholar, his total citations exceed 1200, his h-index is 19, and his i10-index is 37.
