Detecting abnormality with separated foreground and background: Mutual Generative Adversarial Networks for video abnormal event detection
Introduction
Video-level abnormal event detection refers to the identification of events that do not conform to expected behavior. It is a challenging problem due to the complexity of “anomaly” as well as the cluttered backgrounds, objects, and motions in real-world video scenes (Zhao et al., 2017). In this task, we are given a set of normal training video samples and must determine whether a test video contains an anomaly, based only on these normal samples (Saligrama and Chen, 2012). In general, such anomalies include unusual motion patterns and unusual objects in usual/unusual locations (Saligrama and Chen, 2012). In the past few years, anomaly detection has drawn much attention as a core problem of video modeling, and related technologies have been widely used in public places, e.g., streets, squares, and shopping centers, to increase public safety (Sultani et al., 2018). Recently, deep learning-based methods have provided state-of-the-art results for video abnormal event detection. Popular frameworks include auto-encoders (Xu et al., 2015, Hasan et al., 2016, Luo et al., 2017a, Hinami et al., 2017, Ionescu et al., 2019) and generative adversarial networks (Ravanbakhsh et al., 2017, Liu et al., 2018, Lee et al., 2018).
Despite these efforts, the problem remains open. Human operators are considered more robust to scene changes: they precisely locate abnormal events and perform well even when the given scenes differ from those in the training set. We launched a survey on Amazon Mechanical Turk (MTurk) to study how human operators perform abnormal event detection (Section 2). The results suggest that the performance gap may stem from differences between the detection processes of human operators and existing methods: (i) human operators tend to focus on moving objects instead of the static background, and (ii) human operators measure anomaly by high-level features instead of low-level visual primitives.
Based on these observations, each video frame is separated into foreground and background in this paper. Dynamic objects are directly related to events and are therefore treated as the foreground. Stationary objects and other surroundings are not involved in events, but they often provide the conditions under which events occur; naturally, they are regarded as background. Generative adversarial networks are then designed to learn normal foreground patterns on the training dataset, conditioned on the background. To model the motion and appearance of the foreground, two mutual generative adversarial networks are proposed to transform raw-pixel images into optical-flow representations and vice versa. For an abnormal scene in the test stage, since the training set contains no abnormal samples, the networks are expected to produce distorted reconstructions. By measuring this distortion with extracted high-level features instead of low-level visual primitives, semantic anomalies can be captured easily.
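The exact anomaly metric is not reproduced in this excerpt. As a rough illustration of the idea of measuring distortion in a high-level feature space rather than in pixel space, a per-frame score might be computed along these lines (the feature vectors are assumed to come from some extractor applied to the real frame and its GAN reconstruction; the normalization convention is a common one for frame-level anomaly curves, not necessarily the paper's):

```python
import numpy as np

def anomaly_score(frame_feat, recon_feat):
    """Distortion measured in a high-level feature space.

    frame_feat, recon_feat: feature vectors extracted from the real
    frame and from its reconstruction (extractor not modeled here).
    A larger feature-space distance indicates a more abnormal frame.
    """
    return float(np.linalg.norm(frame_feat - recon_feat))

def normalize_scores(scores):
    """Min-max normalize per-frame scores to [0, 1], a common
    convention when plotting frame-level anomaly curves."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)
```

The key design point, as the text argues, is that the distance is taken between learned features, so reconstructions that are pixel-wise close but semantically wrong still score high.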
In summary, the contributions of this paper can be highlighted as: (i) The proposed Foreground–Background Separation Mutual Generative Adversarial Network (FSM-GAN) permits the separation of video frames into foreground and background. By treating the background as a known condition, FSM-GAN is able to focus on the foreground to detect abnormal events. (ii) High-level features are learned to represent foreground events. In the test stage, feature-based anomaly metrics replace low-level visual primitives in measuring abnormality, forcing the model to capture abnormal semantics. (iii) We carry out extensive experiments on three benchmark datasets to demonstrate the good generalization ability of the proposed FSM-GAN.
Section snippets
Human feedback in abnormal event detection
To qualitatively support the insight behind our proposed method, we investigate how human operators perform abnormal event detection and explore the differences between their detection processes and those of existing methods that may cause the performance gap. A survey was launched on the MTurk crowdsourcing platform, in which workers must read instructions for video abnormal event detection and then complete seven tasks, as shown in Table 1:
The qualification requirement
Related work
In recent years, abnormal event detection in videos has gained attention from computer vision researchers and artificial intelligence application developers. Boosted by the recent success of neural networks (Jiang et al., 2011, Calderara et al., 2011, Dan et al., 2017, Fan et al., 2020), deep learning-based methods have outperformed hand-crafted feature engineering (Zaharescu and Wildes, 2010, Rota et al., 2012, Roshtkhari and Levine, 2013, Wiliem et al., 2012, Jeong et al., 2011, Zhu et
Overall framework
In this paper, we propose FSM-GAN to detect anomalies in videos. Fig. 1 describes the architecture of FSM-GAN, which contains three parts: the foreground extractor, the motion branch, and the appearance branch. The foreground extractor decouples foreground and background from given scenes, producing a foreground mask and the background. Then, the motion branch and the appearance branch are built on the architecture of generative adversarial networks to reconstruct
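The decomposition itself is not detailed in this snippet. In the simplest mask-based reading, a frame splits into foreground and background as follows (the binary mask would come from the foreground extractor, which is not modeled here; the function name and shapes are illustrative assumptions):

```python
import numpy as np

def decompose(frame, mask):
    """Split a frame into foreground and background with a binary mask.

    frame: (H, W, C) float array; mask: (H, W) array in {0, 1},
    assumed to be produced by a foreground extractor (not shown).
    """
    m = mask[..., None].astype(frame.dtype)  # broadcast over channels
    foreground = frame * m          # moving objects, fed to the GAN branches
    background = frame * (1.0 - m)  # static surroundings, used as condition
    return foreground, background
```

By construction the two parts sum back to the original frame, so conditioning the generators on the background loses no information about the scene.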
Overall performance
In this section, we validate the proposed FSM-GAN for anomaly detection, conducting experiments on three of the most commonly used benchmark datasets: Avenue (Lu et al., 2013), UCSD Ped2 (Mahadevan et al., 2010), and ShanghaiTech (Luo et al., 2017b). Each dataset is composed of a training set and a test set. Training videos contain only normal events, while test videos contain both normal and abnormal events.
Avenue dataset is one of the most commonly used datasets and is usually
Conclusion and future work
In this paper, we propose the Foreground–Background Separation Mutual Generative Adversarial Network for abnormal event detection. To capture the nuances of abnormal events and suppress various background noise, we are the first to propose a GAN-based framework that decomposes scenes into foreground and background to guide the modeling of events. As a first step, the foreground extractor separates the foreground from the given scenes. Then, mutual generative adversarial
CRediT authorship contribution statement
Zhi Zhang: Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization. Sheng-hua Zhong: Conceptualization, Validation, Resources, Data curation, Writing – review & editing, Project administration, Funding acquisition. Ahmed Fares: Writing – review & editing. Yan Liu: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Natural Science Foundation of Guangdong Province, China [No. 2019A1515011181], the Science and Technology Innovation Commission of Shenzhen, China under Grant [No. JCYJ20190808162613130], the Shenzhen high-level talents program, China, and the Open Research Fund, China from Guangdong Laboratory of Artificial Intelligence & Digital Economy (SZ) under Grant [No. GML-KF-22-28].
References (41)
- et al., Detecting anomalies in people’s trajectories using spectral graph analysis, Comput. Vis. Image Underst. (2011)
- et al., Video anomaly detection and localization via Gaussian mixture fully convolutional variational autoencoder, Comput. Vis. Image Underst. (2020)
- et al., The THUMOS challenge on action recognition for videos “in the wild”, Comput. Vis. Image Underst. (2017)
- et al., Anomalous video event detection using spatiotemporal context, Comput. Vis. Image Underst. (2011)
- et al., A suspicious behaviour detection using a context space model for smart surveillance systems, Comput. Vis. Image Underst. (2012)
- et al., Sparse representation for robust abnormality detection in crowded scenes, Pattern Recognit. (2014)
- et al., Mixmatch: A holistic approach to semi-supervised learning (2019)
- et al., Albumentations: fast and flexible image augmentations, Information (2020)
- Cohen, W.W., Schapire, R.E., Singer, Y., 1998. Learning to order things. In: NIPS. pp. ...
- et al., Detecting anomalous events in videos by learning deep representations of appearance and motion, Comput. Vis. Image Underst. (2017)
- Variant Grassmann manifolds: A representation augmentation method for action recognition, TKDD