Detecting abnormality with separated foreground and background: Mutual Generative Adversarial Networks for video abnormal event detection

https://doi.org/10.1016/j.cviu.2022.103416

Abstract

As one of the most important tasks in intelligent video analysis, video abnormal event detection has been extensively studied. Prior arts have made great progress in designing frameworks to capture spatio-temporal features of video frames. However, video frames usually contain various objects, and it is challenging to grasp the nuances of anomalies against noisy backgrounds. To tackle this bottleneck, we propose a novel Foreground–Background Separation Mutual Generative Adversarial Network (FSM-GAN) framework. The FSM-GAN separates video frames into foreground and background. The separated foreground and background serve as the input of mutual generative adversarial networks, which transform raw-pixel images into optical-flow representations and vice versa. In these networks, the background is regarded as a known condition, and during the mutual adversarial training the model focuses on learning high-level spatio-temporal foreground features that represent the event under the given conditions. In the test stage, these high-level features, instead of low-level visual primitives, are utilized to measure abnormality at the semantic level. Compared with state-of-the-art methods and other abnormal event detection approaches, the proposed framework demonstrates its effectiveness and reliability across various scenes and events.

Introduction

Video-level abnormal event detection refers to the identification of events that do not conform to expected behavior. It is a challenging problem due to the complexity of “anomaly” as well as the cluttered backgrounds, objects, and motions in real-world video scenes (Zhao et al., 2017). In this task, we are given a set of normal training video samples and must determine whether or not a test video contains an anomaly with respect to these samples (Saligrama and Chen, 2012). In general, such anomalies can include unusual motion patterns and unusual objects in usual/unusual locations (Saligrama and Chen, 2012). In the past couple of years, the anomaly detection task has drawn much attention as a core problem of video modeling, and related technologies have been widely used in public places, e.g., streets, squares, and shopping centers, to increase public safety (Sultani et al., 2018). Recently, deep learning-based methods have provided state-of-the-art results for video abnormal event detection. Popular frameworks include auto-encoders (Xu et al., 2015, Hasan et al., 2016, Luo et al., 2017a, Hinami et al., 2017, Ionescu et al., 2019) and generative adversarial networks (Ravanbakhsh et al., 2017, Liu et al., 2018, Lee et al., 2018).

Although many efforts have been made, the problem remains open. Human operators are considered more robust to scene changes, can precisely locate abnormal events, and work well even when the given scenes differ from those in the training set. We launched a survey on Amazon Mechanical Turk (MTurk) to study how human operators perform abnormal event detection (Section 2). The survey results suggest that the detection performance gap may be caused by the different detection processes of human operators and existing methods: (i) Human operators tend to focus on moving objects instead of the static background. (ii) Human operators measure anomaly by high-level features instead of low-level visual primitives.

Based on these observations, each video frame is separated into foreground and background in this paper. Dynamic objects are directly related to the events and are therefore considered as the foreground. Stationary objects and other surroundings are not directly involved in events, but they often provide the conditions under which events take place; naturally, they are regarded as background. Generative adversarial networks are then designed to learn normal foreground patterns under the condition of the background on the training dataset. To model the motion and appearance of the foreground, two mutual generative adversarial networks are proposed to transform raw-pixel images into optical-flow representations and vice versa. For an abnormal scene in the test stage, since the training set does not contain any abnormal samples, the networks are expected to produce distorted reconstructions. By measuring the distortion with extracted high-level features instead of low-level visual primitives, semantic anomalies can be easily captured.
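To make the idea concrete, the sketch below (in PyTorch) illustrates how a foreground-masked frame and its optical-flow map could be passed through two cross-modal generators and scored by feature-level distortion. The network shapes, the use of the generators' encoders as feature extractors, and the scoring function are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleGenerator(nn.Module):
    """Tiny encoder-decoder stand-in for one of the mutual GAN generators."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        feat = self.encoder(x)            # high-level features
        return self.decoder(feat), feat

g_frame2flow = SimpleGenerator(in_ch=3, out_ch=2)   # raw pixels -> optical flow
g_flow2frame = SimpleGenerator(in_ch=2, out_ch=3)   # optical flow -> raw pixels

def anomaly_score(frame, flow, fg_mask):
    """Score one frame by feature-level distortion on the foreground only."""
    fg_frame = frame * fg_mask            # background treated as a known condition
    fg_flow = flow * fg_mask
    pred_flow, _ = g_frame2flow(fg_frame) # cross-modal predictions
    pred_frame, _ = g_flow2frame(fg_flow)
    # measure distortion in feature space rather than on raw pixels,
    # using the opposite generator's encoder as the feature extractor
    feat_real_flow = g_flow2frame.encoder(fg_flow)
    feat_pred_flow = g_flow2frame.encoder(pred_flow)
    feat_real_frame = g_frame2flow.encoder(fg_frame)
    feat_pred_frame = g_frame2flow.encoder(pred_frame)
    return (torch.norm(feat_real_flow - feat_pred_flow)
            + torch.norm(feat_real_frame - feat_pred_frame)).item()

# toy usage with random inputs
with torch.no_grad():
    frame = torch.rand(1, 3, 64, 64)
    flow = torch.rand(1, 2, 64, 64)
    mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
    print(anomaly_score(frame, flow, mask))
```

In a trained model, larger feature-space distortion would indicate that the foreground deviates from the normal patterns learned during mutual adversarial training.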

In summary, the contributions of this paper can be highlighted as: (i) The proposed Foreground–Background Separation Mutual Generative Adversarial Network (FSM-GAN) permits the separation of video frames into foreground and background. By regarding the background as a known condition, the proposed FSM-GAN is able to focus on the foreground to detect abnormal events. (ii) High-level features are learned to represent the foreground events. In the test stage, feature-based anomaly metrics take the place of low-level visual primitives to measure abnormality, forcing the model to capture abnormal semantics. (iii) We carry out extensive experiments to demonstrate the good generalization ability of the proposed FSM-GAN across three benchmark datasets.

Section snippets

Human feedback in abnormal event detection

To qualitatively support the insight behind our proposed method, we investigate how human operators perform abnormal event detection and explore the different detection processes of human operators and existing methods that may cause the detection performance gap. A survey was launched on the MTurk crowdsourcing platform, in which workers must read the instructions for video abnormal event detection and then finish seven tasks, as shown in Table 1:

The qualification requirement

Related work

In recent years, abnormal event detection in videos has gained attention from computer vision researchers and artificial intelligence application developers. Boosted by the recent success of neural networks (Jiang et al., 2011, Calderara et al., 2011, Dan et al., 2017, Fan et al., 2020), deep learning-based methods have outperformed hand-crafted feature engineering (Zaharescu and Wildes, 2010, Rota et al., 2012, Roshtkhari and Levine, 2013, Wiliem et al., 2012, Jeong et al., 2011, Zhu et

Overall framework

In this paper, we propose an FSM-GAN to detect anomalies in videos. Fig. 1 describes the architecture of FSM-GAN, which contains three parts, i.e., the foreground extractor, the motion branch, and the appearance branch. The foreground extractor decouples the foreground and background from the given scenes, producing the foreground mask Mt and background B. The motion branch and the appearance branch are then built on the architecture of generative adversarial networks to reconstruct
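As an illustration of the first part, the snippet below sketches one simple way a foreground extractor could produce a per-frame foreground mask Mt and a background image B, assuming a temporal-median background model; the actual extractor used in FSM-GAN is not reproduced here.

```python
import numpy as np

def extract_foreground(frames, threshold=25):
    """frames: uint8 array of shape (T, H, W), grayscale in [0, 255]."""
    background = np.median(frames, axis=0)                    # background B
    masks = []
    for frame in frames:
        diff = np.abs(frame.astype(np.float32) - background)
        masks.append((diff > threshold).astype(np.uint8))     # foreground mask Mt
    return np.stack(masks), background

# toy usage with random frames
frames = np.random.randint(0, 256, size=(16, 120, 160), dtype=np.uint8)
masks, background = extract_foreground(frames)
print(masks.shape, background.shape)   # (16, 120, 160) (120, 160)
```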

Overall performance

In this section, we validate the proposed FSM-GAN for anomaly detection and conduct experiments on the three most commonly used benchmark datasets: Avenue (Lu et al., 2013), UCSD Ped2 (Mahadevan et al., 2010), and ShanghaiTech (Luo et al., 2017b). Each dataset is composed of two subsets: the training set and the test set. Training videos contain only normal events, while test videos have both normal and abnormal events.

Avenue dataset is one of the most commonly used datasets and is usually
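Work on these benchmarks commonly reports frame-level AUC, computed from per-frame anomaly scores against ground-truth labels. The sketch below shows such an evaluation with placeholder scores and labels; the min-max normalization step is a common convention and an assumption here, not a detail taken from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores, labels):
    """scores: per-frame anomaly scores; labels: 1 = abnormal, 0 = normal."""
    # scores are often min-max normalized before computing the AUC
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return roc_auc_score(labels, scores)

# placeholder scores and labels purely for demonstration
scores = np.random.rand(1000)
labels = np.random.randint(0, 2, size=1000)
print(frame_level_auc(scores, labels))
```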

Conclusion and future work

In this paper, we propose the Foreground–Background Separation Mutual Generative Adversarial Network for abnormal event detection. To capture the nuances of abnormal events and suppress various background noise, we are the first to propose a GAN-based framework that decomposes the foreground and background to guide the modeling of events. In the first step, the foreground extractor separates the foreground from the given scenes. Then, mutual generative adversarial

CRediT authorship contribution statement

Zhi Zhang: Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization. Sheng-hua Zhong: Conceptualization, Validation, Resources, Data curation, Writing – review & editing, Project administration, Funding acquisition. Ahmed Fares: Writing – review & editing. Yan Liu: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Guangdong Province, China [No. 2019A1515011181], the Science and Technology Innovation Commission of Shenzhen, China under Grant [No. JCYJ20190808162613130], the Shenzhen high-level talents program, China, and the Open Research Fund, China from Guangdong Laboratory of Artificial Intelligence & Digital Economy (SZ) under Grant [No. GML-KF-22-28].

References (41)

  • Gong, D., Liu, L., Le, V., Saha, B., Mansour, M.R., Venkatesh, S., Hengel, A.v.d., 2019. Memorizing normality to detect...
  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved Training of Wasserstein GANs. In:...
  • Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S., 2016. Learning temporal regularity in video...
  • Hinami, R., Mei, T., Satoh, S., 2017. Joint detection and recounting of abnormal events by learning deep generic...
  • Hong, J., et al., 2019. Variant Grassmann manifolds: A representation augmentation method for action recognition. TKDD.
  • Ionescu, R.T., Khan, F.S., Georgescu, M.-I., Shao, L., 2019. Object-centric auto-encoders and dummy anomalies for...
  • Jeong, H., Chang, H.J., Choi, J.Y., 2011. Modeling of moving object trajectory by spatio-temporal learning for abnormal...
  • Lee, S., Kim, H.G., Ro, Y.M., 2018. STAN: Spatio-temporal adversarial networks for abnormal event detection. In:...
  • Liu, W., Luo, W., Lian, D., Gao, S., 2018. Future frame prediction for anomaly detection–a new baseline. In: CVPR. pp....
  • Lu, C., Shi, J., Jia, J., 2013. Abnormal event detection at 150 fps in matlab. In: ICCV. pp....