DOI: 10.1145/3573942.3574104
research-article

Multi-instance learning anomaly event detection based on Transformer

Published: 16 May 2023

Abstract

Multi-instance learning (MIL) is the dominant approach to weakly supervised anomaly detection in surveillance videos. Features extracted by networks such as Convolutional 3D (C3D) or Inflated 3D ConvNet (I3D) alone capture video context poorly, which has prompted a range of abnormal event detection algorithms based on attention mechanisms. Vision Transformer (ViT) brought the Transformer architecture to computer vision and demonstrated strong performance. In this paper, we propose a Transformer-based multi-instance learning method for anomaly event detection, called MIL-ViT. It uses a pre-trained I3D model to extract spatio-temporal features, feeds these features into the ViT encoder to extract the salient information, and outputs anomaly scores. We also introduce the MIL ranking loss and a center loss function to improve training. Experimental results on two benchmark datasets (ShanghaiTech and UCF-Crime) show that our method achieves a significantly higher AUC than several recent state-of-the-art methods.
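
The abstract names two training objectives without giving their form. The sketch below illustrates one plausible reading, not the authors' implementation: a hinge-style MIL ranking loss that pushes the top-scoring snippet of an anomalous video above the top-scoring snippet of a normal video, plus a center loss that pulls the snippet scores of a normal video toward their mean. The function names, the margin of 1.0, and the `center_weight` parameter are assumptions made for illustration, and plain Python lists stand in for the per-snippet anomaly scores the ViT encoder would produce.

```python
from typing import Sequence


def mil_ranking_loss(abnormal_scores: Sequence[float],
                     normal_scores: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge loss between the top-scoring snippet of each bag (video).

    Assumed form: max(0, margin - max(abnormal) + max(normal)).
    """
    return max(0.0, margin - max(abnormal_scores) + max(normal_scores))


def center_loss(normal_scores: Sequence[float]) -> float:
    """Mean squared distance of normal-snippet scores from their mean.

    Assumed form: the center is the average score of the normal bag.
    """
    center = sum(normal_scores) / len(normal_scores)
    return sum((s - center) ** 2 for s in normal_scores) / len(normal_scores)


def total_loss(abnormal_scores: Sequence[float],
               normal_scores: Sequence[float],
               center_weight: float = 1.0) -> float:
    """Ranking loss plus weighted center loss (the weight is hypothetical)."""
    ranking = mil_ranking_loss(abnormal_scores, normal_scores)
    return ranking + center_weight * center_loss(normal_scores)


if __name__ == "__main__":
    # Per-snippet anomaly scores in [0, 1] for one anomalous and one normal video.
    abnormal = [0.1, 0.9, 0.2]
    normal = [0.2, 0.4]
    print(total_loss(abnormal, normal))
```

Only the maximum score of each bag enters the ranking term, which is what lets video-level labels supervise snippet-level scores; the center term then discourages spurious peaks inside normal videos.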


    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Anomaly detection
    2. multi-instance learning
    3. surveillance videos
    4. vision transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
