EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency

Published: 12 September 2024

Abstract

Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily reconstruct or predict frames to detect anomalies, and their performance has improved in recent years. However, they often rely heavily on local spatio-temporal information and model object features insufficiently. To address these issues, this article proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT). The main novelties are: (1) A Local Object Anomaly Stream (LOAS) extracts local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions, which detects anomalies from object RGB information, and an Object Pose Anomaly Refiner (OPA), which discovers anomalies from human pose information. (2) A Global Temporal Strengthening Stream (GTSS) leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. EOGT employs both streams jointly to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and finally fuses the anomaly features and scores to detect anomalies at the frame level. Extensive experiments on three public datasets, ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2, verify the performance of EOGT.
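
The final step described in the abstract, combining object-level and video-level anomaly scores into one score per frame, can be illustrated with a minimal Python sketch. The abstract does not say how EOGT performs this fusion, so everything below is an assumption made for illustration: object scores are reduced to a frame score by taking the maximum, each stream is min-max normalized within a video, and the two streams are combined by a weighted sum. The function names and the `local_weight` parameter are hypothetical and are not taken from the article.

```python
from typing import Dict, List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one video's scores to [0, 1]; a constant sequence maps to zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def object_to_frame_scores(
    object_scores: Dict[int, List[float]], num_frames: int
) -> List[float]:
    """Reduce object-level scores to one score per frame via the maximum.

    Assumption of this sketch: a frame with no detected object scores 0.0.
    """
    return [
        max(object_scores.get(t, []), default=0.0) for t in range(num_frames)
    ]


def fuse_streams(
    local_scores: Sequence[float],
    global_scores: Sequence[float],
    local_weight: float = 0.5,
) -> List[float]:
    """Weighted sum of two normalized per-frame score sequences.

    This is a generic late-fusion rule, not the fusion used in the article.
    """
    if len(local_scores) != len(global_scores):
        raise ValueError("both streams must score the same number of frames")
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    local_n = min_max_normalize(local_scores)
    global_n = min_max_normalize(global_scores)
    return [
        local_weight * a + (1.0 - local_weight) * b
        for a, b in zip(local_n, global_n)
    ]


if __name__ == "__main__":
    # Hypothetical scores for a 4-frame clip: frame index -> object scores.
    per_object = {0: [0.1, 0.2], 1: [0.9], 3: [0.3]}
    local = object_to_frame_scores(per_object, num_frames=4)
    global_stream = [0.2, 0.8, 0.4, 0.2]
    fused = fuse_streams(local, global_stream)
    print(max(range(len(fused)), key=fused.__getitem__))  # most anomalous frame
```

Normalizing each stream before the weighted sum keeps one stream from dominating only because its raw scores are on a larger scale. Taking the maximum over objects reflects the common convention that a single anomalous object is enough to flag a frame.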

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 10
    October 2024
    729 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613707
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 September 2024
    Online AM: 06 May 2024
    Accepted: 12 April 2024
    Revised: 12 March 2024
    Received: 08 December 2023
    Published in TOMM Volume 20, Issue 10

    Author Tags

    1. Video anomaly detection
    2. object information enhancement
    3. diffusion-based network with multimodal conditions
    4. global temporal dependency

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
