Abstract
Human abnormal behavior detection is important for ensuring public safety and preventing unwanted incidents. Current recognition systems for human abnormal behavior adopt neural network models and perform standard 1-of-N majority voting. However, recognizing abnormal human behaviors remains challenging because of long, large-scale video datasets and the limitations of existing methods that rely on predefined categories and scenarios. This study proposes a novel method named Visual Text Contrastive Learning (VTCL) for identifying abnormal human behavior in campus settings. The proposed model emphasizes semantic information from automatically labeled attribute text and videos of abnormal behaviors, moving beyond simple numerical class labels. Within the visual branch, the method integrates cross-frame and multi-frame mechanisms to improve spatial and temporal modeling. In the textual branch, the proposed prompting technique captures the contextual background of abnormal behaviors, enriching supervision with behavioral semantic information. The model then learns joint visual-text features through contrastive learning. In addition, this work presents a new study of zero-shot campus abnormal behavior recognition (CABR), laying the foundation for highly available and robust CABR across multiple and even unseen scenarios. The proposed VTCL model achieves a Top-1 accuracy of 86.92% and a Top-5 accuracy of 98.14% on the CABR50 dataset, which contains fifty abnormal campus behaviors, with competitive computational complexity. Furthermore, the zero-shot performance of the proposed model shows competitive results on additional datasets, including CABRZ6 and UCF-101.
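For readers unfamiliar with contrastive visual-text training, the sketch below illustrates the general idea behind a VTCL-style objective: embeddings of paired video clips and behavior prompts are pulled together while mismatched pairs are pushed apart, and zero-shot recognition scores a video embedding against the text embeddings of candidate behavior prompts. The function names, tensor shapes, and the symmetric cross-entropy formulation are illustrative assumptions, not the authors' released implementation (see the repository linked under Data Availability for the actual code).

```python
# Minimal sketch of a visual-text contrastive objective and zero-shot matching.
# All names and shapes are illustrative placeholders, not the authors' code.
import torch
import torch.nn.functional as F

def contrastive_loss(video_feats, text_feats, logit_scale):
    # video_feats, text_feats: (B, D) embeddings of paired video clips and
    # their behavior-prompt descriptions.
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * video_feats @ text_feats.t()   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each video matches its own text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def zero_shot_predict(video_feat, class_text_feats):
    # Zero-shot recognition: compare one video embedding with the text
    # embeddings of every candidate behavior prompt and take the best match.
    video_feat = F.normalize(video_feat, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    return (video_feat @ class_text_feats.t()).argmax(dim=-1)
```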
Data Availability
Our data and code will be made publicly available at: https://github.com/LiuHaiChuan0/2021-Deep-learning/tree/main/VTCL.
Acknowledgements
This research was supported by Universiti Malaya, Malaysia, under project number ST018-2023.
Author information
Contributions
Hai Chuan Liu proposed the methods, conducted the experiments, generated and analyzed the results, and was responsible for writing the paper. Anis Salwa Mohd Khairuddin reviewed the approach and results to further improve the quality of the paper. Joon Huang Chuah supervised the writing to improve the paper. Xian Min Zhao verified the experiments and analysis. Xiao Dan Wang analyzed the results and helped write the paper. Li Ming Fang checked the validity of the paper's approach and data. Si Bo Kong checked the paper's approach and formatting.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
The research conducted in this study adhered to ethical guidelines and principles. The data used in this study are sourced from publicly available, open-access datasets and comply with the providers' terms of use, and no personally identifiable information was used. Informed consent was not required since the data were already de-identified and publicly accessible.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, H.C., Mohd Khairuddin, A.S., Chuah, J.H. et al. Novel multimodal contrast learning framework using zero-shot prediction for abnormal behavior recognition. Appl Intell 55, 110 (2025). https://doi.org/10.1007/s10489-024-05994-x