Abstract
Human abnormal behavior detection is important for ensuring public safety and preventing unwanted incidents. Current recognition systems for human abnormal behavior adopt neural network models and perform standard 1-of-N majority voting. However, recognizing abnormal human behaviors remains challenging because of long, large-scale video datasets and the limitations of existing methods that rely on predefined categories and scenarios. This study proposes a novel method named Visual Text Contrastive Learning (VTCL) for identifying abnormal human behavior in campus settings. The proposed model emphasizes semantic information from automatically labeled attribute text and videos of abnormal behaviors, moving beyond simple numerical class labels. Within the visual branch, the method integrates cross-frame and multi-frame mechanisms to improve spatial and temporal modeling. In the textual branch, the proposed prompting technique captures the contextual background of abnormal behaviors, enriching supervision with behavioral semantic information. The model then learns joint visual-text features through contrastive learning. In addition, this work presents a new study of zero-shot campus abnormal behavior recognition (CABR), laying the foundation for highly available and robust CABR across multiple and even unseen scenarios. The proposed VTCL model achieves a Top-1 accuracy of 86.92% and a Top-5 accuracy of 98.14% on the CABR50 dataset, which contains fifty abnormal campus behaviors, with competitive computational complexity. Furthermore, the zero-shot performance of the proposed model shows competitive results on additional datasets, including CABRZ6 and UCF-101.
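For readers unfamiliar with contrastive visual-text training, the sketch below illustrates the general idea behind a VTCL-style objective: embeddings of paired video clips and behavior prompts are pulled together while mismatched pairs are pushed apart, and zero-shot recognition scores a video embedding against the text embeddings of candidate behavior prompts. The function names, tensor shapes, and the symmetric cross-entropy formulation are illustrative assumptions, not the authors' released implementation (see the repository linked under Data Availability for the actual code).

```python
# Minimal sketch of a visual-text contrastive objective and zero-shot matching.
# All names and shapes are illustrative placeholders, not the authors' code.
import torch
import torch.nn.functional as F

def contrastive_loss(video_feats, text_feats, logit_scale):
    # video_feats, text_feats: (B, D) embeddings of paired video clips and
    # their behavior-prompt descriptions.
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * video_feats @ text_feats.t()   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each video matches its own text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def zero_shot_predict(video_feat, class_text_feats):
    # Zero-shot recognition: compare one video embedding with the text
    # embeddings of every candidate behavior prompt and take the best match.
    video_feat = F.normalize(video_feat, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    return (video_feat @ class_text_feats.t()).argmax(dim=-1)
```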
Data Availability
Our data and code will be made publicly available at: https://github.com/LiuHaiChuan0/2021-Deep-learning/tree/main/VTCL.
Acknowledgements
This research was supported by Universiti Malaya, Malaysia, under project number ST018-2023.
Author information
Contributions
Hai Chuan Liu proposed the methods, conducted the experiments, generated and analyzed the results, and was responsible for writing the paper. Anis Salwa Mohd Khairuddin reviewed the approach and results to further improve the quality of the paper. Joon Huang Chuah supervised the writing to improve the paper. Xian Min Zhao verified the experiments and analysis. Xiao Dan Wang analyzed the results and helped write the paper. Li Ming Fang checked the validity of the paper's approach and data. Si Bo Kong checked the paper's approach and formatting.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval
The research conducted in this study adhered to ethical guidelines and principles. The data used in this study are sourced from publicly available, open-access datasets and comply with the providers' terms of use, and no personally identifiable information was used. Informed consent was not required since the data were already de-identified and publicly accessible.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, H.C., Mohd Khairuddin, A.S., Chuah, J.H. et al. Novel multimodal contrast learning framework using zero-shot prediction for abnormal behavior recognition. Appl Intell 55, 110 (2025). https://doi.org/10.1007/s10489-024-05994-x