Novel multimodal contrast learning framework using zero-shot prediction for abnormal behavior recognition

Abstract

Detecting abnormal human behavior is important for ensuring public safety and preventing unwanted incidents. Current recognition systems typically adopt neural network models and perform a standard 1-of-N majority-voting procedure. However, recognizing abnormal human behavior remains challenging because video datasets are long and numerous, and existing methods rely on predefined categories and scenarios. This study proposes Visual Text Contrastive Learning (VTCL), a novel method for identifying abnormal human behavior in campus settings. Rather than treating labels as simple numerical indices, the model exploits semantic information from automatically labeled attribute text and videos of abnormal behaviors. Within the visual branch, the method integrates cross-frame and multi-frame modules to improve spatial and temporal modeling. In the textual branch, a prompting technique captures the context of abnormal behaviors, enriching supervision with behavioral semantic information. The model then aligns visual and textual features through contrastive learning. In addition, this work presents a new study of zero-shot campus abnormal behavior recognition (CABR), laying the foundation for highly available and robust CABR across multiple, and even unseen, scenarios. The proposed VTCL model achieves a Top-1 accuracy of 86.92% and a Top-5 accuracy of 98.14% on the CABR50 dataset, which comprises fifty abnormal campus behaviors, with competitive computational complexity. Its zero-shot performance is also competitive on additional datasets, including CABRZ6 and UCF-101.
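The abstract describes a CLIP-style pairing of a visual branch and a textual prompt branch, trained with contrastive learning and evaluated zero-shot. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released code: the encoders, dimensions, prompt handling, and all names are placeholder assumptions (the actual implementation is at the GitHub link in the Data Availability section).

```python
# Minimal, hypothetical sketch of CLIP-style visual-text contrastive training
# and zero-shot prediction, as outlined in the abstract. Encoder designs,
# dimensions, and prompts are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Toy visual branch: projects and temporally pools per-frame features
    (a stand-in for the paper's cross-frame / multi-frame modules)."""
    def __init__(self, frame_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames):                      # frames: (B, T, frame_dim)
        video_feat = self.proj(frames).mean(dim=1)  # average over T frames
        return F.normalize(video_feat, dim=-1)

class TextEncoder(nn.Module):
    """Toy textual branch: embeds tokenised behavior prompts such as
    'a video of a student fighting on campus'."""
    def __init__(self, vocab_size=1000, embed_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids):                   # token_ids: (N, L)
        text_feat = self.emb(token_ids).mean(dim=1)
        return F.normalize(text_feat, dim=-1)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched video/prompt pairs in a batch."""
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def zero_shot_predict(video_emb, class_text_emb):
    """Zero-shot recognition: assign each clip its most similar behavior prompt."""
    return (video_emb @ class_text_emb.t()).argmax(dim=-1)

# Toy usage with random tensors standing in for real CABR50 clips and prompts.
v_enc, t_enc = VideoEncoder(), TextEncoder()
clips = torch.randn(4, 16, 512)                     # 4 clips, 16 frames each
prompts = torch.randint(0, 1000, (4, 8))            # one 8-token prompt per clip
loss = contrastive_loss(v_enc(clips), t_enc(prompts))
predictions = zero_shot_predict(v_enc(clips), t_enc(prompts))
```

At inference, the same similarity scoring is applied against prompts for behavior classes never seen during training, which is what enables the zero-shot evaluation on CABRZ6 and UCF-101 reported in the abstract.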

Data Availability

Our data and code will be published and can be found at the following link: https://github.com/LiuHaiChuan0/2021-Deep-learning/tree/main/VTCL.

Acknowledgements

This research was supported by Universiti Malaya, Malaysia, under project number ST018-2023.

Author information

Contributions

Hai Chuan Liu proposed the methods, conducted the experiments, generated and analyzed the results, and wrote the paper. Anis Salwa Mohd Khairuddin reviewed the approach and the results to further improve the quality of the paper. Joon Huang Chuah supervised the writing to improve the paper. Xian Min Zhao verified the experiments and analysis. Xiao Dan Wang analyzed the results and helped write the paper. Li Ming Fang checked the effectiveness of the paper's approach and data. Si Bo Kong checked the effectiveness of the paper's approach and formatting.

Corresponding author

Correspondence to Anis Salwa Mohd Khairuddin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval

The research conducted in this study adhered to ethical guidelines and principles. The data used are sourced from publicly available, open-access datasets, comply with the providers' terms, and contain no identifiable information. This study did not require informed consent because the data were already de-identified and publicly accessible.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, H.C., Mohd Khairuddin, A.S., Chuah, J.H. et al. Novel multimodal contrast learning framework using zero-shot prediction for abnormal behavior recognition. Appl Intell 55, 110 (2025). https://doi.org/10.1007/s10489-024-05994-x
