
Multimodal Violent Video Recognition Based on Mutual Distillation

  • Conference paper in: Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13536)

Abstract

Violent video recognition is a challenging task in computer vision, and multimodal methods have long been an important part of it. Because violent videos contain sensitive content, they are difficult to collect, which results in a lack of large public datasets. Existing methods for learning violent video representations are limited by these small datasets and lack efficient multimodal fusion models. To address this situation, we first propose to transfer information effectively from large datasets to small violent datasets through mutual distillation with a self-supervised pretrained model, targeting the vital RGB feature. Second, we propose the multimodal attention fusion network (MAF-Net), which fuses the obtained RGB feature with flow and audio features to recognize violent videos using multimodal information. Third, we build a new large-scale violent dataset, named the Violent Clip Dataset (VCD), which contains complete audio information. We performed experiments on the public VSD dataset and the self-built VCD dataset. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods on both datasets.
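The mutual-distillation idea in the abstract, where two networks learn by matching each other's softened predictions (in the spirit of deep mutual learning), can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the temperature value, and the symmetric-KL form are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softened class probabilities; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_distillation_losses(logits_small, logits_pretrained, temperature=2.0):
    """Symmetric distillation terms: each model is pushed toward the other's
    softened predictions, so knowledge flows in both directions rather than
    only from a fixed teacher to a student."""
    p_small = softmax(logits_small, temperature)
    p_pre = softmax(logits_pretrained, temperature)
    loss_small = kl_divergence(p_pre, p_small)  # added to the small-dataset model's loss
    loss_pre = kl_divergence(p_small, p_pre)    # added to the pretrained model's loss
    return loss_small, loss_pre
```

In training, each of these terms would be weighted and added to the corresponding model's ordinary classification loss; when the two models agree, both terms vanish.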



Author information

Corresponding author

Correspondence to Yimeng Shang.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shang, Y., Wu, X., Liu, R. (2022). Multimodal Violent Video Recognition Based on Mutual Distillation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_48

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-18913-5_48

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science, Computer Science (R0)
