Multimodal Violent Video Recognition Based on Mutual Distillation

Shang, Yimeng; Wu, Xiaoyu; Liu, Rui

doi:10.1007/978-3-031-18913-5_48

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13536)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

Violent video recognition is a challenging task in computer vision, and multimodal methods have long been an important part of it. Because violent videos contain sensitive content, they are difficult to collect, which has resulted in a lack of large public datasets. Existing methods for learning violent video representations are limited by small datasets and lack efficient multimodal fusion models. To address these problems, we first propose to transfer information effectively from large datasets to small violent datasets through mutual distillation with a self-supervised pretrained model for the vital RGB feature. Second, we propose the multimodal attention fusion network (MAF-Net), which fuses the obtained RGB feature with flow and audio features to recognize violent videos using multimodal information. Third, we build a new violent dataset, the Violent Clip Dataset (VCD), which is large in scale and contains complete audio information. We performed experiments on the public VSD dataset and the self-built VCD dataset. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods on both datasets.
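
The abstract does not spell out how the mutual distillation objective is formed. As a rough illustration only, the sketch below shows a generic mutual-learning style loss for two networks: each one is trained on the ground-truth label plus a KL term that pulls its prediction toward its peer's. The function names, the `weight` parameter, and the toy logits are assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_distillation_losses(logits_a, logits_b, label, weight=1.0):
    """Return (loss_a, loss_b) for two peer networks on one sample.

    Each loss is the supervised cross-entropy plus a weighted KL term
    that treats the peer's prediction as a soft target. `weight` is a
    hypothetical balancing factor chosen for this sketch.
    """
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + weight * kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, label) + weight * kl_divergence(p_a, p_b)
    return loss_a, loss_b


if __name__ == "__main__":
    # Toy two-class logits (violent vs. non-violent), invented for the demo.
    loss_a, loss_b = mutual_distillation_losses([2.0, 0.0], [0.5, 0.3], label=0)
    print(f"loss_a={loss_a:.4f} loss_b={loss_b:.4f}")
```

In a real training loop each peer's soft target would normally be treated as a constant when updating the other network, so that gradients flow only through the network being updated.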

Author information

Authors and Affiliations

Communication University of China, 1 Dingfuzhuang East Street, Chaoyang District, Beijing, China
Yimeng Shang, Xiaoyu Wu & Rui Liu

Corresponding author

Correspondence to Yimeng Shang.

Editor information

Editors and Affiliations

Southern University of Science and Technology, Shenzhen, China
Shiqi Yu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhaoxiang Zhang
Hong Kong Baptist University, Hong Kong, China
Pong C. Yuen
Northwestern Polytechnical University, Xi'an, China
Junwei Han
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tieniu Tan
Hong Kong Baptist University, Hong Kong, China
Yike Guo
Sun Yat-sen University, Guangzhou, China
Jianhuang Lai
Southern University of Science and Technology, Shenzhen, China
Jianguo Zhang

About this paper

Cite this paper

Shang, Y., Wu, X., Liu, R. (2022). Multimodal Violent Video Recognition Based on Mutual Distillation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_48

DOI: https://doi.org/10.1007/978-3-031-18913-5_48
Published: 27 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)
