Abstract
Violent video recognition is a challenging task in computer vision, and multimodal methods have long been an important part of it. Because violent videos contain sensitive content, they are difficult to collect, which has resulted in a lack of large public datasets. Existing methods for learning violent video representations are limited by small datasets and lack efficient multimodal fusion models. To address this, we first propose to transfer information effectively from large datasets to small violent datasets through mutual distillation with a self-supervised pretrained model, targeting the vital RGB feature. Second, we propose the multimodal attention fusion network (MAF-Net), which fuses the resulting RGB feature with flow and audio features to recognize violent videos from multimodal information. Third, we build a new violent video dataset, the Violent Clip Dataset (VCD), which is large in scale and contains complete audio information. We performed experiments on the public VSD dataset and the self-built VCD dataset. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods on both datasets.
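The abstract does not spell out the mutual distillation objective. As a minimal sketch, assuming it resembles the standard deep mutual learning formulation (each of two networks minimizes its own cross-entropy plus a KL term pulling it toward its peer's prediction), the per-sample losses could look as follows. The function names, the `weight` parameter, and the two-network setup are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Supervised loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_distillation_losses(logits_a, logits_b, label, weight=1.0):
    """Return (loss_a, loss_b) for one sample.

    Assumption: each network treats its peer's prediction as a soft
    target, as in deep mutual learning. `weight` (hypothetical) scales
    the mimicry term relative to the supervised term.
    """
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + weight * kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, label) + weight * kl_divergence(p_a, p_b)
    return loss_a, loss_b
```

In such a setup, one network could be the self-supervised pretrained model and the other the model trained on the small violent dataset. When both networks agree exactly, the KL terms vanish and each loss reduces to plain cross-entropy.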
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shang, Y., Wu, X., Liu, R. (2022). Multimodal Violent Video Recognition Based on Mutual Distillation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_48
DOI: https://doi.org/10.1007/978-3-031-18913-5_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)