Abstract
Violent scene detection (VSD) in videos has practical significance in various applications, such as film rating and protecting children from violent content. Most previous VSD systems have relied mainly on visual cues, although acoustic cues can also help to detect violent scenes, especially when visual cues are unreliable. In this paper, we focus on exploring acoustic information for violent scene detection. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in visual content processing tasks. We therefore investigate using CNNs for violent scene detection based on acoustic information in videos. We apply CNNs in two ways: directly as a classifier, or as a deep acoustic feature extractor. Experimental results on the MediaEval 2015 evaluation dataset show that CNNs are effective both as classifiers and as acoustic feature extractors. Furthermore, fusing acoustic and visual information significantly improves violent scene detection performance.
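The abstract does not specify the network architecture or input representation, but the "CNN as deep acoustic feature extractor" idea can be illustrated with a minimal sketch: convolve a log-spectrogram with learned filters, then pool each feature map into a fixed-length vector that a downstream classifier could consume. Everything below (frame sizes, kernel shapes, random filters in place of trained weights) is hypothetical, not the authors' actual configuration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    # Frame the waveform and take the log-magnitude FFT (simplified log-spectrogram).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return np.log(spec + 1e-8)  # shape: (n_frames, frame_len // 2 + 1)

def conv2d_relu(x, kernels):
    # "Valid" 2-D convolution with ReLU, producing one feature map per kernel.
    kh, kw = kernels.shape[1:]
    H, W = x.shape
    out = np.zeros((kernels.shape[0], H - kh + 1, W - kw + 1))
    for k, ker in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(x[i : i + kh, j : j + kw] * ker)
    return np.maximum(out, 0.0)

def deep_audio_feature(signal, kernels):
    # CNN-as-feature-extractor: global max-pool each map into one scalar,
    # yielding a fixed-length vector regardless of clip duration.
    maps = conv2d_relu(log_spectrogram(signal), kernels)
    return maps.reshape(maps.shape[0], -1).max(axis=1)

rng = np.random.default_rng(0)
audio = rng.standard_normal(4000)        # stand-in for a short audio clip
kernels = rng.standard_normal((8, 3, 3)) # 8 hypothetical 3x3 filters (untrained)
feat = deep_audio_feature(audio, kernels)
print(feat.shape)  # (8,) — one pooled activation per convolutional filter
```

In practice the pooled vector would be fed to a classifier (e.g. an SVM) or the CNN would carry its own softmax output when used directly as a classifier, which are the two usage modes the abstract contrasts.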
Acknowledgements
This work was supported by the Beijing Natural Science Foundation (No. 4142029), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 14XNLQ01), and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
Cite this paper
Mu, G., Cao, H., Jin, Q. (2016). Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3004-8
Online ISBN: 978-981-10-3005-5
eBook Packages: Computer Science (R0)