Abstract
Violent scene detection (VSD) in videos has practical significance in various applications, such as film rating and protecting children from violent content. Most previous VSD systems have relied mainly on visual cues, although acoustic cues can also help to detect violent scenes, especially when visual cues are unreliable. In this paper, we focus on exploring acoustic information for violent scene detection. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in visual content processing tasks. We therefore investigate using CNNs for violent scene detection based on acoustic information in videos. We apply CNNs in two ways: directly as a classifier, or as a deep acoustic feature extractor. Experimental results on the MediaEval 2015 evaluation dataset show that CNNs are effective both as classifiers and as acoustic feature extractors. Furthermore, fusing acoustic and visual information significantly improves violent scene detection performance.
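The abstract does not specify the network architecture or input representation, but the "CNN as deep acoustic feature extractor" idea can be illustrated with a minimal sketch: convolve a log-spectrogram with learned filters, then pool each feature map into a fixed-length vector that a downstream classifier could consume. Everything below (frame sizes, kernel shapes, random filters in place of trained weights) is hypothetical, not the authors' actual configuration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    # Frame the waveform and take the log-magnitude FFT (simplified log-spectrogram).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return np.log(spec + 1e-8)  # shape: (n_frames, frame_len // 2 + 1)

def conv2d_relu(x, kernels):
    # "Valid" 2-D convolution with ReLU, producing one feature map per kernel.
    kh, kw = kernels.shape[1:]
    H, W = x.shape
    out = np.zeros((kernels.shape[0], H - kh + 1, W - kw + 1))
    for k, ker in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(x[i : i + kh, j : j + kw] * ker)
    return np.maximum(out, 0.0)

def deep_audio_feature(signal, kernels):
    # CNN-as-feature-extractor: global max-pool each map into one scalar,
    # yielding a fixed-length vector regardless of clip duration.
    maps = conv2d_relu(log_spectrogram(signal), kernels)
    return maps.reshape(maps.shape[0], -1).max(axis=1)

rng = np.random.default_rng(0)
audio = rng.standard_normal(4000)        # stand-in for a short audio clip
kernels = rng.standard_normal((8, 3, 3)) # 8 hypothetical 3x3 filters (untrained)
feat = deep_audio_feature(audio, kernels)
print(feat.shape)  # (8,) — one pooled activation per convolutional filter
```

In practice the pooled vector would be fed to a classifier (e.g. an SVM) or the CNN would carry its own softmax output when used directly as a classifier, which are the two usage modes the abstract contrasts.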
Acknowledgements
This work was supported by the Beijing Natural Science Foundation (No. 4142029), the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 14XNLQ01), and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
Cite this paper
Mu, G., Cao, H., Jin, Q. (2016). Violent Scene Detection Using Convolutional Neural Networks and Deep Audio Features. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3004-8
Online ISBN: 978-981-10-3005-5
eBook Packages: Computer Science (R0)