Abstract
Violence detection in videos has numerous applications, ranging from parental control and child protection to multimedia filtering and retrieval. A number of approaches have been proposed to detect the vital cues of violent actions, most of which employ trajectory-based action recognition techniques. However, these methods model only the general characteristics of human actions and therefore cannot capture the specific high-order information of violent actions, which are typically intense and correlated with specific scenes. In this paper, we propose a novel framework, multi-stream deep convolutional neural networks, for person-to-person violence detection in videos. In addition to the conventional spatial and temporal streams, we develop an acceleration stream to capture the intense motion information usually involved in violent actions. Moreover, we propose a simple and effective score-level fusion strategy to integrate the multi-stream information. Extensive experiments on a typical violence dataset demonstrate the effectiveness of our method and its superiority over state-of-the-art methods.
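The score-level fusion described above can be sketched as a weighted average of per-stream class scores. The paper does not specify the fusion weights or the exact score form; the logits, the equal weights, and the two-class (violent vs. non-violent) setup below are illustrative assumptions only:

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(stream_logits, weights=None):
    """Score-level fusion: weighted average of per-stream softmax scores.

    stream_logits: list of per-stream logit vectors (one per stream).
    weights: optional per-stream weights; defaults to a uniform average.
    """
    scores = [softmax(l) for l in stream_logits]
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical logits for (violent, non-violent) from the three streams.
spatial = np.array([2.0, 1.0])
temporal = np.array([0.5, 1.5])
acceleration = np.array([3.0, 0.0])

fused = fuse_scores([spatial, temporal, acceleration])
label = "violent" if fused[0] > fused[1] else "non-violent"
```

A uniform average is the simplest instantiation; stream weights could instead be tuned on a validation set, e.g. to emphasize the acceleration stream for intense actions.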
Acknowledgment
This work was supported by the Hong Kong, Macao and Taiwan Science Technology Cooperation Program of China (No. L2015TGA9004), and the National Natural Science Foundation of China (No. 61573045).
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Dong, Z., Qin, J., Wang, Y. (2016). Multi-stream Deep Networks for Person to Person Violence Detection in Videos. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 662. Springer, Singapore. https://doi.org/10.1007/978-981-10-3002-4_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3001-7
Online ISBN: 978-981-10-3002-4