Abstract
This chapter introduces a benchmark evaluation targeting the detection of violent scenes in Hollywood movies. The evaluation was implemented in 2011 and 2012 as an affect task in the framework of the international MediaEval benchmark initiative. We report on these 2 years of evaluation, providing a detailed description of the dataset created, describing the state of the art by studying the results achieved by participants and providing a detailed analysis of two of the best performing multimodal systems. We elaborate on the lessons learned after 2 years to provide insights on future work emphasizing multimodal modeling and fusion.
Notes
- 1.
- 2.
- 3.
- 4. The development data is intended for designing and training the approaches.
- 5. The test set data is intended for the official benchmarking.
- 6. The Yaafe toolkit for audio feature extraction was used.
Acknowledgments
This work was partially supported by the Quaero Program. We would also like to acknowledge the MediaEval Multimedia Benchmark for providing the framework to evaluate the task of violent scene detection. We also thank the participants for consenting to the description of their systems and results in this chapter. More information about the MediaEval campaign is available at: http://www.multimediaeval.org/. The working notes proceedings of MediaEval 2011 and 2012, which include the participants’ contributions, can be found online at http://www.ceur-ws.org/Vol-807 and http://www.ceur-ws.org/Vol-927, respectively.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Demarty, CH., Penet, C., Ionescu, B., Gravier, G., Soleymani, M. (2014). Multimodal Violence Detection in Hollywood Movies: State-of-the-Art and Benchmarking. In: Ionescu, B., Benois-Pineau, J., Piatrik, T., Quénot, G. (eds) Fusion in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-05696-8_8
DOI: https://doi.org/10.1007/978-3-319-05696-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05695-1
Online ISBN: 978-3-319-05696-8
eBook Packages: Computer Science, Computer Science (R0)