Multimodal Violence Detection in Hollywood Movies: State-of-the-Art and Benchmarking

Demarty, Claire-Hélène; Penet, Cédric; Ionescu, Bogdan; Gravier, Guillaume; Soleymani, Mohammad

doi:10.1007/978-3-319-05696-8_8

Chapter
First Online: 01 January 2014

Part of the book series: Advances in Computer Vision and Pattern Recognition (ACVPR)

Abstract

This chapter introduces a benchmark evaluation targeting the detection of violent scenes in Hollywood movies. The evaluation was implemented in 2011 and 2012 as an affect task in the framework of the international MediaEval benchmark initiative. We report on these 2 years of evaluation, providing a detailed description of the dataset created, describing the state of the art by studying the results achieved by participants and providing a detailed analysis of two of the best performing multimodal systems. We elaborate on the lessons learned after 2 years to provide insights on future work emphasizing multimodal modeling and fusion.

Notes

1. http://www.multimediaeval.org/
2. http://www.technicolor.com/
3. http://www.nist.gov/itl/iad/mig/sed.cfm
4. The development data is intended for designing and training the approaches.
5. The test set data is intended for the official benchmarking.
6. The Yaafe toolkit was used for audio feature extraction.

Acknowledgments

This work was partially supported by the Quaero Program. We would also like to acknowledge the MediaEval Multimedia Benchmark for providing the framework to evaluate the task of violent scene detection. We are also grateful to the participants for consenting to have their systems and results described in this paper. More information about the MediaEval campaign is available at http://www.multimediaeval.org/. The working notes proceedings of MediaEval 2011 and 2012, which include the participants’ contributions, can be found online at http://www.ceur-ws.org/Vol-807 and http://www.ceur-ws.org/Vol-927, respectively.

Author information

Authors and Affiliations

Technicolor, 975 av. des Champs Blancs, 35576, Cesson Sévigné Cedex, France
Claire-Hélène Demarty & Cédric Penet
LAPI, University Politehnica of Bucharest, 061071, Bucharest, Romania
Bogdan Ionescu
IRISA and INRIA Rennes, 35042, Rennes Cedex, France
Guillaume Gravier
iBUG, Imperial College London, London, SW7 2AZ, UK
Mohammad Soleymani

Corresponding author

Correspondence to Claire-Hélène Demarty.

Editor information

Editors and Affiliations

University Politehnica of Bucharest, Romania
Bogdan Ionescu
University of Bordeaux, Talence, France
Jenny Benois-Pineau
Queen Mary University of London, London, United Kingdom
Tomas Piatrik
Lab. of Informatics of Grenoble, France
Georges Quénot

About this chapter

Cite this chapter

Demarty, CH., Penet, C., Ionescu, B., Gravier, G., Soleymani, M. (2014). Multimodal Violence Detection in Hollywood Movies: State-of-the-Art and Benchmarking. In: Ionescu, B., Benois-Pineau, J., Piatrik, T., Quénot, G. (eds) Fusion in Computer Vision. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-05696-8_8

DOI: https://doi.org/10.1007/978-3-319-05696-8_8
Published: 26 March 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05695-1
Online ISBN: 978-3-319-05696-8
eBook Packages: Computer Science, Computer Science (R0)
