Abstract
Automatic detection of aggressive situations is of high societal and scientific relevance. It has been argued that using data from multimodal sensors, for example video and sound, rather than from a single modality is bound to increase detection accuracy. We approach the problem of multimodal aggression detection from the viewpoint of a human observer and try to reproduce the observer's predictions automatically. Typically, a single ground truth covering all available modalities is used when training recognizers. We explore the benefits of adding an extra level of annotations, namely audio-only and video-only labels. We analyze these annotations and compare them to the multimodal case to gain more insight into how humans reason over multimodal data. We train classifiers and compare the results obtained when using unimodal and multimodal labels as ground truth. For both the audio and the video recognizer, performance increases when the unimodal labels are used.
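To make the experimental setup concrete, the sketch below contrasts the two training regimes described in the abstract: the same audio classifier is trained once with audio-only (unimodal) labels and once with multimodal labels as ground truth, and the cross-validated accuracies are compared. This is a minimal sketch under stated assumptions: the feature arrays and label tracks are hypothetical placeholders, and the SVM is one plausible classifier choice, not necessarily the one used in the paper.

```python
# Minimal sketch of the label-comparison experiment: train the same audio
# classifier twice, once with audio-only (unimodal) labels and once with
# multimodal labels as ground truth, then compare accuracy. Feature
# extraction is out of scope; `audio_features`, `labels_audio_only`, and
# `labels_multimodal` are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 200, 32

# Placeholder data: per-segment acoustic feature vectors and two label
# tracks (0 = neutral, 1 = aggressive) from the two annotation passes.
audio_features = rng.normal(size=(n_samples, n_features))
labels_audio_only = rng.integers(0, 2, size=n_samples)   # audio-only annotation
labels_multimodal = rng.integers(0, 2, size=n_samples)   # multimodal annotation

# An RBF-kernel SVM with feature standardization; any comparable
# classifier could be substituted here.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

for name, labels in [("audio-only labels", labels_audio_only),
                     ("multimodal labels", labels_multimodal)]:
    scores = cross_val_score(clf, audio_features, labels, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```

The same comparison can be run for the video recognizer by swapping in video features and the video-only label track; the paper's finding corresponds to the unimodal-label runs scoring higher than the multimodal-label runs.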
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Lefter, I., Rothkrantz, L.J.M., Burghouts, G., Yang, Z., Wiggers, P. (2011). Addressing Multimodality in Overt Aggression Detection. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol. 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_4
DOI: https://doi.org/10.1007/978-3-642-23538-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23537-5
Online ISBN: 978-3-642-23538-2