Abstract
By analyzing a multimodal (audio-visual) database of aggressive incidents in trains, we observed that no trivial fusion algorithm successfully predicts multimodal aggression from unimodal sensor inputs. We proposed a fusion framework that places a set of intermediate-level variables (meta-features) between the low-level sensor features and multimodal aggression detection [1]. In this paper we predict the multimodal level of aggression and two of the meta-features, Context and Semantics. The predictions are based on the audio stream, from which we extract both acoustic (nonverbal) and linguistic (verbal) information. Because the speech in the database is spontaneous, we rely on a keyword spotting approach for the verbal information. We identified six semantic groups of keywords that have a positive influence on the prediction of aggression and of the two meta-features.
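As a rough illustration of how keyword spotting can yield per-group linguistic features, the sketch below counts keyword hits per semantic group in a text transcript. It is a minimal sketch, not the authors' implementation: the group names, the keywords, the `KEYWORD_GROUPS` table, and the `keyword_group_counts` function are all hypothetical, since the abstract does not list the six groups, and the sketch assumes a transcript is already available rather than spotting keywords directly in audio.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical semantic groups and keywords, for illustration only.
# The six groups found in the paper are not listed in the abstract.
KEYWORD_GROUPS = {
    "insult": {"idiot", "stupid"},
    "threat": {"police", "fight"},
    "ticket": {"ticket", "fine"},
}


def keyword_group_counts(transcript: str) -> Dict[str, int]:
    """Count keyword occurrences per semantic group in a transcript."""
    # Lowercase and split into word tokens, ignoring punctuation.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    token_counts = Counter(tokens)
    # One count per group: the summed frequency of that group's keywords.
    return {
        group: sum(token_counts[word] for word in words)
        for group, words in KEYWORD_GROUPS.items()
    }


counts = keyword_group_counts("Show me your ticket, you idiot! No ticket, no ride.")
print(counts)
```

Counts of this kind could be appended to acoustic features before classification, which is one plausible way to combine the verbal and nonverbal information described above.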
References
Lefter, I., Burghouts, G., Rothkrantz, L.: Learning the fusion of audio and video aggression assessment by meta-information from human annotations. In: International Conference on Information Fusion, FUSION (in press, 2012)
Lefter, I., Burghouts, G., Rothkrantz, L.: Automatic audio-visual fusion for aggression detection using meta-information. In: IEEE Conference on Advanced Video and Signal Based Surveillance, AVSS (in press, 2012)
Atrey, P.K., Hossain, M.A., Saddik, A.E., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: A survey. Springer Multimedia Systems Journal, 345–379 (2010)
Schuller, B., Rigoll, G., Lang, M.: Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2004, vol. 1, pp. I-577–I-580 (2004)
Eyben, F., Wöllmer, M., Valstar, M., Gunes, H., Schuller, B., Pantic, M.: String-based audiovisual fusion of behavioural events for the assessment of dimensional affect. In: 2011 IEEE International Conference on Automatic Face Gesture Recognition and Workshops, FG 2011, pp. 322–329 (2011)
Whissell, C.M.: The dictionary of affect in language, vol. 4, pp. 113–131. Academic Press (1989)
Xu, H., Chua, T.S.: The fusion of audio-visual features and external knowledge for event detection in team sports video. In: Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval, MIR 2004, pp. 127–134. ACM, New York (2004)
Lefter, I., Rothkrantz, L.J.M., Wiggers, P., van Leeuwen, D.A.: Emotion Recognition from Speech by Combining Databases and Fusion of Classifiers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS, vol. 6231, pp. 353–360. Springer, Heidelberg (2010)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lefter, I., Rothkrantz, L.J.M., Burghouts, G.J. (2012). Aggression Detection in Speech Using Sensor and Semantic Information. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_81
DOI: https://doi.org/10.1007/978-3-642-32790-2_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32789-6
Online ISBN: 978-3-642-32790-2
eBook Packages: Computer Science, Computer Science (R0)