Abstract
Most technical communication systems use speech compression codecs to save transmission bandwidth. Considerable development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling, and a hybrid mode combining both. Our first assumption is that the hybrid mode improves speech intelligibility. However, enabling a natural spoken conversation also requires the affective, namely emotional, information contained in spoken language to be transmitted intelligibly. Compression methods are usually avoided in emotion recognition, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition [1]. By contrast, our second assumption states that the combination of psychoacoustic modeling and Analysis-by-Synthesis coding could actually improve speech-based emotion recognition by removing those parts of the acoustic signal that are considered "unnecessary", while retaining the full emotional information. To test both assumptions, we conducted ITU-recommended POLQA measurements as well as several emotion recognition experiments on two different datasets to verify the generality of these assumptions. We compared the results of the hybrid mode with those of Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode does not show remarkable differences regarding speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves an \(\sim\)3.3% higher absolute performance than the uncompressed samples.
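To make the comparison concrete, the following minimal Python sketch illustrates a pipeline of the kind described above: each utterance is transcoded with a codec representing one compression technique, decoded back to PCM, and passed to openSMILE for feature extraction (cf. Eyben et al., below). The codec choices, bitrates, file layout, and the emobase configuration name are illustrative assumptions, not the exact setup of this paper.

```python
import subprocess
from pathlib import Path

# Illustrative codec settings (assumptions, not the paper's exact setup):
# AMR-WB as an Analysis-by-Synthesis codec, AAC as a psychoacoustic codec,
# and Opus fed super-wideband audio so it can select its hybrid mode.
CODECS = {
    "amr_wb":      (".amr",  ["-c:a", "libvo_amrwbenc", "-ar", "16000", "-b:a", "23.85k"]),
    "aac":         (".m4a",  ["-c:a", "aac", "-b:a", "32k"]),
    "opus_hybrid": (".opus", ["-c:a", "libopus", "-b:a", "32k"]),
}

def transcode_and_decode(wav_in: Path, tag: str, ext: str, args: list) -> Path:
    """Encode with the given codec, then decode back to WAV so that
    feature extraction always operates on PCM audio."""
    coded = wav_in.with_suffix(ext)
    decoded = wav_in.with_suffix(f".{tag}.wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(wav_in), *args, str(coded)], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(coded), str(decoded)], check=True)
    return decoded

def extract_features(wav: Path, arff_out: Path) -> None:
    """Run openSMILE's command-line extractor; the feature configuration
    is an assumption (any acoustic emotion feature set would do)."""
    subprocess.run(["SMILExtract", "-C", "config/emobase.conf",
                    "-I", str(wav), "-O", str(arff_out)], check=True)

for wav in sorted(Path("corpus").glob("*.wav")):
    for tag, (ext, args) in CODECS.items():
        decoded = transcode_and_decode(wav, tag, ext, args)
        extract_features(decoded, Path("features") / f"{wav.stem}.{tag}.arff")
```

The resulting ARFF feature files can then be fed to a classifier, for example in WEKA (cf. Hall et al., below).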
Notes
- 1. We upsampled emoDB from 16 kHz to 20 kHz; hence, the bitrate also increased from 256 kbit/s to 320 kbit/s. This does not add missing information in the high-frequency range, but it forces Opus to use the hybrid mode, as this mode is only available for super-wideband and fullband signals.
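The resampling step in this note is straightforward to reproduce; the sketch below is one possible realization using scipy and soundfile, with a hypothetical emoDB file name. Upsampling by the rational factor 5/4 converts 16 kHz to 20 kHz without creating content above the original 8 kHz Nyquist limit.

```python
import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical input: one 16 kHz emoDB utterance. Upsampling by the
# rational factor 5/4 yields 20 kHz. The polyphase filter only
# interpolates; no energy is added above the original 8 kHz Nyquist
# limit, so no missing high-frequency information is fabricated.
audio, rate = sf.read("emodb/03a01Fa.wav")
assert rate == 16000

upsampled = resample_poly(audio, up=5, down=4)  # 16 kHz * 5/4 = 20 kHz
sf.write("emodb_20k/03a01Fa.wav", upsampled, 20000)
```

At 16-bit resolution this matches the bitrates stated above: 16 kHz × 16 bit = 256 kbit/s and 20 kHz × 16 bit = 320 kbit/s.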
References
Albahri, A., Lech, M., Cheng, E.: Effect of speech compression on the automatic recognition of emotions. IJSPS 4(1), 55–61 (2016)
Biundo, S., Wendemuth, A.: Companion-technology for cognitive technical systems. KI - Künstliche Intelligenz 30(1), 71–75 (2016)
Brandenburg, K.: MP3 and AAC explained. In: 17th AES International Conference: High-Quality Audio Coding, Florence, Italy, September 1999
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520, Lisbon, Portugal (2005)
Byrne, C., Foulkes, P.: The ‘mobile phone effect’ on vowel formants. Int. J. Speech Lang. Law 11(1), 83–102 (2004)
Dhall, A., Goecke, R., Gedeon, T., Sebe, N.: Emotion recognition in the wild. J. Multimodal User Interfaces 10, 95–97 (2016)
Engberg, I.S., Hansen, A.V.: Documentation of the Danish Emotional Speech database (DES). Tech. rep., Aalborg University, Denmark (1996)
Eyben, F., Wöllmer, M., Schuller, B.: openSMILE - the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the ACM MM-2010, Firenze, Italy (2010)
García, N., Vásquez-Correa, J.C., Arias-Londoño, J.D., Várgas-Bonilla, J.F., Orozco-Arroyave, J.R.: Automatic emotion recognition in compressed speech using acoustic and non-linear features. In: Proceedings of STSIVA 2015, pp. 1–7 (2015)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Hoene, C., Valin, J.M., Vos, K., Skoglund, J.: Summary of Opus listening test results draft-valin-codec-results-03. Internet-draft, IETF (2013)
IBM Corporation and Microsoft Corporation: Multimedia programming interface and data specifications 1.0. Tech. rep., August 1991
ITU-T: Methods for subjective determination of transmission quality. REC P.800 (1996), https://www.itu.int/rec/T-REC-P.800-199608-I/en
ITU-T: Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB). REC G.722.2 (2003), https://www.itu.int/rec/T-REC-G.722.2-200307-I/en
ITU-T: Methods for objective and subjective assessment of speech quality (POLQA): Perceptual Objective Listening Quality Assessment. REC P.863, September 2014, http://www.itu.int/rec/T-REC-P.863-201409-I/en
Jokisch, O., Maruschke, M., Meszaros, M., Iaroshenko, V.: Audio and speech quality survey of the Opus codec in web real-time communication. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, Leipzig, Germany, pp. 254–262 (2016)
Lotz, A.F., Siegert, I., Maruschke, M., Wendemuth, A.: Audio compression and its impact on emotion recognition in affective computing. In: Elektronische Sprachsignalverarbeitung 2017, vol. 86, Saarbrücken, Germany, pp. 1–8 (2017)
Paulsen, S.: QoS/QoE-Modelle für den Dienst Voice over IP (VoIP). Ph.D. thesis, Universität Hamburg (2015)
Pfister, T., Robinson, P.: Speech emotion classification and public speaking skill assessment. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.) HBU 2010. LNCS, vol. 6219, pp. 151–162. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14715-9_15
Rämö, A., Toukomaa, H.: Voice quality characterization of IETF Opus codec. In: Proceedings of the INTERSPEECH-2011, pp. 2541–2544, Florence, Italy (2011)
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., Wendemuth, A.: Acoustic emotion recognition: a benchmark comparison of performances. In: Proceedings of the IEEE ASRU-2009, Merano, Italy, pp. 552–557 (2009)
Siegert, I., Lotz, A.F., Duong, L.L., Wendemuth, A.: Measuring the impact of audio compression on the spectral quality of speech data. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, Leipzig, Germany, pp. 229–236 (2016)
Siegert, I., Lotz, A.F., Maruschke, M., Jokisch, O., Wendemuth, A.: Emotion intelligibility within codec-compressed and reduced bandwidth speech. In: Proceedings of the 12th ITG Conference on Speech Communication (ITG-Fb. 267), Paderborn, Germany, pp. 215–219. VDE Verlag (2016)
Steininger, S., Schiel, F., Dioubina, O., Raubold, S.: Development of user-state conventions for the multimodal corpus in SmartKom. In: Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, pp. 33–37 (2002)
Tickle, A., Raghu, S., Elshaw, M.: Emotional recognition from the speech signal for a virtual education agent. J. Phys.: Conf. Ser., vol. 450, p. 012053 (2013)
Valin, J.M., Vos, K., Terriberry, T.: Definition of the Opus Audio Codec. RFC 6716, September 2012, http://tools.ietf.org/html/rfc6716
Valin, J.M., Maxwell, G., Terriberry, T.B., Vos, K.: The Opus codec. In: 135th AES International Convention, New York, USA, October 2013
Ververidis, D., Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48, 1162–1181 (2006)
Vásquez-Correa, J.C., García, N., Vargas-Bonilla, J.F., Orozco-Arroyave, J.R., Arias-Londoño, J.D., Quintero, M.O.L.: Evaluation of wavelet measures on automatic detection of emotion in noisy and telephony speech signals. In: International Carnahan Conference on Security Technology, pp. 1–6 (2014)
Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39–58 (2009)
Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings of the IEEE ASRU-2011, Waikoloa, USA, pp. 523–528 (2011)
Acknowledgments
The authors are grateful for the continued support of the SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" (www.sfb-trr-62.de), funded by the German Research Foundation (DFG). This work has also been sponsored by the Federal Ministry of Education and Research within the program Zwanzig20 – Partnership for Innovation, as part of the research alliance 3Dsensation (www.3d-sensation.de). We would further like to thank SwissQual AG (a Rohde & Schwarz company), in particular Jens Berger, for supplying the POLQA testbed.
Cite this paper
Siegert, I., Lotz, A.F., Egorow, O., Wendemuth, A. (2017). Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis. In: Karpov, A., Potapova, R., Mporas, I. (eds.) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_44