Abstract
Most technical communication systems use speech compression codecs to save transmission bandwidth. Considerable development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling, and a hybrid mode combining both. Our first assumption is that the hybrid mode improves speech intelligibility. However, enabling a natural spoken conversation also requires the affective, namely emotional, information contained in spoken language to be transmitted intelligibly. Compression methods are usually avoided in emotion recognition, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition [1]. By contrast, our second assumption states that the combination of psychoacoustic modeling and Analysis-by-Synthesis coding could actually improve speech-based emotion recognition by removing those parts of the acoustic signal that are considered "unnecessary", while retaining the full emotional information. To test both assumptions, we conducted ITU-recommended POLQA measurements as well as several emotion recognition experiments on two different datasets to verify the generality of these assumptions. We compared the results of the hybrid mode with those of Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode does not show remarkable differences regarding speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves an \(\sim\)3.3% higher absolute performance than the uncompressed samples.
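To make the comparison concrete, the following minimal Python sketch illustrates a pipeline of the kind described above: each utterance is transcoded with a codec representing one compression technique, decoded back to PCM, and passed to openSMILE for feature extraction (cf. Eyben et al., below). The codec choices, bitrates, file layout, and the emobase configuration name are illustrative assumptions, not the exact setup of this paper.

```python
import subprocess
from pathlib import Path

# Illustrative codec settings (assumptions, not the paper's exact setup):
# AMR-WB as an Analysis-by-Synthesis codec, AAC as a psychoacoustic codec,
# and Opus fed super-wideband audio so it can select its hybrid mode.
CODECS = {
    "amr_wb":      (".amr",  ["-c:a", "libvo_amrwbenc", "-ar", "16000", "-b:a", "23.85k"]),
    "aac":         (".m4a",  ["-c:a", "aac", "-b:a", "32k"]),
    "opus_hybrid": (".opus", ["-c:a", "libopus", "-b:a", "32k"]),
}

def transcode_and_decode(wav_in: Path, tag: str, ext: str, args: list) -> Path:
    """Encode with the given codec, then decode back to WAV so that
    feature extraction always operates on PCM audio."""
    coded = wav_in.with_suffix(ext)
    decoded = wav_in.with_suffix(f".{tag}.wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(wav_in), *args, str(coded)], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(coded), str(decoded)], check=True)
    return decoded

def extract_features(wav: Path, arff_out: Path) -> None:
    """Run openSMILE's command-line extractor; the feature configuration
    is an assumption (any acoustic emotion feature set would do)."""
    subprocess.run(["SMILExtract", "-C", "config/emobase.conf",
                    "-I", str(wav), "-O", str(arff_out)], check=True)

for wav in sorted(Path("corpus").glob("*.wav")):
    for tag, (ext, args) in CODECS.items():
        decoded = transcode_and_decode(wav, tag, ext, args)
        extract_features(decoded, Path("features") / f"{wav.stem}.{tag}.arff")
```

The resulting ARFF feature files can then be fed to a classifier, for example in WEKA (cf. Hall et al., below).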
Notes
- 1. We upsampled emoDB from 16 kHz to 20 kHz; hence, the bitrate also increased from 256 kbit/s to 320 kbit/s. This does not add missing information in the high-frequency range, but it forces Opus to use the hybrid mode, as this mode is only available for super-wideband and fullband signals.
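The resampling step in this note is straightforward to reproduce; the sketch below is one possible realization using scipy and soundfile, with a hypothetical emoDB file name. Upsampling by the rational factor 5/4 converts 16 kHz to 20 kHz without creating content above the original 8 kHz Nyquist limit.

```python
import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical input: one 16 kHz emoDB utterance. Upsampling by the
# rational factor 5/4 yields 20 kHz. The polyphase filter only
# interpolates; no energy is added above the original 8 kHz Nyquist
# limit, so no missing high-frequency information is fabricated.
audio, rate = sf.read("emodb/03a01Fa.wav")
assert rate == 16000

upsampled = resample_poly(audio, up=5, down=4)  # 16 kHz * 5/4 = 20 kHz
sf.write("emodb_20k/03a01Fa.wav", upsampled, 20000)
```

At 16-bit resolution this matches the bitrates stated above: 16 kHz × 16 bit = 256 kbit/s and 20 kHz × 16 bit = 320 kbit/s.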
References
Albahri, A., Lech, M., Cheng, E.: Effect of speech compression on the automatic recognition of emotions. IJSPS 4(1), 55–61 (2016)
Biundo, S., Wendemuth, A.: Companion-technology for cognitive technical systems. KI - Künstliche Intelligenz 30(1), 71–75 (2016)
Brandenburg, K.: MP3 and AAC explained. In: 17th AES International Conference: High-Quality Audio Coding, Florence, Italy, September 1999
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520, Lisbon, Portugal (2005)
Byrne, C., Foulkes, P.: The ‘mobile phone effect’ on vowel formants. Int. J. Speech Lang. Law 11(1), 83–102 (2004)
Dhall, A., Goecke, R., Gedeon, T., Sebe, N.: Emotion recognition in the wild. J. Multimodal User Interfaces 10, 95–97 (2016)
Engberg, I.S., Hansen, A.V.: Documentation of the Danish Emotional Speech database (DES). Tech. rep., Aalborg University, Denmark (1996)
Eyben, F., Wöllmer, M., Schuller, B.: openSMILE - the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the ACM MM-2010, Firenze, Italy (2010)
García, N., Vásquez-Correa, J.C., Arias-Londoño, J.D., Várgas-Bonilla, J.F., Orozco-Arroyave, J.R.: Automatic emotion recognition in compressed speech using acoustic and non-linear features. In: Proceedings of STSIVA 2015, pp. 1–7 (2015)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Hoene, C., Valin, J.M., Vos, K., Skoglund, J.: Summary of Opus listening test results draft-valin-codec-results-03. Internet-draft, IETF (2013)
IBM Corporation and Microsoft Corporation: Multimedia programming interface and data specifications 1.0. Tech. rep., August 1991
ITU-T: Methods for subjective determination of transmission quality. REC P.800 (1996), https://www.itu.int/rec/T-REC-P.800-199608-I/en
ITU-T: Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB). REC G.722.2 (2003), https://www.itu.int/rec/T-REC-G.722.2-200307-I/en
ITU-T: Methods for objective and subjective assessment of speech quality (POLQA): Perceptual Objective Listening Quality Assessment. REC P.863, September 2014, http://www.itu.int/rec/T-REC-P.863-201409-I/en
Jokisch, O., Maruschke, M., Meszaros, M., Iaroshenko, V.: Audio and speech quality survey of the Opus codec in web real-time communication. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, Leipzig, Germany, pp. 254–262 (2016)
Lotz, A.F., Siegert, I., Maruschke, M., Wendemuth, A.: Audio compression and its impact on emotion recognition in affective computing. In: Elektronische Sprachsignalverarbeitung 2017, vol. 86, Saarbrücken, Germany, pp. 1–8 (2017)
Paulsen, S.: QoS/QoE-Modelle für den Dienst Voice over IP (VoIP). Ph.D. thesis, Universität Hamburg (2015)
Pfister, T., Robinson, P.: Speech emotion classification and public speaking skill assessment. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.) HBU 2010. LNCS, vol. 6219, pp. 151–162. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14715-9_15
Rämö, A., Toukomaa, H.: Voice quality characterization of IETF Opus codec. In: Proceedings of the INTERSPEECH-2011, pp. 2541–2544, Florence, Italy (2011)
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., Wendemuth, A.: Acoustic emotion recognition: a benchmark comparison of performances. In: Proceedings of the IEEE ASRU-2009, Merano, Italy, pp. 552–557 (2009)
Siegert, I., Lotz, A.F., Duong, L.L., Wendemuth, A.: Measuring the impact of audio compression on the spectral quality of speech data. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, Leipzig, Germany, pp. 229–236 (2016)
Siegert, I., Lotz, A.F., Maruschke, M., Jokisch, O., Wendemuth, A.: Emotion intelligibility within codec-compressed and reduced bandwidth speech. In: Proceedings of the 12th ITG Conference on Speech Communication (ITG-Fb. 267), Paderborn, Germany, pp. 215–219. VDE Verlag (2016)
Steininger, S., Schiel, F., Dioubina, O., Raubold, S.: Development of user-state conventions for the multimodal corpus in SmartKom. In: Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, pp. 33–37 (2002)
Tickle, A., Raghu, S., Elshaw, M.: Emotional recognition from the speech signal for a virtual education agent. J. Phys.: Conf. Ser., vol. 450, p. 012053 (2013)
Valin, J.M., Vos, K., Terriberry, T.: Definition of the Opus Audio Codec. RFC 6716, September 2012, http://tools.ietf.org/html/rfc6716
Valin, J.M., Maxwell, G., Terriberry, T.B., Vos, K.: The Opus codec. In: 135th AES International Convention, New York, USA, October 2013
Ververidis, D., Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48, 1162–1181 (2006)
Vásquez-Correa, J.C., García, N., Vargas-Bonilla, J.F., Orozco-Arroyave, J.R., Arias-Londoño, J.D., Quintero, M.O.L.: Evaluation of wavelet measures on automatic detection of emotion in noisy and telephony speech signals. In: International Carnahan Conference on Security Technology, pp. 1–6 (2014)
Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39–58 (2009)
Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings of the IEEE ASRU-2011, Waikoloa, USA, pp. 523–528 (2011)
Acknowledgments
The authors are grateful for the continued support of the SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" (www.sfb-trr-62.de), funded by the German Research Foundation (DFG). This work has also been sponsored by the Federal Ministry of Education and Research within the program Zwanzig20 – Partnership for Innovation, as part of the research alliance 3Dsensation (www.3d-sensation.de). We would further like to thank SwissQual AG (a Rohde & Schwarz company), in particular Jens Berger, for supplying the POLQA testbed.
Cite this paper
Siegert, I., Lotz, A.F., Egorow, O., Wendemuth, A. (2017). Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis. In: Karpov, A., Potapova, R., Mporas, I. (eds.) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_44