
Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features

  • Conference paper
  • In: Speech and Computer (SPECOM 2017)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

Emotionally coloured speech recognition is a key technology toward achieving human-like spoken dialogue systems. However, despite rapid progress in automatic speech recognition (ASR) and emotion research, much less work has examined ASR systems that recognize the verbal content of emotionally coloured speech. Existing approaches to emotional speech recognition mostly adapt standard ASR models to include information about prosody and emotion. In this study, instead of adapting a model to handle emotional speech, we focus on feature transformation methods that resolve the mismatch and improve ASR performance. In this way, we can train the model on emotionally coloured speech without any explicit emotional annotation. We investigate two deep bottleneck network structures: deep neural networks (DNNs) and convolutional neural networks (CNNs). We hypothesize that the trained bottleneck features may extract the essential information that represents the verbal content while abstracting away from superficial differences caused by emotional variation. We also try various combinations of these two bottleneck features with feature-space speaker adaptation. Experiments on Japanese and English emotional speech data reveal that both varieties of bottleneck features and feature-space speaker adaptation successfully improve emotional speech recognition performance.
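As a rough illustration of the two ingredients described above, the sketch below builds a DNN acoustic model with a narrow bottleneck layer whose linear activations serve as features, then applies a per-speaker affine transform in the style of fMLLR [4] for feature-space adaptation. All specifics here are illustrative assumptions rather than the paper's configuration: the 440-dimensional spliced input, the 40-unit bottleneck, the 2,000 state targets, the sigmoid nonlinearity, and the point at which the transform is applied.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        """DNN acoustic model with a narrow bottleneck layer (sizes hypothetical)."""
        def __init__(self, in_dim=440, bn_dim=40, n_states=2000):
            super().__init__()
            # Hidden stack up to and including the bottleneck; its linear
            # activations are the extracted features.
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, 1024), nn.Sigmoid(),
                nn.Linear(1024, bn_dim),
            )
            # Layers above the bottleneck up to the HMM-state softmax; used
            # only while training with cross-entropy, discarded afterwards.
            self.decoder = nn.Sequential(
                nn.Sigmoid(),
                nn.Linear(bn_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, n_states),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))  # state logits for training

        def extract(self, x):
            return self.encoder(x)                # bottleneck features

    def apply_affine_adaptation(frames, A, b):
        # Feature-space speaker adaptation applies a per-speaker affine
        # transform x_hat = A x + b to every frame, as in fMLLR [4];
        # estimating A and b is left to an ASR toolkit such as Kaldi [15].
        return frames @ A.T + b

    model = BottleneckDNN()
    frames = torch.randn(100, 440)            # 100 spliced input frames
    bn_feats = model.extract(frames)          # (100, 40) bottleneck features
    A, b = torch.eye(40), torch.zeros(40)     # identity = no adaptation
    adapted = apply_affine_adaptation(bn_feats, A, b)

In a full system the adapted bottleneck features would typically be concatenated with or substituted for the spectral features fed to the recognizer; the abstract's "various combinations" refers to choices of this kind.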


Notes

  1. This framework was originally called a time-delay neural network [22] in speech recognition; see the sketch below.
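A minimal sketch of this equivalence, assuming PyTorch and arbitrary dimensions: a time-delay layer applies the same weights to every sliding window of frames, which is exactly a 1-D convolution over the time axis.

    import torch
    import torch.nn as nn

    feats = torch.randn(1, 40, 200)   # (batch, feature dim, frames)
    tdnn = nn.Conv1d(in_channels=40, out_channels=256, kernel_size=5)
    out = tdnn(feats)                 # (1, 256, 196): each output frame
                                      # sees a 5-frame temporal context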

References

  1. Arimoto, Y., Kawatsu, H., Ohno, S., Iida, H.: Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment. Acoust. Sci. Technol. 33(6), 359–369 (2012)

  2. Athanaselis, T., Bakamidis, S., Dologlou, I., Cowie, R., Douglas-Cowie, E., Cox, C.: ASR for emotional speech: clarifying the issues and enhancing performance. Neural Netw. 18(4), 437–444 (2005)

  3. Athanaselis, T., Bakamidis, S., Dologlou, I.: Recognizing verbal content of emotionally coloured speech. In: Proceedings of EUSIPCO, Florence, Italy (2006)

  4. Gales, M.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)

  5. Gales, M.: Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7(3), 272–281 (1999)

  6. Gopinath, R.: Maximum likelihood modeling with Gaussian distributions for classification. In: Proceedings of ICASSP, pp. 661–664 (1998)

  7. Maekawa, K., Koiso, H., Furui, S., Isahara, H.: Spontaneous speech corpus of Japanese. In: Proceedings of LREC, Athens, Greece, pp. 947–952 (2000)

  8. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012)

  9. Miao, Y.: Kaldi+PDNN: building DNN-based ASR systems with Kaldi and PDNN. arXiv:1401.6984 (2014)

  10. Murray, I., Arnott, L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Acoust. Soc. Am. 93(2), 1097–1108 (1993)

  11. Paul, D., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proceedings of DARPA Speech and Language Workshop, San Mateo, USA (1992)

  12. Picard, R.: Affective Computing. MIT Press, Cambridge (1997)

  13. Plutchik, R.: A general psychoevolutionary theory of emotion. In: Theories of Emotion. Academic Press (1980)

  14. Polzin, S., Waibel, A.: Pronunciation variations in emotional speech. In: Proceedings of ESCA, pp. 103–108 (1998)

  15. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: Proceedings of ASRU, Hawaii, USA (2011)

  16. Schuller, B., Stadermann, J., Rigoll, G.: Affect-robust speech recognition by dynamic emotional adaptation. In: Proceedings of Speech Prosody (2006)

  17. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of INTERSPEECH, Brighton, United Kingdom, pp. 312–315 (2009)

  18. Schuller, B., Steidl, S., Burkhardt, F., Devillers, L., Muller, C., Narayanan, S.: The INTERSPEECH 2010 paralinguistic challenge. In: Proceedings of INTERSPEECH, Makuhari, Japan, pp. 2794–2797 (2010)

  19. Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: AVEC 2011 - the first international audio/visual emotion challenge. In: Proceedings of International Conference on Affective Computing and Intelligent Interaction (ACII), Memphis, Tennessee, pp. 415–424 (2011)

  20. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, USA, pp. 901–904 (2002)

  21. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)

  22. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)

  23. Williams, C., Stevens, K.: Emotion and speech: some acoustical correlates. J. Acoust. Soc. Am. 52, 1238–1250 (1972)


Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101, JP17H00747, and JP17K00237.

Author information

Correspondence to Sakriani Sakti.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Mukaihara, K., Sakti, S., Nakamura, S. (2017). Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_63

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science; Computer Science (R0)
