
Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features

  • Conference paper
  • In: Speech and Computer (SPECOM 2017)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10458)

Abstract

Emotionally coloured speech recognition is a key technology toward achieving human-like spoken dialogue systems. However, despite rapid progress in automatic speech recognition (ASR) and emotion research, much less work has examined ASR systems that recognize the verbal content of emotionally coloured speech. Existing approaches to emotional speech recognition mostly adapt standard ASR models to include information about prosody and emotion. In this study, instead of adapting a model to handle emotional speech, we focus on feature transformation methods that resolve the mismatch and improve ASR performance. In this way, we can train the model on emotionally coloured speech without any explicit emotional annotation. We investigate two deep bottleneck network structures: deep neural networks (DNNs) and convolutional neural networks (CNNs). We hypothesize that the trained bottleneck features may extract the essential information that represents the verbal content while abstracting away from superficial differences caused by emotional variation. We also try various combinations of these two bottleneck features with feature-space speaker adaptation. Experiments on Japanese and English emotional speech data reveal that both varieties of bottleneck features and feature-space speaker adaptation successfully improve emotional speech recognition performance.
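As a rough illustration of the two ingredients described above, the sketch below builds a DNN acoustic model with a narrow bottleneck layer whose linear activations serve as features, then applies a per-speaker affine transform in the style of fMLLR [4] for feature-space adaptation. All specifics here are illustrative assumptions rather than the paper's configuration: the 440-dimensional spliced input, the 40-unit bottleneck, the 2,000 state targets, the sigmoid nonlinearity, and the point at which the transform is applied.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        """DNN acoustic model with a narrow bottleneck layer (sizes hypothetical)."""
        def __init__(self, in_dim=440, bn_dim=40, n_states=2000):
            super().__init__()
            # Hidden stack up to and including the bottleneck; its linear
            # activations are the extracted features.
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, 1024), nn.Sigmoid(),
                nn.Linear(1024, bn_dim),
            )
            # Layers above the bottleneck up to the HMM-state softmax; used
            # only while training with cross-entropy, discarded afterwards.
            self.decoder = nn.Sequential(
                nn.Sigmoid(),
                nn.Linear(bn_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, n_states),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))  # state logits for training

        def extract(self, x):
            return self.encoder(x)                # bottleneck features

    def apply_affine_adaptation(frames, A, b):
        # Feature-space speaker adaptation applies a per-speaker affine
        # transform x_hat = A x + b to every frame, as in fMLLR [4];
        # estimating A and b is left to an ASR toolkit such as Kaldi [15].
        return frames @ A.T + b

    model = BottleneckDNN()
    frames = torch.randn(100, 440)            # 100 spliced input frames
    bn_feats = model.extract(frames)          # (100, 40) bottleneck features
    A, b = torch.eye(40), torch.zeros(40)     # identity = no adaptation
    adapted = apply_affine_adaptation(bn_feats, A, b)

In a full system the adapted bottleneck features would typically be concatenated with or substituted for the spectral features fed to the recognizer; the abstract's "various combinations" refers to choices of this kind.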


Notes

  1. This framework was originally called a time-delay neural network [22] in speech recognition; see the sketch below.
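A minimal sketch of this equivalence, assuming PyTorch and arbitrary dimensions: a time-delay layer applies the same weights to every sliding window of frames, which is exactly a 1-D convolution over the time axis.

    import torch
    import torch.nn as nn

    feats = torch.randn(1, 40, 200)   # (batch, feature dim, frames)
    tdnn = nn.Conv1d(in_channels=40, out_channels=256, kernel_size=5)
    out = tdnn(feats)                 # (1, 256, 196): each output frame
                                      # sees a 5-frame temporal context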

References

  1. Arimoto, Y., Kawatsu, H., Ohno, S., Iida, H.: Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment. Acoust. Sci. Technol. 33(6), 359–369 (2012)

  2. Athanaselis, T., Bakamidis, S., Dologlou, I., Cowie, R., Douglas-Cowie, E., Cox, C.: ASR for emotional speech: clarifying the issues and enhancing performance. Neural Netw. 18(4), 437–444 (2005)

  3. Athanaselis, T., Bakamidis, S., Dologlou, I.: Recognizing verbal content of emotionally coloured speech. In: Proceedings of EUSIPCO, Florence, Italy (2006)

  4. Gales, M.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)

  5. Gales, M.: Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7(3), 272–281 (1999)

  6. Gopinath, R.: Maximum likelihood modeling with Gaussian distributions for classification. In: Proceedings of ICASSP, pp. 661–664 (1998)

  7. Maekawa, K., Koiso, H., Furui, S., Isahara, H.: Spontaneous speech corpus of Japanese. In: Proceedings of LREC, Athens, Greece, pp. 947–952 (2000)

  8. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2012)

  9. Miao, Y.: Kaldi+PDNN: building DNN-based ASR systems with Kaldi and PDNN. arXiv:1401.6984 (2014)

  10. Murray, I., Arnott, L.: Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. J. Acoust. Soc. Am. 93(2), 1097–1108 (1993)

  11. Paul, D., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proceedings of DARPA Speech and Language Workshop, San Mateo, USA (1992)

  12. Picard, R.: Affective Computing. MIT Press, Cambridge (1997)

  13. Plutchik, R.: A general psychoevolutionary theory of emotion. In: Theories of Emotion. Academic Press (1980)

  14. Polzin, S., Waibel, A.: Pronunciation variations in emotional speech. In: Proceedings of ESCA, pp. 103–108 (1998)

  15. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: Proceedings of ASRU, Hawaii, USA (2011)

  16. Schuller, B., Stadermann, J., Rigoll, G.: Affect-robust speech recognition by dynamic emotional adaptation. In: Proceedings of Speech Prosody (2006)

  17. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of INTERSPEECH, Brighton, United Kingdom, pp. 312–315 (2009)

  18. Schuller, B., Steidl, S., Burkhardt, F., Devillers, L., Muller, C., Narayanan, S.: The INTERSPEECH 2010 paralinguistic challenge. In: Proceedings of INTERSPEECH, Makuhari, Japan, pp. 2794–2797 (2010)

  19. Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., Pantic, M.: AVEC 2011 - the first international audio/visual emotion challenge. In: Proceedings of International Conference on Affective Computing and Intelligent Interaction (ACII), Memphis, Tennessee, pp. 415–424 (2011)

  20. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, USA, pp. 901–904 (2002)

  21. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)

  22. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)

  23. Williams, C., Stevens, K.: Emotion and speech: some acoustical correlates. J. Acoust. Soc. Am. 52, 1238–1250 (1972)


Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101, JP17H00747, and JP17K00237.

Author information

Correspondence to Sakriani Sakti.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Mukaihara, K., Sakti, S., Nakamura, S. (2017). Recognizing Emotionally Coloured Dialogue Speech Using Speaker-Adapted DNN-CNN Bottleneck Features. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_63

  • DOI: https://doi.org/10.1007/978-3-319-66429-3_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66428-6

  • Online ISBN: 978-3-319-66429-3

  • eBook Packages: Computer Science; Computer Science (R0)
