Investigating a Hybrid Learning Approach for Robust Automatic Speech Recognition

Pironkov, Gueorgui; Wood, Sean U. N.; Dupont, Stéphane; Dutoit, Thierry

doi:10.1007/978-3-030-00810-9_7

Conference paper
First Online: 19 September 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11171)

Abstract

Training an automatic speech recognition system requires speech together with its annotated transcriptions. Real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared with the amount of data that can be simulated by adding noise to clean annotated speech. Using both real and simulated data is therefore important for improving robust speech recognition. Another promising method for speech recognition in noisy and reverberant conditions is multi-task learning. A successful auxiliary task consists of generating clean speech features using a regression loss (as in a denoising auto-encoder). However, this auxiliary task uses clean speech as its targets, which implies that real data cannot be used, since no clean reference exists for real recordings. To tackle this problem, a Hybrid-Task Learning system is proposed. This system switches frequently between multi-task and single-task learning depending on whether the input is simulated or real data, respectively. We show that the relative improvement brought by the proposed hybrid-task learning architecture can reach up to 4.4% compared to the traditional single-task learning approach on the CHiME4 database.
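
The switching idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the main task is taken to be a cross-entropy classification loss, the auxiliary task a mean squared error against clean features, and the function names and the `aux_weight` value are hypothetical. When a clean target exists (simulated data) both losses are combined; when it does not (real data) only the main loss is used.

```python
import math


def cross_entropy(probs, target_index):
    """Main-task loss: negative log-probability of the correct class."""
    return -math.log(probs[target_index])


def mean_squared_error(estimate, target):
    """Auxiliary-task loss: regression towards clean speech features."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def hybrid_task_loss(probs, target_index, denoised=None, clean=None, aux_weight=0.1):
    """Loss for one example; aux_weight is an assumed, illustrative value.

    Simulated example (clean target available): multi-task loss.
    Real example (clean is None): single-task loss only.
    """
    main = cross_entropy(probs, target_index)
    if clean is None:
        return main
    return main + aux_weight * mean_squared_error(denoised, clean)


# Real recording: no clean reference, so only the recognition loss applies.
real_loss = hybrid_task_loss([0.5, 0.5], 0)

# Simulated recording: the clean features are known, so the denoising
# loss is added to the recognition loss.
simulated_loss = hybrid_task_loss(
    [0.5, 0.5], 0, denoised=[1.0, 2.0], clean=[1.0, 0.0]
)
```

With the same recognition output, the simulated example carries an extra penalty from the denoising error, while the real example is trained on the recognition loss alone.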

Acknowledgments

This work has been partly funded by the Walloon Region of Belgium through the SPW-DGO6 Wallinov Program n°1610152.

Author information

Authors and Affiliations

Numediart Institute, University of Mons, Mons, Belgium
Gueorgui Pironkov, Stéphane Dupont & Thierry Dutoit
NECOTIS, University of Sherbrooke, Sherbrooke, Canada
Sean U. N. Wood

Corresponding author

Correspondence to Gueorgui Pironkov.

Editor information

Editors and Affiliations

University of Mons, Mons, Belgium
Thierry Dutoit
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
University of Mons, Mons, Belgium
Gueorgui Pironkov

Cite this paper

Pironkov, G., Wood, S.U.N., Dupont, S., Dutoit, T. (2018). Investigating a Hybrid Learning Approach for Robust Automatic Speech Recognition. In: Dutoit, T., Martín-Vide, C., Pironkov, G. (eds) Statistical Language and Speech Processing. SLSP 2018. Lecture Notes in Computer Science, vol 11171. Springer, Cham. https://doi.org/10.1007/978-3-030-00810-9_7

DOI: https://doi.org/10.1007/978-3-030-00810-9_7
Published: 19 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00809-3
Online ISBN: 978-3-030-00810-9
eBook Packages: Computer Science, Computer Science (R0)
