
A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures

  • Conference paper
Statistical Language and Speech Processing (SLSP 2018)

Abstract

Recently, recurrent neural networks (RNNs) have become state-of-the-art in acoustic modeling for automatic speech recognition. Long short-term memory (LSTM) units are the most popular, although alternatives such as the gated recurrent unit (GRU) and its modifications have outperformed LSTM in some publications. In this paper, we compare five neural network (NN) architectures with various adaptation and feature normalization techniques. We evaluate feature-space maximum likelihood linear regression, five variants of i-vector adaptation, and two variants of cepstral mean normalization. Most adaptation and normalization techniques were developed for feed-forward NNs and, according to the results in this paper, not all of them also work with RNNs. For the experiments, we chose the well-known and available TIMIT phone recognition task. Phone recognition is much more sensitive to the quality of the acoustic model (AM) than a large-vocabulary task with a complex language model. We also published open-source scripts to allow easy replication of the results and to help continue the development.
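As background for one of the normalization techniques the abstract names, per-utterance cepstral mean normalization can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function name and the feature-matrix layout (frames by coefficients) are assumptions for illustration.

```python
import numpy as np

def cepstral_mean_normalization(features: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean from each cepstral coefficient.

    features: (num_frames, num_coeffs) matrix of cepstral features
              for a single utterance.
    Returns the features with zero mean along the time axis, which
    removes a constant channel/speaker offset from each coefficient.
    """
    return features - features.mean(axis=0, keepdims=True)

# Example: simulated features for a 100-frame utterance, 13 coefficients;
# the constant offset stands in for a channel effect that CMN removes.
feats = np.random.randn(100, 13) + 5.0
normalized = cepstral_mean_normalization(feats)
```

After normalization, every coefficient has zero mean over the utterance; speaker-level variants of the technique instead accumulate the mean over all utterances of a speaker.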




Acknowledgement

This work was supported by the project no. P103/12/G084 of the Grant Agency of the Czech Republic and by the grant of the University of West Bohemia, project No. SGS-2016-039. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Correspondence to Jan Vaněk.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Vaněk, J., Michálek, J., Zelinka, J., Psutka, J. (2018). A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures. In: Dutoit, T., Martín-Vide, C., Pironkov, G. (eds) Statistical Language and Speech Processing. SLSP 2018. Lecture Notes in Computer Science, vol 11171. Springer, Cham. https://doi.org/10.1007/978-3-030-00810-9_8

  • DOI: https://doi.org/10.1007/978-3-030-00810-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00809-3

  • Online ISBN: 978-3-030-00810-9

  • eBook Packages: Computer Science, Computer Science (R0)
