Speech Research at Google to Enable Universal Speech Interfaces

In: New Era for Robust Speech Recognition
Abstract

Since the wide adoption of smartphones, speech as an input modality has developed from a science-fiction dream into a widely accepted technology. The quality demanded of this technology to fuel that adoption is high and has been a continuous focus of research at Google. Early deployment of large neural network models, trained on large datasets, significantly improved core recognition accuracy. The adoption of novel approaches such as long short-term memory (LSTM) models and connectionist temporal classification (CTC) has further improved accuracy and reduced latency. In addition, algorithms for adaptive language modeling improve accuracy based on the context of the speech input. A focus on expanding coverage of the user population in terms of languages and speaker characteristics (e.g., child speech) has led to novel algorithms that further advance the vision of universal speech input. Continuing this trend, our most recent investigations have targeted noise and far-field robustness. Tackling speech processing in those environments will enable in-car, wearable, and in-the-home applications and as such be another step toward truly universal speech input. This chapter briefly describes the algorithmic developments at Google over the past decade that have brought speech processing to where it is today.
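The abstract mentions connectionist temporal classification (CTC) as one of the approaches that reduced latency and improved accuracy. As a minimal illustration of the idea (not the chapter's implementation), CTC lets an acoustic model emit one label per frame, including a special "blank" symbol, and recovers the output transcription by collapsing repeated labels and dropping blanks. The function name and blank symbol below are illustrative choices:

```python
def ctc_collapse(frame_labels, blank="_"):
    """Collapse a frame-level CTC label sequence into an output sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for lab in frame_labels:
        # Only emit a label when it differs from the previous frame's label
        # and is not the blank symbol; blanks also separate true repeats.
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# A 10-frame label sequence collapses to the word "hello"; the blank
# between the two 'l' runs preserves the doubled letter.
print("".join(ctc_collapse(list("hh_e_ll_lo"))))  # hello
```

This collapsing rule is what allows a frame-synchronous recurrent model to produce output sequences shorter than the input, without requiring a pre-computed frame-level alignment.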



Author information

Correspondence to Michiel Bacchiani.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Bacchiani, M. et al. (2017). Speech Research at Google to Enable Universal Speech Interfaces. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_18

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0

