Abstract
Since the wide adoption of smartphones, speech as an input modality has developed from a science fiction dream into a widely accepted technology. The quality demanded of this technology to fuel that adoption is high and has been a continuous focus of research activities at Google. Early deployment of large neural network models, trained on large datasets, significantly improved core recognition accuracy. The adoption of novel approaches such as long short-term memory models and connectionist temporal classification has further improved accuracy and reduced latency. In addition, algorithms for adaptive language modeling improve accuracy based on the context of the speech input. A focus on expanding coverage of the user population in terms of languages and speaker characteristics (e.g., child speech) has led to novel algorithms that further advanced the vision of universal speech input. Continuing this trend, our most recent investigations have been into noise and far-field robustness. Tackling speech processing in those environments will enable in-car, wearable, and in-the-home applications, and as such be another step toward truly universal speech input. This chapter briefly describes the algorithmic developments at Google over the past decade that have brought speech processing to where it is today.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Bacchiani, M. et al. (2017). Speech Research at Google to Enable Universal Speech Interfaces. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds.) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_18
DOI: https://doi.org/10.1007/978-3-319-64680-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0