Abstract
Since the wide adoption of smartphones, speech as an input modality has developed from a science fiction dream into a widely accepted technology. The quality demanded of this technology to fuel that adoption is high and has been a continuous focus of research activities at Google. Early deployment of large neural network models, trained on large datasets, significantly improved core recognition accuracy. The adoption of novel approaches such as long short-term memory models and connectionist temporal classification has further improved accuracy and reduced latency. In addition, algorithms for adaptive language modeling improve accuracy based on the context of the speech input. A focus on expanding coverage of the user population in terms of languages and speaker characteristics (e.g., child speech) has led to novel algorithms that further advanced the vision of universal speech input. Continuing this trend, our most recent investigations have been into noise and far-field robustness. Tackling speech processing in those environments will enable in-car, wearable, and in-the-home applications, and as such be another step toward truly universal speech input. This chapter briefly describes the algorithmic developments at Google over the past decade that have brought speech processing to where it is today.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Bacchiani, M. et al. (2017). Speech Research at Google to Enable Universal Speech Interfaces. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds.) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_18
DOI: https://doi.org/10.1007/978-3-319-64680-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0