Robust Features in Deep-Learning-Based Speech Recognition

Abstract

Recent progress in deep learning has revolutionized speech recognition research, with Deep Neural Networks (DNNs) becoming the new state of the art for acoustic modeling. DNNs offer significantly lower speech recognition error rates than the previously used Gaussian Mixture Models (GMMs). Unfortunately, DNNs are data sensitive, and unseen data conditions can deteriorate their performance. Acoustic distortions such as noise, reverberation, and channel differences add variation to the speech signal, which in turn degrades DNN acoustic-model performance. A straightforward solution is to train the DNN models on these types of variation, which typically yields quite impressive performance. However, anticipating such variation is not always possible; in those cases, DNN recognition performance can deteriorate quite sharply. To avoid subjecting acoustic models to such variation, robust features have traditionally been used to create an invariant representation of the acoustic space. Most commonly, robust feature-extraction strategies have explored three principal areas: (a) enhancing the speech signal, with the goal of improving the perceptual quality of speech; (b) reducing the distortion footprint, using signal-theoretic techniques to learn the distortion characteristics and then filter them out of the speech signal; and (c) leveraging knowledge from auditory neuroscience and psychoacoustics, through robust features inspired by auditory perception.
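To make strategy (b) concrete, the sketch below implements classic magnitude spectral subtraction, a canonical signal-theoretic technique for reducing an additive-noise footprint: the noise spectrum is estimated from speech-free frames and subtracted from the noisy magnitude spectrum. This is a minimal NumPy sketch under illustrative assumptions (the leading frames are taken to be noise-only, and the frame length, hop, and flooring constant are arbitrary defaults), not the specific formulation evaluated in this chapter.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, frame_len=512, hop=256, floor=0.002):
    """Magnitude spectral subtraction (illustrative sketch).

    Assumes the first `noise_frames` frames of `noisy` contain noise only.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    # Frame the signal with 50% overlap and window each frame
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading noise-only frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise estimate; floor the result to keep magnitudes positive
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Resynthesize with the noisy phase via overlap-add
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean[i]
    return out
```

Reusing the noisy phase and flooring the subtracted magnitude are the usual guards against negative spectral values and the "musical noise" artifacts they cause.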

In this chapter, we present prominent robust feature-extraction strategies explored by the speech recognition research community, and we discuss their relevance to coping with data-mismatch problems in DNN-based acoustic modeling. We present results demonstrating the efficacy of robust features in the new paradigm of DNN acoustic models, and we discuss future directions in feature design for making speech recognition systems more robust to unseen acoustic conditions. Note that the approaches discussed in this chapter focus primarily on single-channel data.
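As one illustration of the perception-inspired strategies surveyed here, the sketch below builds a triangular filterbank spaced on the mel scale, a classic psychoacoustic pitch scale, and computes log filterbank energies from a single frame's power spectrum. The filter count, FFT size, sampling rate, and frequency range are illustrative assumptions rather than values prescribed by the chapter.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

# Usage: log mel energies for one windowed frame (here, random data)
frame = np.random.randn(512) * np.hanning(512)
power = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log(mel_filterbank() @ power + 1e-10)
```

Warping the frequency axis toward the resolution of human hearing is the same design principle behind other auditory-motivated front-ends, such as perceptual linear prediction and gammatone filterbanks.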


Author information

Correspondence to Vikramjit Mitra.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Mitra, V. et al. (2017). Robust Features in Deep-Learning-Based Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_8

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0
