Abstract
Recent progress in deep learning has revolutionized speech recognition research, with Deep Neural Networks (DNNs) becoming the new state of the art for acoustic modeling. DNNs offer significantly lower speech recognition error rates than the previously dominant Gaussian Mixture Models (GMMs). Unfortunately, DNNs are data-sensitive, and unseen data conditions can deteriorate their performance. Acoustic distortions such as noise, reverberation, and channel differences add variation to the speech signal, which in turn degrades DNN acoustic model performance. A straightforward solution to this issue is training the DNN models with these types of variation, which typically yields quite impressive performance. However, anticipating such variation is not always possible; in these cases, DNN recognition performance can deteriorate quite sharply. To avoid subjecting acoustic models to such variation, robust features have traditionally been used to create an invariant representation of the acoustic space. Most commonly, robust feature-extraction strategies have explored three principal areas: (a) enhancing the speech signal, with a goal of improving the perceptual quality of speech; (b) reducing the distortion footprint, with signal-theoretic techniques used to learn the distortion characteristics and subsequently filter them out of the speech signal; and finally (c) leveraging knowledge from auditory neuroscience and psychoacoustics, by using robust features inspired by auditory perception.
In this chapter, we present prominent robust feature-extraction strategies explored by the speech recognition research community, and we discuss their relevance to coping with data-mismatch problems in DNN-based acoustic modeling. We present results demonstrating the efficacy of robust features in the new paradigm of DNN acoustic models, and we discuss future directions in feature design for making speech recognition systems more robust to unseen acoustic conditions. Note that the approaches discussed in this chapter focus primarily on single-channel data.
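To make strategy (b) concrete, the classic signal-theoretic approach is magnitude spectral subtraction: estimate the noise magnitude spectrum from noise-only audio, subtract it from each frame of the noisy signal, and resynthesize with the noisy phase. The sketch below is a minimal, illustrative implementation under simplifying assumptions (a stationary noise estimate from a separate noise-only segment, a fixed spectral floor to avoid negative magnitudes); it is not the chapter's own method, and practical systems track the noise adaptively.

```python
import numpy as np

def stft(x, frame_len, hop):
    """Windowed short-time Fourier transform (Hann analysis window)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len, hop, length):
    """Overlap-add inverse STFT with least-squares window normalization."""
    window = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f * window
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def spectral_subtract(noisy, noise_only, frame_len=256, hop=128, floor=0.02):
    """Subtract the average noise magnitude spectrum from each frame,
    keeping the noisy phase. Residual magnitudes are floored so they
    never go negative; the audible artifact of this scheme is the
    well-known 'musical noise'."""
    S = stft(noisy, frame_len, hop)
    # Stationary noise estimate: mean magnitude over noise-only frames.
    N = np.abs(stft(noise_only, frame_len, hop)).mean(axis=0)
    mag = np.maximum(np.abs(S) - N, floor * np.abs(S))
    return istft(mag * np.exp(1j * np.angle(S)), frame_len, hop, len(noisy))
```

As a usage sketch, denoising a tone buried in white noise with an independent noise-only recording as the estimate reduces the mean-squared error against the clean signal, at the cost of some distortion in the signal-bearing bins; this trade-off is exactly why later chapters favor features that are robust by construction rather than enhancement alone.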
© 2017 Springer International Publishing AG
Cite this chapter
Mitra, V. et al. (2017). Robust Features in Deep-Learning-Based Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64679-4
Online ISBN: 978-3-319-64680-0