Robust Features in Deep-Learning-Based Speech Recognition

Abstract

Recent progress in deep learning has revolutionized speech recognition research, with Deep Neural Networks (DNNs) becoming the new state of the art for acoustic modeling. DNNs offer significantly lower speech recognition error rates than the previously used Gaussian Mixture Models (GMMs). Unfortunately, DNNs are data sensitive, and unseen data conditions can deteriorate their performance. Acoustic distortions such as noise, reverberation, and channel differences add variation to the speech signal, which in turn degrades DNN acoustic-model performance. A straightforward solution is to train the DNN models on these types of variation, which typically yields quite impressive performance. However, anticipating such variation is not always possible; in those cases, DNN recognition performance can deteriorate quite sharply. To avoid subjecting acoustic models to such variation, robust features have traditionally been used to create an invariant representation of the acoustic space. Most commonly, robust feature-extraction strategies have explored three principal areas: (a) enhancing the speech signal, with the goal of improving the perceptual quality of speech; (b) reducing the distortion footprint, using signal-theoretic techniques to learn the distortion characteristics and then filter them out of the speech signal; and (c) leveraging knowledge from auditory neuroscience and psychoacoustics, through robust features inspired by auditory perception.
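To make strategy (b) concrete, the sketch below implements classic magnitude spectral subtraction, a canonical signal-theoretic technique for reducing an additive-noise footprint: the noise spectrum is estimated from speech-free frames and subtracted from the noisy magnitude spectrum. This is a minimal NumPy sketch under illustrative assumptions (the leading frames are taken to be noise-only, and the frame length, hop, and flooring constant are arbitrary defaults), not the specific formulation evaluated in this chapter.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, frame_len=512, hop=256, floor=0.002):
    """Magnitude spectral subtraction (illustrative sketch).

    Assumes the first `noise_frames` frames of `noisy` contain noise only.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    # Frame the signal with 50% overlap and window each frame
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude spectrum from the leading noise-only frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise estimate; floor the result to keep magnitudes positive
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Resynthesize with the noisy phase via overlap-add
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean[i]
    return out
```

Reusing the noisy phase and flooring the subtracted magnitude are the usual guards against negative spectral values and the "musical noise" artifacts they cause.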

In this chapter, we present prominent robust feature-extraction strategies explored by the speech recognition research community, and we discuss their relevance to coping with data-mismatch problems in DNN-based acoustic modeling. We present results demonstrating the efficacy of robust features in the new paradigm of DNN acoustic models, and we discuss future directions in feature design for making speech recognition systems more robust to unseen acoustic conditions. Note that the approaches discussed in this chapter focus primarily on single-channel data.
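As one illustration of the perception-inspired strategies surveyed here, the sketch below builds a triangular filterbank spaced on the mel scale, a classic psychoacoustic pitch scale, and computes log filterbank energies from a single frame's power spectrum. The filter count, FFT size, sampling rate, and frequency range are illustrative assumptions rather than values prescribed by the chapter.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular filters equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

# Usage: log mel energies for one windowed frame (here, random data)
frame = np.random.randn(512) * np.hanning(512)
power = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log(mel_filterbank() @ power + 1e-10)
```

Warping the frequency axis toward the resolution of human hearing is the same design principle behind other auditory-motivated front-ends, such as perceptual linear prediction and gammatone filterbanks.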


Author information

Correspondence to Vikramjit Mitra.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Mitra, V. et al. (2017). Robust Features in Deep-Learning-Based Speech Recognition. In: Watanabe, S., Delcroix, M., Metze, F., Hershey, J. (eds) New Era for Robust Speech Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-64680-0_8

  • DOI: https://doi.org/10.1007/978-3-319-64680-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64679-4

  • Online ISBN: 978-3-319-64680-0
