Improving Speech Recognition with Drop-in Replacements for f-Bank Features

  • Conference paper
  • Statistical Language and Speech Processing (SLSP 2019)

Abstract

While a number of learned feature representations have been proposed for speech recognition, f-bank features still often yield the best results. In this paper, we focus on two methods of improving this existing representation. First, the triangular filters can be replaced with Gabor filters, compactly supported filters that better localize events in time, or with psychoacoustically motivated Gammatone filters. Second, rearranging the order of operations in computing filter bank features yields coefficients with better time-frequency resolution. By merely swapping f-banks for these other filter types in modern phone recognizers, we achieved significant reductions in error rates across repeated trials.
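The first modification can be sketched in a few lines: because the magnitude response of a time-domain Gabor filter (a Gaussian-windowed sinusoid) is itself Gaussian, swapping the filter type amounts to replacing the triangular frequency weights of a standard f-bank with Gaussian ones. Below is a minimal NumPy sketch under illustrative assumptions (40 mel-spaced filters, 25 ms frames with a 10 ms hop, bandwidths tied to the spacing between neighboring centers); these are not the paper's exact configuration.

```python
import numpy as np

def mel(f):
    """Hertz to mel scale."""
    return 1127.0 * np.log1p(f / 700.0)

def inv_mel(m):
    """Mel scale back to Hertz."""
    return 700.0 * np.expm1(m / 1127.0)

def gabor_bank(n_filts, n_fft, sr):
    """Bank of Gaussian frequency responses at mel-spaced centers.

    A Gabor filter's magnitude response is Gaussian, so the usual
    triangular weights are simply replaced with Gaussian ones.
    """
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    centers = inv_mel(np.linspace(mel(20.0), mel(sr / 2), n_filts + 2)[1:-1])
    bw = np.gradient(centers)  # bandwidth ~ spacing of adjacent centers
    return np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / bw[:, None]) ** 2)

def fbank_like(signal, sr=16000, n_filts=40, frame_len=400, hop=160, n_fft=512):
    """Log filter-bank features with Gabor (Gaussian) frequency weights."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # short-time power spectrum
    bank = gabor_bank(n_filts, n_fft, sr)             # (n_filts, n_fft//2 + 1)
    return np.log(power @ bank.T + 1e-10)             # (n_frames, n_filts)
```

One second of 16 kHz audio yields a (98, 40) feature matrix, shaped identically to a conventional f-bank output, so the features can be dropped into an existing recognizer front end unchanged.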


Notes

  1. Features: https://github.com/sdrobert/pydrobert-speech
     CNN-CTC: https://github.com/sdrobert/more-or-let
     RNN-HMM: https://github.com/sdrobert/pytorch-kaldi

  2. https://bitbucket.org/mravanelli/pytorch-kaldi-v0.0/src/master/


Acknowledgements

This research was funded by a Canada Graduate Scholarship and a Strategic Project Grant from the Natural Sciences and Engineering Research Council of Canada.

Author information

Correspondence to Sean Robertson.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Robertson, S., Penn, G., Wang, Y. (2019). Improving Speech Recognition with Drop-in Replacements for f-Bank Features. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science, vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_18


  • DOI: https://doi.org/10.1007/978-3-030-31372-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31371-5

  • Online ISBN: 978-3-030-31372-2

  • eBook Packages: Computer Science, Computer Science (R0)
