Abstract
While a number of learned feature representations have been proposed for speech recognition, employing f-bank features often still leads to the best results. In this paper, we focus on two alternative methods of improving this existing representation. First, the triangular filters can be replaced with Gabor filters, compactly supported filters that better localize events in time, or with psychoacoustically motivated Gammatone filters. Second, rearranging the order of operations in computing filter bank features yields coefficients with better time-frequency resolution. By merely swapping f-banks for these other filter types in modern phone recognizers, we achieved significant reductions in error rates across repeated trials.
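To make the drop-in idea concrete, the sketch below builds a small bank of time-domain Gabor filters (complex sinusoids under a Gaussian envelope) with mel-spaced center frequencies, and extracts frame-level log subband energies in place of triangular f-bank features. This is a minimal illustration, not the paper's implementation: the function names, the fixed Gaussian width, and the framing parameters are all assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_filter(center_hz, sigma_s, sr):
    # Complex sinusoid at center_hz under a Gaussian envelope of width
    # sigma_s seconds, truncated at +/- 3 sigma (hence compactly supported).
    t = np.arange(-3 * sigma_s, 3 * sigma_s, 1.0 / sr)
    return np.exp(-0.5 * (t / sigma_s) ** 2) * np.exp(2j * np.pi * center_hz * t)

def gabor_bank_features(x, sr=16000, n_filters=40, frame=400, hop=160):
    """Frame-level log subband energies from a mel-spaced Gabor bank,
    shaped like a conventional f-bank feature matrix (n_frames, n_filters)."""
    centers = mel_to_hz(
        np.linspace(hz_to_mel(100.0), hz_to_mel(sr / 2 - 100.0), n_filters))
    feats = []
    for fc in centers:
        h = gabor_filter(fc, sigma_s=0.005, sr=sr)  # fixed width: an assumption
        y = np.abs(np.convolve(x, h, mode="same")) ** 2  # subband power envelope
        frames = [y[i:i + frame].sum() for i in range(0, len(y) - frame + 1, hop)]
        feats.append(np.log(np.asarray(frames) + 1e-10))
    return np.stack(feats, axis=1)

x = np.random.randn(16000)          # one second of noise at 16 kHz
F = gabor_bank_features(x)
print(F.shape)                      # (98, 40): 98 frames, 40 filters
```

Because the output has the same (frames, filters) shape as standard f-bank features, a matrix like `F` can feed an existing acoustic model unchanged, which is what makes such filters "drop-in" replacements.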
Acknowledgements
This research was funded by a Canada Graduate Scholarship and a Strategic Project Grant from the Natural Sciences and Engineering Research Council of Canada.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Robertson, S., Penn, G., Wang, Y. (2019). Improving Speech Recognition with Drop-in Replacements for f-Bank Features. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science(), vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_18
Print ISBN: 978-3-030-31371-5
Online ISBN: 978-3-030-31372-2