Abstract
While a number of learned feature representations have been proposed for speech recognition, employing f-bank features often still leads to the best results. In this paper, we focus on two alternative methods of improving this existing representation. First, the triangular filters can be replaced with Gabor filters, compactly supported filters that better localize events in time, or with psychoacoustically motivated Gammatone filters. Second, rearranging the order of operations in computing filter bank features yields coefficients with better time-frequency resolution. By merely swapping f-banks for these other filter types in modern phone recognizers, we achieved significant reductions in error rates across repeated trials.
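To make the drop-in idea concrete, the sketch below builds a small bank of time-domain Gabor filters (complex sinusoids under a Gaussian envelope) with mel-spaced center frequencies, and extracts frame-level log subband energies in place of triangular f-bank features. This is a minimal illustration, not the paper's implementation: the function names, the fixed Gaussian width, and the framing parameters are all assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def gabor_filter(center_hz, sigma_s, sr):
    # Complex sinusoid at center_hz under a Gaussian envelope of width
    # sigma_s seconds, truncated at +/- 3 sigma (hence compactly supported).
    t = np.arange(-3 * sigma_s, 3 * sigma_s, 1.0 / sr)
    return np.exp(-0.5 * (t / sigma_s) ** 2) * np.exp(2j * np.pi * center_hz * t)

def gabor_bank_features(x, sr=16000, n_filters=40, frame=400, hop=160):
    """Frame-level log subband energies from a mel-spaced Gabor bank,
    shaped like a conventional f-bank feature matrix (n_frames, n_filters)."""
    centers = mel_to_hz(
        np.linspace(hz_to_mel(100.0), hz_to_mel(sr / 2 - 100.0), n_filters))
    feats = []
    for fc in centers:
        h = gabor_filter(fc, sigma_s=0.005, sr=sr)  # fixed width: an assumption
        y = np.abs(np.convolve(x, h, mode="same")) ** 2  # subband power envelope
        frames = [y[i:i + frame].sum() for i in range(0, len(y) - frame + 1, hop)]
        feats.append(np.log(np.asarray(frames) + 1e-10))
    return np.stack(feats, axis=1)

x = np.random.randn(16000)          # one second of noise at 16 kHz
F = gabor_bank_features(x)
print(F.shape)                      # (98, 40): 98 frames, 40 filters
```

Because the output has the same (frames, filters) shape as standard f-bank features, a matrix like `F` can feed an existing acoustic model unchanged, which is what makes such filters "drop-in" replacements.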
Acknowledgements
This research was funded by a Canada Graduate Scholarship and a Strategic Project Grant from the Natural Sciences and Engineering Research Council of Canada.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Robertson, S., Penn, G., Wang, Y. (2019). Improving Speech Recognition with Drop-in Replacements for f-Bank Features. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science(), vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_18
Print ISBN: 978-3-030-31371-5
Online ISBN: 978-3-030-31372-2