Abstract
Automatic speech recognition relies on extracting features at fixed intervals. To enrich these features with dynamic (delta) components, discrete derivatives are usually computed and appended as additional features. However, derivative operations tend to be susceptible to noise. Our proposed method alleviates this problem by replacing these derivatives with nearby features selected on a per-frequency basis. In particular, we observed that, at low frequencies, consecutive samples are highly correlated, so more information can be gathered by looking at features farther away in time. We thus propose a strategy to perform this frequency-based selection and evaluate it on the Aurora 2 continuous-digits and connected-digits tasks using the standard MFCC, PLPCC and LPCC features. Our experiments show that the strategy achieves an average relative improvement of \(32.10\,\%\) in accuracy, with most of the gains in very noisy environments, where traditional delta features have low recognition rates.
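As a rough illustration of the contrast drawn in the abstract, the sketch below computes conventional regression-based delta features and, alongside them, a simple temporal-offset alternative that appends the raw neighbouring values of each coefficient at a per-coefficient distance. This is not the selection strategy of the article, which the abstract does not specify: the function names `delta` and `offset_features`, the edge clamping, and the offsets `[3, 1]` (a larger offset for the lower-frequency coefficient) are all assumptions made for the example.

```python
def delta(frames, window=2):
    """Regression-based delta features (the widely used HTK-style formula).

    frames: list of frames, each a list of coefficients.
    Edges are handled by repeating the first/last frame (an assumption).
    """
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(dim):
            num = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                num += n * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def offset_features(frames, offsets):
    """Append, for each coefficient k, its value offsets[k] frames
    before and after the current frame (hypothetical scheme).

    No difference is taken, so no derivative is computed.
    """
    n_frames = len(frames)
    out = []
    for t in range(n_frames):
        row = list(frames[t])
        for k, o in enumerate(offsets):
            row.append(frames[max(t - o, 0)][k])             # past neighbour
            row.append(frames[min(t + o, n_frames - 1)][k])  # future neighbour
        out.append(row)
    return out


# Toy input: 6 frames, 2 coefficients (index 0 stands for the lower band).
frames = [[float(t), float(t * t)] for t in range(6)]
offsets = [3, 1]  # assumed: look farther away for the low-frequency coefficient

d = delta(frames)
f = offset_features(frames, offsets)
```

In this toy setting each frame grows from 2 static coefficients to 6 values, and the low-frequency coefficient draws its context from three frames away rather than from its immediate neighbours.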
Cite this article
Trottier, L., Giguère, P. & Chaib-draa, B. Feature selection for robust automatic speech recognition: a temporal offset approach. Int J Speech Technol 18, 395–404 (2015). https://doi.org/10.1007/s10772-015-9276-6
DOI: https://doi.org/10.1007/s10772-015-9276-6