Abstract
The robustness of an automatic speech recognition (ASR) system depends on the accuracy of feature extraction and classification in the training phase. A mismatch between training and testing conditions during the classification of large feature vectors degrades performance. This paper addresses the robustness of acoustic information for a practical Punjabi dataset. Traditional feature extraction approaches, namely mel frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC), suffer from high variance and leakage of spectral information. In addition, the large number of features becomes difficult to handle for a large speech vocabulary. To overcome these problems, a principal component analysis (PCA) based multi-windowing technique is proposed that incorporates the baseline GFCC and MFCC feature approaches after tuning of the taper parameter. The proposed integrated approaches yield better feature vectors, which are then processed by a differential evolution + hidden Markov model (DE + HMM) based classifier. The integrated approaches perform substantially better at word recognition than conventional or fused feature extraction systems.
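The multi-windowing idea summarized in the abstract can be illustrated with a minimal sketch. It assumes sine tapers with uniform weights and applies PCA directly to the log multitaper spectrum; the abstract does not specify the taper family, the weighting, or where PCA enters the pipeline, so these choices, the function names (`sine_tapers`, `multitaper_power`, `pca_reduce`), and the frame and taper counts are illustrative assumptions only. The mel or gammatone filter bank, the cepstral transform, and the DE + HMM classifier are omitted.

```python
import numpy as np

def sine_tapers(frame_len, n_tapers):
    """Orthonormal sine tapers (an assumed taper family, not taken from the paper)."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def multitaper_power(frames, tapers):
    """Average the periodograms obtained with each taper (uniform weights assumed).

    frames: (n_frames, frame_len), tapers: (n_tapers, frame_len)
    returns: (n_frames, frame_len // 2 + 1)
    """
    tapered = frames[:, None, :] * tapers[None, :, :]
    spectra = np.abs(np.fft.rfft(tapered, axis=-1)) ** 2
    return spectra.mean(axis=1)

def pca_reduce(features, n_components):
    """Project mean-centered features onto their leading principal directions."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for framed speech: 50 frames of 400 samples each.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 400))

tapers = sine_tapers(frame_len=400, n_tapers=6)   # taper count is the tunable parameter
power = multitaper_power(frames, tapers)          # lower-variance spectral estimate
log_spec = np.log(power + 1e-10)                  # small floor avoids log(0)
reduced = pca_reduce(log_spec, n_components=13)   # compact feature vectors
```

Averaging over several orthogonal tapers trades a little frequency resolution for lower variance than a single Hamming-windowed periodogram, and the PCA step shrinks the feature dimension handed to the classifier.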
Cite this article
Kadyan, V., Mantri, A. & Aggarwal, R.K. Improved filter bank on multitaper framework for robust Punjabi-ASR system. Int J Speech Technol 23, 87–100 (2020). https://doi.org/10.1007/s10772-019-09654-1
DOI: https://doi.org/10.1007/s10772-019-09654-1