Abstract
Wavelet-based front-end processing techniques have gained popularity for their noise-removal capability. In this paper, a robust automatic speech recognition system is proposed that exploits a psycho-acoustically motivated, wavelet-based front-end compensator. In the front-end compensator block, a voice activity detector based on voiced speech probability is designed to separate voiced and unvoiced frames and to update the noise statistics. The wavelet packet decomposition tree is designed according to the equivalent rectangular bandwidth (ERB) scale, which is used because the distribution of its center frequencies resembles the frequency response of the human cochlea. Voiced and unvoiced frames are decomposed separately into 24 sub-bands to estimate the average sub-band energy (ASE) of each frame. The ASE is then used to calculate the threshold value. Lastly, Wiener filtering is employed to reduce the residual noise before the final reconstruction stage. The proposed system is evaluated on the TIMIT database under various noise conditions. Its phoneme recognition accuracy is compared with that of different baseline and robust features as well as with existing front-end compensation techniques. Additionally, the proposed front-end compensator is evaluated in terms of phoneme classification accuracy. Performance improvement is observed in all of the above experiments.
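The building blocks named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: all function names are hypothetical, the 8 kHz upper edge assumes 16 kHz sampled speech, the ERB-rate mapping is the common Glasberg and Moore approximation, and soft thresholding and the a priori SNR Wiener gain are generic stand-ins, since the abstract does not give the paper's threshold rule or filter design. A real wavelet packet tree splits bands dyadically, so its band edges would only approximate the ERB spacing computed here.

```python
import numpy as np


def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale (Glasberg-Moore approximation)."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f_hz, dtype=float))


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(erb_rate, dtype=float) / 21.4) - 1.0) / 0.00437


def erb_band_edges(n_bands=24, f_max_hz=8000.0):
    """Return n_bands + 1 edges, equally spaced on the ERB-rate scale.

    Assumption: 24 bands spanning 0 Hz to f_max_hz.
    """
    rates = np.linspace(0.0, hz_to_erb_rate(f_max_hz), n_bands + 1)
    return erb_rate_to_hz(rates)


def average_subband_energy(coeffs):
    """Mean squared value of one sub-band's coefficients for a frame."""
    c = np.asarray(coeffs, dtype=float)
    return float(np.mean(c ** 2))


def soft_threshold(coeffs, threshold):
    """Generic soft thresholding: shrink magnitudes by threshold, zero the rest."""
    c = np.asarray(coeffs, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - threshold, 0.0)


def wiener_gain(a_priori_snr):
    """Textbook Wiener gain xi / (1 + xi) for a linear-scale a priori SNR xi."""
    xi = np.asarray(a_priori_snr, dtype=float)
    return xi / (1.0 + xi)
```

Calling `erb_band_edges()` yields 25 edges whose spacing is narrow at low frequencies and wide at high frequencies, which is the property that motivates an ERB-like decomposition tree.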
About this article
Cite this article
Bhowmick, A., Biswas, A. & Chandra, M. Performance evaluation of psycho-acoustically motivated front-end compensator for TIMIT phone recognition. Pattern Anal Applic 23, 527–539 (2020). https://doi.org/10.1007/s10044-019-00816-0
DOI: https://doi.org/10.1007/s10044-019-00816-0