Abstract
This paper proposes several improvements to front-end feature-extraction algorithms for real-time speech decoding in noisy environments and demonstrates them on the TIMIT speech corpus. Real-Time Voice Activity Detection (RT-VAD) separates the voiced and unvoiced portions of the streaming speech input from silence. Novel techniques for RT-Zero Crossing Detection and RT-Pitch Detection are presented as part of RT-VAD. A real-time approximate Kalman filter is then applied to de-noise the incoming signal. All of these steps operate on a collection of speech frames called a context. Frame-based Linear Discriminant Analysis (LDA) feature extraction is performed using RT-Cepstral Mean and Variance Normalization (RT-CMVN) and RT-Splicing. The algorithms are tested on the TIMIT database at various noise levels. The word-error rate (WER) improves by 5% at 30 dB SNR and by 7% at 10 dB SNR, which supports the proposed algorithms. A comparison with other works also shows a superior Speech Hit Rate (SHR) of 90.6% and Noise Hit Rate (NHR) of 86.2%.
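To make the stages named in the abstract more concrete, the following is a minimal Python sketch of three textbook building blocks: a zero-crossing-rate and energy frame classifier, a running (streaming) mean and variance normalizer, and frame splicing over a context. It is not the paper's novel RT-VAD, RT-CMVN, or RT-Splicing method. All function names, the class name, and the thresholds `energy_floor` and `zcr_voiced_max` are assumptions chosen only for illustration.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / max(len(frame), 1)


def classify_frame(frame, energy_floor=1e-3, zcr_voiced_max=0.3):
    """Classic energy/ZCR rule; both thresholds are illustrative assumptions."""
    if short_time_energy(frame) < energy_floor:
        return "silence"
    if zero_crossing_rate(frame) <= zcr_voiced_max:
        return "voiced"
    return "unvoiced"


class RunningCMVN:
    """Streaming mean/variance normalization using Welford's update."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, feat):
        """Fold one feature vector into the statistics and return it normalized."""
        self.count += 1
        out = []
        for i, x in enumerate(feat):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])
            var = self.m2[i] / self.count
            std = math.sqrt(var) if var > 0 else 1.0  # avoid division by zero
            out.append((x - self.mean[i]) / std)
        return out


def splice(frames, left=1, right=1):
    """Concatenate each frame with its neighbours, repeating the edge frames."""
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for k in range(t - left, t + right + 1):
            context.extend(frames[min(max(k, 0), n - 1)])
        spliced.append(context)
    return spliced
```

In a streaming setting, `RunningCMVN.update` would be called once per incoming feature vector, and `splice` would be applied to the frames of the current context before any LDA projection. The projection itself and the Kalman de-noising step are omitted here.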
Data availability
The study mainly uses open-source TIMIT data, which is cited appropriately.
Code availability
Custom code was used. It will be made available on request.
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
About this article
Cite this article
Raj, P.P. Real-time pre-processing for improved feature extraction of noisy speech. Int J Speech Technol 24, 715–728 (2021). https://doi.org/10.1007/s10772-021-09835-x
DOI: https://doi.org/10.1007/s10772-021-09835-x