
Real-time pre-processing for improved feature extraction of noisy speech

Published in the International Journal of Speech Technology.

Abstract

Several algorithmic improvements for front-end feature extraction in real-time decoding of noisy speech are proposed and demonstrated on the TIMIT speech corpus. Real-Time Voice Activity Detection (RT-VAD) separates the voiced and unvoiced parts of the streaming input from silence. Novel techniques for RT-Zero Crossing Detection and RT-Pitch Detection are presented as part of RT-VAD. A real-time approximate Kalman filter is then applied to de-noise the incoming signal. All of these operate across a collection of speech frames called a context. Frame-based Linear Discriminant Analysis (LDA) feature extraction is carried out using RT-Cepstral Mean and Variance Normalization (RT-CMVN) and RT-Splicing. The algorithms are tested on the TIMIT database at various noise levels, yielding word-error-rate (WER) improvements of 5% at 30 dB SNR and 7% at 10 dB SNR, which validates the proposed algorithms. Comparison with other works also shows a superior Speech Hit Rate (SHR) of 90.6% and Noise Hit Rate (NHR) of 86.2%.
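The front end described above builds on frame-level zero-crossing rate, short-time energy, and cepstral mean and variance normalization. As a minimal sketch of those generic building blocks (not the authors' RT implementation; the 16 kHz sampling rate, frame sizes, and thresholds are illustrative assumptions), the frame-based computations might look like:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i*hop : i*hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame; high ZCR is a cue for unvoiced speech."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames**2, axis=1)

def simple_vad(frames, energy_thr=1e-4, zcr_thr=0.25):
    """Flag a frame as speech if it is energetic (voiced), or if it has
    moderate energy and a high zero-crossing rate (unvoiced)."""
    e = short_time_energy(frames)
    z = zero_crossing_rate(frames)
    return (e > energy_thr) | ((z > zcr_thr) & (e > 0.1 * energy_thr))

def cmvn(features):
    """Cepstral mean and variance normalization over the whole utterance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8
    return (features - mu) / sigma
```

A truly real-time variant, as the RT-CMVN in the paper implies, would replace the utterance-level mean and variance with running estimates updated frame by frame over the current context.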



Data availability

The open-source TIMIT corpus, cited in the references, is primarily used.

Code availability

Custom code was used; it will be made available on request.

References

  • Abbasian, H., Nasersharif, B., Akbari, A., Rahmani, M., & Moin, M. (2008). Optimized linear discriminant analysis for extracting robust speech features. In 2008 3rd International symposium on communications, control and signal processing (pp. 819–824). IEEE.

  • ANSI. (1994). American National Standard Acoustical Terminology. In ANSI S1, 1994 (Vol. 1).

  • Bachu, R., Kopparthi, S., Adapa, B., & Barkana, B. (2008). Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal. In American Society for Engineering Education (ASEE) zone conference proceedings (pp. 1–7).

  • Bocklet, T., & Marek, A. (2018). Cepstral variance normalization for audio feature extraction. US Patent App. 15/528,068.

  • Das, O. (2016). Kalman filter in speech enhancement. Thesis, Jadavpur University.

  • Das, O., Goswami, B., & Ghosh, R. (2016). Application of the tuned Kalman filter in speech enhancement. In 2016 IEEE first international conference on control, measurement and instrumentation (CMI) (pp. 62–66). IEEE.

  • Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366.

  • Delaney, B., Jayant, N., Hans, M., Simunic, T., & Acquaviva, A. (2002). A low-power, fixed-point, front-end feature extraction for a distributed speech recognition system. In 2002 IEEE international conference on acoustics, speech, and signal processing (Vol. 1, pp. I-793). IEEE.

  • Dionelis, N., & Brookes, M. (2018). Speech enhancement using Kalman filtering in the logarithmic bark power spectral domain. In 2018 26th European signal processing conference (EUSIPCO) (pp. 1642–1646). IEEE.

  • Erdogan, H. (2005). Regularizing linear discriminant analysis for speech recognition. In Ninth European conference on speech communication and technology.

  • Estévez, P., Becerra-Yoma, N., Boric, N., & Ramírez, J. (2005). Genetic programming-based voice activity detection. Electronics Letters, 41(20), 1141–1143.

  • Fujimoto, M., & Ariki, Y. (2000). Noisy speech recognition using noise reduction method based on Kalman filter. In 2000 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 00CH37100) (Vol. 3, pp. 1727–1730). IEEE.

  • Gabrea, M. (2003). Kalman filter based single microphone noise canceller. In International workshop on acoustic echo and noise control.

  • Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., Dahlgren, N. L., & Zue, V. (1993). TIMIT acoustic–phonetic continuous speech corpus LDC93S1. Linguistic Data Consortium.

  • Ghahabi, O., Zhou, W., & Fischer, V. (2018). A robust voice activity detection for real-time automatic speech recognition. In Proceedings of ESSV.

  • Goh, Y. H., Raveendran, P., & Goh, Y. L. (2015). Robust speech recognition system using bidirectional Kalman filter. IET Signal Processing, 9(6), 491–497.

  • Guo, J., Sainath, T. N., & Weiss, R. J. (2019). A spelling correction model for end-to-end speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5651–5655). IEEE.

  • Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 35–45.

  • Lacey, T. (1997). The Kalman filter. Tutorial from the notes of the CS7322 course at Georgia Tech; notes taken from the TINA Algorithms' Guide by N. Thacker, Electronic Systems Group, University of Sheffield.

  • Lin, Z. Q., Chung, A. G., & Wong, A. (2018). EdgeSpeechNets: Highly efficient deep neural networks for speech recognition on the edge. arXiv preprint arXiv:1810.08559.

  • MATLAB. (2019). Version 9.7.0.1471314 (R2019b) Update 7. The MathWorks, Inc.

  • Mathe, M., Nandyala, S. P., & Kumar, T. K. (2012). Speech enhancement using Kalman filter for white, random and color noise. In 2012 International conference on devices, circuits and systems (ICDCS) (pp. 195–198). IEEE.

  • Meoni, G., Pilato, L., & Fanucci, L. (2018). A low power voice activity detector for portable applications. In 2018 14th conference on Ph.D. research in microelectronics and electronics (PRIME) (pp. 41–44). IEEE.

  • Moattar, M. H., & Homayounpour, M. M. (2009). A simple but efficient real-time voice activity detection algorithm. In 2009 17th European signal processing conference (pp. 2549–2553). IEEE.

  • Mohan, M. S., Naik, N., Gemson, R., & Ananthasayanam, M. (2015). Introduction to the Kalman filter and tuning its statistics for near optimal estimates and Cramér–Rao bound. arXiv preprint arXiv:1503.04313.

  • Nguyen, T. S., Sperber, M., Stüker, S., & Waibel, A. (2018). Building real-time speech recognition without CMVN. In International conference on speech and computer (pp. 451–460). Springer.

  • Paliwal, K., & Basu, A. (1987). A speech enhancement method based on Kalman filtering. In ICASSP’87. IEEE international conference on acoustics, speech, and signal processing (Vol. 12, pp. 177–180). IEEE.

  • Perkins, K., & Meeker, M. (2017). Internet trends 2017. https://www.slideshare.net/kleinerperkins/internet-trends-2017-report.

  • Price, M., Chandrakasan, A., & Glass, J. R. (2016). Memory-efficient modeling and search techniques for hardware ASR decoders. In Interspeech (pp. 1893–1897).

  • Price, M., Glass, J., & Chandrakasan, A. P. (2017). A low-power speech recognizer and voice activity detector using deep neural networks. IEEE Journal of Solid-State Circuits, 53(1), 66–75.

  • Pujol, P., Macho, D., & Nadeu, C. (2006). On real-time mean-and-variance normalization of speech recognition features. In 2006 IEEE international conference on acoustics, speech and signal processing Proceedings (Vol. 1, p. I-I). IEEE.

  • Rath, S. P., Povey, D., Veselỳ, K., & Cernockỳ, J. (2013). Improved feature processing for deep neural networks. In Interspeech (pp. 109–113).

  • Rosen, S. M., Fourcin, A., & Moore, B. C. (1981). Voice pitch as an aid to lipreading. Nature, 291(5811), 150.

  • Sárosi, G., Mozsáry, M., Mihajlik, P., & Fegyó, T. (2011). Comparison of feature extraction methods for speech recognition in noise-free and in traffic noise environment. In 2011 6th Conference on speech technology and human–computer dialogue (SpeD) (pp. 1–8). IEEE.

  • Sehgal, A., Saki, F., & Kehtarnavaz, N. (2017). Real-time implementation of voice activity detector on arm embedded processor of smartphones. In 2017 IEEE 26th international symposium on industrial electronics (ISIE) (pp. 1285–1290). IEEE.

  • Sharma, N., & Sardana, S. (2016). A real time speech to text conversion system using bidirectional Kalman filter in MATLAB. In 2016 International conference on advances in computing, communications and informatics (ICACCI) (pp. 2353–2357). IEEE.

  • So, S., & Paliwal, K. K. (2011). Suppressing the influence of additive noise on the Kalman gain for low residual noise speech enhancement. Journal of Speech Communication, 53(3), 355–378.

  • Sohn, J., Kim, N. S., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1), 1–3.

  • Verteletskaya, E., & Sakhnov, K. (2010). Voice activity detection for speech enhancement applications. Acta Polytechnica. https://doi.org/10.14311/1251.

  • Yang, X., Tan, B., Ding, J., Zhang, J., & Gong, J. (2010, June). Comparative study on voice activity detection algorithm. In International conference on electrical and control engineering (pp. 599–602). IEEE.

  • Ying, D., Yan, Y., Dang, J., & Soong, F. K. (2011). Voice activity detection based on an unsupervised learning framework. IEEE Transactions on Audio, Speech, and Language Processing, 19(8), 2624–2633.

Funding

No funding was received for conducting this study.

Author information

Corresponding author

Correspondence to P. P. Raj.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

About this article

Cite this article

Raj, P.P. Real-time pre-processing for improved feature extraction of noisy speech. Int J Speech Technol 24, 715–728 (2021). https://doi.org/10.1007/s10772-021-09835-x
