Abstract
Accurate recognition of noisy speech remains an obstacle to wider application of speech recognition technology. The robustness of a speech recognition system depends heavily on its ability to handle background noise. This work presents a Short-Time Fourier Transform (STFT) filtering technique for the enhancement and recognition of speech signals. Conventionally, STFT filtering has been applied in speech analysis. In this study, a modified STFT with adaptive window width based on the chirp rate, termed ASTFT, is combined with spectrogram features for optimal speech recognition and enhancement. The LibriSpeech ASR corpus is the benchmark dataset for the experiments. The spectrum of the enhanced speech signal is estimated using several spectrogram features to obtain a unit peak amplitude. A priori signal-to-noise ratio (SNR) estimation is performed on the modified-STFT speech signal, which achieves an SNR of 31.86 dB, a level considered to indicate effectively clean speech.
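The two quantities the abstract relies on, a windowed short-time spectrum and an SNR in decibels, can be illustrated with a minimal sketch. This is not the ASTFT method of the paper: the window length here is fixed, whereas ASTFT would adapt it to the chirp rate. The function names (`hann`, `stft_mag`, `snr_db`), the 8 kHz sampling rate, and the 1 kHz test tone are all assumptions made for illustration, and the SNR is computed against a known clean reference rather than by a priori estimation.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_mag(signal, win_len, hop):
    """Magnitude STFT via a naive DFT: win_len // 2 + 1 bins per frame."""
    window = hann(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        # Taper the frame before transforming it.
        seg = [signal[start + i] * window[i] for i in range(win_len)]
        bins = []
        for k in range(win_len // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def snr_db(clean, estimate):
    """SNR in dB, treating (estimate - clean) as the residual noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((e - c) ** 2 for c, e in zip(clean, estimate))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz (assumed values).
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]

# Fixed 64-sample window; an adaptive scheme would vary win_len per frame.
frames = stft_mag(tone, win_len=64, hop=32)
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
peak_hz = peak_bin * fs / 64
```

With a 64-sample window at 8 kHz the bin spacing is 125 Hz, so the tone falls in a single bin. A longer window narrows the bin spacing at the cost of time resolution, which is the trade-off an adaptive window width is meant to manage.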
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Oruh, J., Viriri, S. (2021). Spectral Analysis for Automatic Speech Recognition and Enhancement. In: Renault, É., Boumerdassi, S., Mühlethaler, P. (eds) Machine Learning for Networking. MLN 2020. Lecture Notes in Computer Science, vol 12629. Springer, Cham. https://doi.org/10.1007/978-3-030-70866-5_16
DOI: https://doi.org/10.1007/978-3-030-70866-5_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70865-8
Online ISBN: 978-3-030-70866-5
eBook Packages: Computer Science (R0)