Spectro-temporal Power Spectrum Features for Noise Robust ASR

Riazati Seresht, Hamed; Ahadi, Seyed Mohammad; Seyedin, Sanaz

doi:10.1007/s00034-016-0434-0

Published: 22 November 2016

Volume 36, pages 3222–3242, (2017)
Circuits, Systems, and Signal Processing

Hamed Riazati Seresht ORCID: orcid.org/0000-0002-3849-3061¹,
Seyed Mohammad Ahadi¹ &
Sanaz Seyedin¹

245 Accesses
9 Citations

Abstract

In this paper, we present a new technique for extracting a noise-robust representation of speech signals, called the spectro-temporal power spectrum. The technique applies a simple 2-D filter to the speech spectrogram to highlight the movements of spectral peaks. Because spectral peaks constitute the high-SNR (signal-to-noise ratio) regions of the speech spectrogram, we expect that applying the filter will improve recognition performance. In addition, the 2-D filter encodes the spectro-temporal information around each frequency component into the frequency representation of the speech signal. This information helps the recognizer to better identify the true state to which each frame should be allocated. Experimental results on the Aurora 2 task show that, when combined with cepstral mean and variance normalization, the proposed features improve the error rate by about 40% and 35% on test sets A and B, respectively, in comparison with the baseline system. Further improvement was achieved when the proposed features were extracted from enhanced spectra obtained by applying the advanced front-end routine. Moreover, a phone recognition task evaluated on the TIMIT database showed that the proposed method outperforms the baseline methods. These improvements are obtained with a very simple, easy-to-implement routine, which makes the method suitable for practical systems.
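The general idea of filtering a spectrogram along both time and frequency, followed by mean and variance normalization, can be sketched as below. This is only an illustration under stated assumptions: the abstract does not give the filter coefficients, so the 3x3 averaging kernel is a placeholder and not the authors' filter, and the sampling rate, frame sizes and function names (`power_spectrogram`, `filter_2d`, `cmvn`) are hypothetical choices made for the example. Normalization is applied here to log-spectral values for brevity, whereas the abstract refers to cepstral normalization.

```python
import numpy as np


def power_spectrogram(signal, frame_len=200, hop=80, n_fft=256):
    """Short-time power spectrum, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2


def filter_2d(spec, kernel):
    """Correlate a (time x frequency) array with a small odd-sized 2-D kernel.

    Edges are padded by repeating border values so the output keeps the
    shape of the input.
    """
    kt, kf = kernel.shape
    padded = np.pad(spec, ((kt // 2, kt // 2), (kf // 2, kf // 2)), mode="edge")
    out = np.zeros(spec.shape, dtype=float)
    for dt in range(kt):
        for df in range(kf):
            out += kernel[dt, df] * padded[dt:dt + spec.shape[0],
                                           df:df + spec.shape[1]]
    return out


def cmvn(features, eps=1e-8):
    """Per-dimension mean and variance normalization over all frames."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


# Synthetic one-second test signal at an assumed 8 kHz sampling rate.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 440.0 * t) + 0.1 * rng.standard_normal(t.size)

spec = power_spectrogram(signal)

# Placeholder kernel (assumption): plain 3x3 averaging, NOT the paper's filter.
kernel = np.full((3, 3), 1.0 / 9.0)
filtered = filter_2d(spec, kernel)

# Normalize log-spectral values per frequency bin across the utterance.
normalized = cmvn(np.log(filtered + 1e-10))
```

A kernel that is zero everywhere except for a 1 at its center leaves the spectrogram unchanged, which is a convenient sanity check before trying other coefficients.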

Acknowledgements

This work was in part supported by a grant from the Iran Telecommunication Research Center (ITRC).

Author information

Authors and Affiliations

Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran
Hamed Riazati Seresht, Seyed Mohammad Ahadi & Sanaz Seyedin

Corresponding author

Correspondence to Hamed Riazati Seresht.

About this article

Cite this article

Riazati Seresht, H., Ahadi, S.M. & Seyedin, S. Spectro-temporal Power Spectrum Features for Noise Robust ASR. Circuits Syst Signal Process 36, 3222–3242 (2017). https://doi.org/10.1007/s00034-016-0434-0

Received: 23 June 2015
Revised: 24 September 2016
Accepted: 27 September 2016
Published: 22 November 2016
Issue Date: August 2017
DOI: https://doi.org/10.1007/s00034-016-0434-0
