
Temporal modulation normalization for robust speech feature extraction and recognition

Multimedia Tools and Applications

Abstract

Speech signals are produced by articulatory movements and carry a modulation structure constrained by regular phonetic sequences. This modulation structure encodes most of the information relevant to speech intelligibility and can be used to discriminate speech from noise. In this study, we propose a noise reduction algorithm based on this modulation property. The algorithm involves two steps: temporal modulation contrast normalization and modulation-event-preserving smoothing. The purpose of this processing is to normalize the modulation contrast of clean and noisy speech to the same level and to smooth out modulation artifacts caused by noise interference. Because the proposed method can be used independently for noise reduction, it can also be combined with traditional noise reduction methods to further reduce the effect of noise. We tested the proposed method as a front-end for robust speech recognition on the AURORA-2J data corpus. Two advanced noise reduction methods, the ETSI advanced front-end (AFE) and particle filtering (PF) with minimum mean square error (MMSE) estimation, were used for comparison and in combination. Experimental results showed that, as an independent front-end processor, the proposed method outperforms these advanced methods, and that the combined front-ends consistently improve performance over either method used alone.
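The abstract names the two processing steps but does not give their formulas, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes that contrast normalization can be approximated by per-dimension mean and variance normalization of the feature trajectory over time, and that event-preserving smoothing can be approximated by a one-dimensional bilateral filter along the time axis. The function names `contrast_normalize` and `bilateral_smooth`, and all parameter values, are hypothetical.

```python
import numpy as np


def contrast_normalize(traj, eps=1e-8):
    """Assumed stand-in for temporal modulation contrast normalization.

    traj: float array of shape (n_frames, n_dims), one feature trajectory
    per column. Each column is shifted to zero mean and scaled to unit
    variance over time, so clean and noisy inputs share the same contrast.
    """
    mean = traj.mean(axis=0, keepdims=True)
    std = traj.std(axis=0, keepdims=True)
    return (traj - mean) / (std + eps)


def bilateral_smooth(traj, half_width=4, sigma_t=2.0, sigma_v=0.5):
    """Assumed stand-in for event-preserving smoothing (1-D bilateral filter).

    Each frame is replaced by a weighted average of its temporal neighbours.
    Weights fall off with distance in time (sigma_t) and with difference in
    value (sigma_v), so small fluctuations are averaged out while large
    transitions (modulation events) are left mostly intact.
    """
    n_frames = traj.shape[0]
    offsets = np.arange(-half_width, half_width + 1)
    time_w = np.exp(-0.5 * (offsets / sigma_t) ** 2)
    # Repeat the edge frames so every frame has a full window.
    padded = np.pad(traj, ((half_width, half_width), (0, 0)), mode="edge")
    out = np.empty_like(traj, dtype=float)
    for t in range(n_frames):
        window = padded[t:t + 2 * half_width + 1]  # centred on frame t
        value_w = np.exp(-0.5 * ((window - traj[t]) / sigma_v) ** 2)
        weights = time_w[:, None] * value_w
        out[t] = (weights * window).sum(axis=0) / weights.sum(axis=0)
    return out
```

Under these assumptions, a front-end would call `bilateral_smooth(contrast_normalize(features))` on a matrix of frame-level features before passing them to the recognizer. The bilateral filter is used here only because it is a common edge-preserving smoother; the published method may differ.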



Acknowledgement

This study was supported by the MASTAR project of the Knowledge Creating Communication Research Center, National Institute of Information and Communications Technology.

Author information

Corresponding author

Correspondence to Xugang Lu.

About this article

Cite this article

Lu, X., Matsuda, S., Unoki, M. et al. Temporal modulation normalization for robust speech feature extraction and recognition. Multimed Tools Appl 52, 187–199 (2011). https://doi.org/10.1007/s11042-010-0465-7


  • DOI: https://doi.org/10.1007/s11042-010-0465-7
