ABSTRACT
In this study, we propose a feature extraction method for reverberant speech recognition based on subband temporal envelopes (STEs) and their normalization. The STEs are extracted with a bank of constant-bandwidth band-pass filters, each followed by a Hilbert transform and low-pass filtering. In the normalization step, the modulation spectra (MS) of the subband temporal envelopes of both clean and reverberant speech are normalized to a reference MS calculated from a clean speech data set. The subband temporal envelopes are then restored from the normalized subband MS by the inverse Fourier transform. We tested the proposed method on speech recognition in a reverberant room at different speaker-to-microphone distances (SMDs). The recognition performance of traditional Mel-cepstral coefficients with mean and variance normalization was used as the baseline. Averaged over SMDs from 50 cm to 400 cm, subband temporal envelope processing alone gave a 44.96% relative improvement, and normalization of the subband modulation spectrum gave a further 15.68% relative improvement. In total, the relative improvement was about 53.59%, which was better than that obtained with other temporal filtering and normalization methods.
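The processing chain described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the number of bands, the bandwidth, the envelope cutoff, or the exact normalization rule, so the values below (8 bands of 400 Hz starting at 100 Hz, a 20 Hz envelope low-pass) and all function names are hypothetical. The normalization shown here is one possible reading, in which the magnitude of each subband modulation spectrum is replaced by a reference magnitude averaged over clean, equal-length utterances while the phase is kept.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def subband_envelopes(x, fs, n_bands=8, bandwidth=400.0, f_start=100.0,
                      env_cutoff=20.0):
    """Constant-bandwidth band-pass -> Hilbert envelope -> low-pass.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper. Returns an array of shape (n_bands, len(x)).
    """
    lowpass = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
    envelopes = []
    for k in range(n_bands):
        lo = f_start + k * bandwidth          # every band has the same width
        hi = lo + bandwidth
        bandpass = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(bandpass, x)
        env = np.abs(hilbert(band))           # magnitude of the analytic signal
        envelopes.append(sosfiltfilt(lowpass, env))
    return np.stack(envelopes)


def reference_modulation_spectrum(clean_envelope_list):
    """Average modulation-spectrum magnitude over clean utterances.

    Assumes every entry has the same shape (equal-length utterances).
    """
    mags = [np.abs(np.fft.rfft(e, axis=-1)) for e in clean_envelope_list]
    return np.mean(mags, axis=0)


def normalize_modulation_spectrum(envelopes, ref_mag):
    """Impose the reference magnitude, keep the phase, and invert.

    This magnitude replacement is an assumed normalization rule; the
    restored envelopes come from the inverse Fourier transform.
    """
    spec = np.fft.rfft(envelopes, axis=-1)
    normalized = ref_mag * np.exp(1j * np.angle(spec))
    return np.fft.irfft(normalized, n=envelopes.shape[-1], axis=-1)
```

In use, the reference would be computed once from a clean training set and then applied to the envelopes of every utterance, clean or reverberant, before feature computation. Because the magnitude is replaced outright, the restored envelopes can take small negative values; a real front end would need to handle that, for example by flooring before any logarithm.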
Index Terms
Normalization on the modulation spectrum of the subband temporal envelopes for automatic speech recognition in reverberant environments