Abstract
The development of robust speech recognition systems that maintain high recognition accuracy in difficult, dynamically varying acoustical environments is becoming increasingly important as speech recognition technology becomes an integral part of mobile applications. In a distributed speech recognition (DSR) architecture, the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server: the terminal performs the feature parameter extraction and transmits the resulting features over a data channel to the remote back-end recogniser. For mobile-device applications, DSR offers particular benefits, such as improved recognition performance compared with transmission over the voice channel, and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into a DSR system is required to operate in real time and at the lowest possible computational cost.
In this paper, two innovative front-end processing techniques for noise robust speech recognition are presented and compared: time-domain frame-attenuation (TD-FrAtt) and frequency-domain frame-attenuation (FD-FrAtt). These techniques combine different forms of frame attenuation with an improved spectral subtraction algorithm based on minimum statistics and a mel-cepstrum feature extraction procedure. Tests are performed on the Slovenian SpeechDat II fixed telephone database and the Aurora 2 database using the HTK speech recognition toolkit. The results obtained are especially encouraging for mobile DSR systems with limited memory and processing power.
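To illustrate the kind of front-end processing the abstract refers to, the following is a minimal sketch of spectral subtraction with a minimum-statistics noise estimate in the spirit of Martin (1994). It is not the authors' exact algorithm: the function names, the window length, and the oversubtraction and spectral-floor parameters are illustrative assumptions.

```python
import numpy as np

def minimum_statistics_noise(power_spec, win=8):
    """Estimate the noise power floor per frequency bin as the minimum
    frame power within a sliding window of `win` past frames
    (minimum-statistics style, no explicit speech/pause detection)."""
    n_frames, _ = power_spec.shape
    noise = np.empty_like(power_spec)
    for t in range(n_frames):
        lo = max(0, t - win + 1)
        noise[t] = power_spec[lo:t + 1].min(axis=0)
    return noise

def spectral_subtraction(frames, over=2.0, floor=0.01):
    """Subtract the estimated noise power spectrum from each windowed
    frame. `over` is an oversubtraction factor; `floor` keeps a small
    fraction of the noisy power as a spectral floor to limit musical
    noise. Returns the enhanced complex spectra (noisy phase kept)."""
    spec = np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1)
    power = np.abs(spec) ** 2
    noise = minimum_statistics_noise(power)
    clean = np.maximum(power - over * noise, floor * power)
    return np.sqrt(clean) * np.exp(1j * np.angle(spec))
```

In a full front-end of this kind, the enhanced spectra would then be passed through a mel filterbank and a DCT to produce the mel-cepstral features transmitted to the back-end recogniser.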
References
Andrassy, B., Vlaj, D., and Beaugeant, C. (2001). Recognition performance of the Siemens front-end with and without frame dropping on the Aurora 2 database. EUROSPEECH 2001 Proceedings. Aalborg, Denmark, pp. 193-196.
Benitez, C., Burget, L., Chen, B., Dupont, S., Garudadri, H., Hermansky, H., Jain, P., Kajarekar, S., Morgan, N., and Sivadas, S. (2001). Robust ASR front-end using spectral-based and discriminant features: Experiments on the Aurora tasks. EUROSPEECH 2001 Proceedings. Aalborg, Denmark, pp. 429-432.
Boll, S.F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2):113-120.
COST 249 SpeechDat SIG (2000). The RefRec Homepage. http://www.telenor.no/fou/prosjekter/taletek/refrec/
Deller, J.R., Proakis, J.G., and Hansen, J.H.L. (1993). Discrete-Time Processing of Speech Signals. New York, USA: Macmillan Publishing Company.
ETSI standard document (2000). Speech processing, transmission and quality aspects (STQ), distributed speech recognition, front-end feature extraction algorithm, compression algorithm. ETSI ES 201 108 v1.1.1 (2000-02). Sophia Antipolis, France.
ETSI-SMG technical specification (1994). European digital cellular telecommunication system (Phase 1): Transmission planning aspects for the speech service in GSM PLMN system. GSM 03.50, version 3.4.0. Sophia Antipolis, France.
Hirsch, H.G. and Pearce, D. (2000). The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA ITRW ASR 2000 Proceedings. Paris, France.
ITU recommendation G.712 (1996). Transmission performance characteristics of pulse code modulation channels. Geneva, Switzerland.
ITU recommendation G.723.1 A (1996). Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s. Annex A: Silence compression scheme. Geneva, Switzerland.
Junqua, J.-C. and Haton, J.-P. (1996). Robustness in Automatic Speech Recognition. Kluwer Academic Publishers. Norwell, Massachusetts, USA.
Kaiser, J. and Kačič, Z. (1997). SpeechDat II Slovenian Database for the Fixed Telephone Network. Maribor, Slovenia: University of Maribor.
Kotnik, B., Rotovnik, T., Kačič, Z., and Horvat, B. (2001a). The design of mobile multimodal communication device-personal navigator. EUROCON 2001 Proceedings, Bratislava, Slovakia, pp. 337-340.
Kotnik, B., Kačič, Z., and Horvat, B. (2001b). A multiconditional robust front-end feature extraction with a noise reduction procedure based on improved spectral subtraction algorithm. EUROSPEECH 2001 Proceedings. Aalborg, Denmark, pp. 197-200.
Leonard, R.G. (1991). A Speaker-Independent Connected-Digit Database. Texas Instruments Inc., Dallas, Texas, USA.
Lindberg, B., Johansen, F.T., Warakagoda, N., Lehtinen, G., Kačič, Z., Zgank, A., Elenius, K., and Salvi, G. (2000). A noise robust multilingual reference recogniser based on SpeechDat II. ICSLP 2000 Proceedings. Beijing, China. Paper No. 01775.
Martin, R. (1994). Spectral subtraction based on minimum statistics. EUSIPCO 1994 Proceedings. Edinburgh, Scotland, UK, pp. 1182-1185.
Oviatt, S. (2000). Multimodal signal processing in naturalistic noisy environments. ICSLP 2000 Proceedings. Beijing, China, pp. 696-699.
Pearce, D. (2000). An overview of the ETSI standards activities for distributed speech recognition front-ends. AVIOS 2000 Proceedings. San Jose, CA, USA.
Van den Heuvel, H., Boves, L., Moreno, A., Omologo, M., Richard, G., and Sanders, E. (2001). Annotation in the SpeechDat projects. International Journal of Speech Technology, 4(2):127-143.
Varga, A.P. and Moore, R.K. (1990). Hidden Markov model decomposition of speech and noise. ICASSP 1990 Proceedings. Albuquerque, New Mexico, USA, pp. 845-848.
Yapanel, U., Hansen, J.H.L., Sarikaya, R., and Pellom, B. (2001). Robust digit recognition in noise: An evaluation using the AURORA Corpus. EUROSPEECH 2001 Proceedings. Aalborg, Denmark, pp. 209-212.
Young, S. (1997). The HTK Book, Version 2.1. Cambridge, UK: Entropic Cambridge Research Laboratory.
Cite this article
Kotnik, B., Vlaj, D. & Horvat, B. Efficient Noise Robust Feature Extraction Algorithms for Distributed Speech Recognition (DSR) Systems. International Journal of Speech Technology 6, 205–219 (2003). https://doi.org/10.1023/A:1023410018862