Signal Processing

Volume 91, Issue 8, August 2011, Pages 2101-2111

Context-adaptive pre-processing scheme for robust speech recognition in fast-varying noise environment

https://doi.org/10.1016/j.sigpro.2011.03.020

Abstract

Based on the observation that dissimilar speech enhancement algorithms perform differently for different types of interference and noise conditions, we propose a context-adaptive speech pre-processing scheme, which performs adaptive selection of the most advantageous speech enhancement algorithm for each condition. The selection process is based on an unsupervised clustering of the acoustic feature space and a subsequent mapping function that identifies the most appropriate speech enhancement channel for each audio input, corresponding to unknown environmental conditions. Experiments performed on the MoveOn motorcycle speech and noise database validate the practical value of the proposed scheme for speech enhancement and demonstrate a significant improvement in speech recognition accuracy, when compared to that of the best performing individual speech enhancement algorithm. This improvement is expressed as an accuracy gain of 3.3% in terms of word recognition rate. The advance offered in the present work reaches beyond the specifics of the present application and can benefit spoken interfaces operating in fast-varying noise environments.

Introduction

Mobile systems, providing a large variety of services and interactions in a continuously changing environment, are nowadays a reality, and some of the most advanced applications have been ported to the mobile world [1]. Activities which traditionally were performed in an office or at home, in a well-controlled environment, have now migrated outdoors, supported by mobile and embedded technologies. This shift results in an increased demand for services that combine efficiency with high comfort and safety in the new environment, especially since parallel activities, such as driving a car or a motorcycle, are performed most of the time. On the road, driver distraction can lead to significant risks; thus, highly efficient human–computer interfaces are required.

In order to meet both comfort and safety requirements, new technologies need to be introduced into the mobile environment, enabling drivers to interact with mobile systems and services in an easy, risk-free way. Driving quality, stress and strain situations, and user acceptance when using speech and manual commands to acquire certain information on the route have previously been studied [2], and the results have shown that, with speech input, the feeling of being distracted from driving is smaller and road safety is improved, especially in the case of complex tasks. Moreover, assessment of user requirements for multimodal interfaces in a car environment has shown that when the car is moving, the system should switch to the "speech-only" interaction mode, as any other safety risks (e.g. driver distraction from the driving task by gesture input or graphical output) must be avoided [3]. Furthermore, the use of graphical or gesture-based interfaces, although possible to some degree when driving a car, is highly limited when driving a motorcycle.

The performance of speech-based interfaces, although reliable enough in controlled environments to support speaker and device independence, degrades substantially in a non-stationary environment [4], reaching its worst in the motorcycle-on-the-move environment. Various factors contribute to the severe degradation of the speech signal in a moving-motorcycle environment, among which are:

  • (i)

    the presence of additive interferences from the acoustic environment, such as rumble noises from road vibrations or from the friction between the tires and the road surface, other mechanical noise from fans, gears, and horns, wind noise, engine noise, surrounding traffic noise, etc.

  • (ii)

    speech signal alteration related to changes in the speaker's voice and speaking style due to task stress, distributed attention, physical efforts, body vibrations, Lombard effect, etc.

In the present work, we focus our attention on compensating for the speech signal degradation caused by the presence of additive interferences from the acoustic environment, which is largely decoupled from the signal alteration due to changes in the cognitive load, the physical stress, and the body vibration of the motorcyclist. Thus, any effort to deal with the alteration of speech due to the condition of the motorcyclist and the physical stress on the rider's body remains beyond the scope of this work.

The accuracy of automatic speech recognition is significantly improved by using suitably trained acoustic models for the speech decoder. To achieve this improvement of the overall speech recognition accuracy, the dataset used for training the acoustic models should include sufficient samples of the various noise scenarios as well as samples from the application domain. For that purpose, various dedicated speech databases, representative of a set of mobile voice-interaction applications, have been designed, recorded, and annotated, starting with the car environment and later extending to the motorcycle one. A European initiative aiming at the development of databases in support of multilingual speech recognition applications in the car environment started in 1998 with the SPEECHDAT-CAR project [5]. The databases developed are designed to include a phonetically balanced corpus to train generic speech recognition systems and an application corpus, providing enough data to adapt speaker-independent recognition systems to the automotive environment. A total of ten European languages are supported, with recordings from at least 300 speakers for each language and seven characteristic environments (low speed, high speed, with audio equipment on, etc.). The CU-Move corpus consists of five domains, including digit strings, route navigation expressions, street and location sentences, phonetically balanced sentences, and a route navigation dialog in a human Wizard-of-Oz-like scenario, considering a total of 500 speakers from the United States of America and a natural conversational interaction [6]. Research on human–computer interaction in the car environment has evolved to the multimodal mode (audio and visual), and an adequate audio–visual corpus has been developed in the AVICAR database [7] using a multi-sensory array of eight microphones and four video cameras. For the motorcycle environment, the SmartWeb motorbike corpus has been designed for a dialog system dealing with open domains [8]. Recently, a domain-specific database (operations of the motorcycle police force), dealing with the extreme conditions of the motorcycle environment, has been developed in the MoveOn project [9]. In the latter, the focus is on the domain specificity of the moving-motorcycle on-the-road environment, where the cognitive load is quite high and the accurate recognition of commands in the context of a template-driven dialog is of high priority.

In addition to the use of dedicated speech databases for adapting the acoustic models of the speech decoders, it has been shown that the addition of noise suppression front-ends contributes to improving speech recognition accuracy. In the early 1990s, the first attempts to perform speech recognition in the car environment were made, starting with combinations of basic hidden Markov model (HMM) recognizers with front-end noise suppression, environmental noise adaptation, and multi-channel concepts [10], [11]. Preliminary speech/noise detection with front-end speech enhancement methods as noise suppression front-ends for robust speech recognition has shown promising results and currently benefits from the suppression of interfering signals by using a microphone array, which enables both spatial and temporal measurements [12]. The advantages of multi-channel speech enhancement can be successfully exploited in the car environment, while in the motorcycle environment research focuses on single-channel speech enhancement, since the use of microphone arrays there is impractical.

After more than three decades of advances on the single-channel speech enhancement problem [13], [14], four distinct families of algorithms seem to have predominated in the literature: (i) the spectral subtractive algorithms [15], [16], [17], (ii) the statistical model-based approaches [18], [19], [20], (iii) the signal subspace approaches [21], [22], and (iv) the enhancement approaches based on a special type of filtering [23]. The references illustrating each of the aforementioned groups are indicative, and we do not claim exhaustiveness of the list for each family of algorithms. However, it is important to emphasize that although each of the aforementioned families contains a few algorithms that demonstrate high performance for a specific set of noise conditions, and although one can identify "the best performing algorithm among all families", the speech accuracy gain obtained by this single method might still remain insufficient (and in a way suboptimal) when operation in highly non-stationary and fast-varying noise environments, such as the one associated with the motorcycle-on-the-move environment, is considered. This has been confirmed by recent research [24], which provided evidence that a collaborative speech enhancement scheme outperforms the best individual algorithm.
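
To make the spectral subtractive family concrete, the following minimal sketch illustrates its basic principle; the over-subtraction factor and spectral floor values are illustrative assumptions, not parameters taken from the cited works [15], [16], [17]:

```python
import numpy as np

def spectral_subtraction(frame, noise_psd, alpha=2.0, beta=0.01):
    """Basic magnitude-domain spectral subtraction for a single frame.

    frame     -- time-domain samples of one analysis frame
    noise_psd -- noise power spectrum estimate (e.g. averaged over
                 speech-absent frames), length len(frame)//2 + 1
    alpha     -- over-subtraction factor (illustrative value)
    beta      -- spectral floor, limits musical-noise artifacts
    """
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2
    # Subtract the scaled noise estimate and clamp to a spectral floor.
    clean_power = np.maximum(power - alpha * noise_psd, beta * power)
    # Combine the enhanced magnitude with the noisy phase.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(frame))
```

In practice such processing is applied frame-by-frame with overlap-add, and the noise estimate is updated during speech pauses.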

The research work presented in the following sections has been conducted in order to develop a robust and energy-efficient speech interface in the MoveOn system [25], dedicated to command-and-control applications for police force motorcyclists. For this specific target group, a zero-distraction interaction system is targeted, since for safety reasons the riders are not able to interact through visual/tactile interfaces such as a screen or a button pad. Our research is supported by a dedicated speech database, which has previously been developed [26], and speech recognition accuracy has been improved by using suitably trained acoustic models. Recent research has shown that speech recognition accuracy can be improved further if a collaborative noise reduction scheme is employed for the needs of the speech enhancement process [24], [27]. However, this advance relies on multiple speech enhancement channels that operate in parallel, and thus it was achieved at the cost of significantly increased computational and memory demands and a significantly more complex speech front-end.

In the present work, we address the challenges imposed by the highly non-stationary and fast-varying noise environment in a constructive manner and develop a speech pre-processing scheme that adapts its configuration depending on the audio input. The present contribution builds on the fact that dissimilar speech enhancement algorithms perform differently for dissimilar types of interference and noise conditions [28], and on the idea that the most appropriate speech enhancement algorithm for each environmental condition can be selected dynamically during the run-time operation of the speech front-end, depending on the current audio input. The proposed adaptive scheme automatically selects only one speech enhancement channel among all available ones, and thus alleviates the scalability constraints inherent to earlier designs [24], [27], which assume a number of speech enhancement algorithms operating in parallel on a common input.

Specifically, the adaptive speech pre-processing scheme proposed here is organized as a two-stage process. In the first stage, the parameterized audio input is compared against a number of predefined clusters in the acoustic feature space to generate a new feature vector, which consists of normalized log-likelihoods. The second stage employs a machine learning technique that uses this new feature vector to map the first-stage output to the most appropriate speech enhancement channel among the available ones. In this manner, each input speech utterance is redirected for processing to the most appropriate speech enhancement channel. Provided that a set of dissimilar speech enhancement methods is involved, and each of them offers an advantage in given noise conditions, we expect that the overall accuracy improvement offered by the context-adaptive pre-processing scheme studied here will be higher than the accuracy of the best individual speech enhancement algorithm alone, when highly non-stationary and fast-varying noise environments are considered. To the best of the authors' knowledge, the proposed selective scheme, based on GMM clustering and a subsequent mapping function for the selection of the most appropriate speech enhancement channel, has not previously been studied, and thus constitutes a significant novelty in the manner in which a speech front-end copes with highly non-stationary, fast-varying noise conditions.
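
The two-stage logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn's GaussianMixture stands in for the GMM clustering, an SVM stands in for the mapping function (the paper evaluates several), the unit-norm normalization of the log-likelihoods is an assumption, and the feature vectors and best-channel labels are synthetic placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Hypothetical placeholders: per-utterance acoustic feature vectors and,
# for each training utterance, the index of the enhancement channel that
# yielded the best recognition result when determined offline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 13))       # e.g. 13-dimensional MFCC means
y_channel = rng.integers(0, 4, size=500)   # 4 candidate enhancement channels

# Stage 1: unsupervised GMM clustering of the acoustic feature space.
K = 8  # number of clusters; the paper evaluates several settings
gmm = GaussianMixture(n_components=K, covariance_type="diag",
                      random_state=0).fit(X_train)

def loglik_features(gmm, X):
    """Per-cluster log-likelihoods, normalized (here: to unit norm)."""
    ll = np.column_stack([
        multivariate_normal.logpdf(X, mean=m, cov=np.diag(c))
        for m, c in zip(gmm.means_, gmm.covariances_)
    ])
    return ll / np.linalg.norm(ll, axis=1, keepdims=True)

# Stage 2: mapping function from likelihood space to a channel index.
mapper = SVC().fit(loglik_features(gmm, X_train), y_channel)

def select_channel(x):
    """Route one parameterized utterance to a single enhancement channel."""
    return mapper.predict(loglik_features(gmm, x.reshape(1, -1)))[0]
```

At run time, only the channel returned by select_channel processes the utterance, so a single enhancement algorithm runs per input rather than all channels in parallel.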

The practical usefulness of the proposed adaptive speech pre-processing scheme is investigated with the use of a number of traditional and recently developed speech enhancement algorithms. It is experimentally demonstrated that the proposed scheme contributes to a significant improvement of the speech recognition accuracy when compared to the baseline result (the best individual speech enhancement method used alone), while it needs only a fraction of the computational demands of the earlier design [24]. Thus, the proposed new design significantly reduces the computational demand during operation and improves the energy efficiency of the speech front-end, which in the MoveOn application is part of a wearable solution that operates on battery power. The number and the actual choice of speech enhancement algorithms are application-specific issues and do not affect the operating logic of the proposed context-adaptive speech front-end.

The remainder of this article is organized as follows: in Section 2, we present the context-adaptive pre-processing scheme for speech enhancement. In Section 3, we briefly outline the MoveOn speech and noise database used in the experiments. In Section 4, we briefly outline the speech enhancement algorithms employed, detail the experimental protocol that was followed, and explain the implementation and the settings of the various components. In Section 5, we present and discuss the experimental results, and, finally, in Section 6, we conclude with a brief summary of the work and results.

Section snippets

The context-adaptive speech pre-processing scheme

Fast-varying noise environments, which are typical for a motorcycle on the move and are characterized by the superposition of interferences that vary vastly in both duration and spectral content, are a significant impediment to the use of speech recognition-based interaction services. This is because the speech enhancement process encounters significant difficulties due to the non-stationary interferences, originating from the acceleration and the deceleration of the engine, the vibrations from…

The MoveOn speech and noise database

For the purpose of research and technology development in the MoveOn project, a dedicated speech database was recorded in an environment typical of a motorcycle on the move [9]. Specifically, a group of thirty professional motorcyclists, members of the operational police force of the UK, was recruited. While performing patrolling activities through the streets and suburbs of Birmingham, each participant was asked to repeat a number of domain-specific commands and expressions, or to provide a…

Experimental setup

The context-adaptive speech front-end proposed in Section 2 was evaluated in different experimental setups: single-channel and multiple-channel speech enhancement, different numbers of clusters in the GMM, as well as various channel mapping methods. In the following, we describe in detail the settings of the experimental setup, the mapping algorithms, and the experimental protocol of the present evaluation.
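
As an informal illustration of such an evaluation, the sketch below extends the earlier one (reusing X_train, y_channel, and loglik_features from it); the cluster counts and candidate mapping methods listed are assumptions for illustration, not the exact settings evaluated in the paper:

```python
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid: GMM cluster counts x candidate mapping functions.
for k in (2, 4, 8, 16):
    gmm_k = GaussianMixture(n_components=k, covariance_type="diag",
                            random_state=0).fit(X_train)
    Z = loglik_features(gmm_k, X_train)    # stage-1 features for this K
    for name, clf in [("svm", SVC()),
                      ("knn", KNeighborsClassifier()),
                      ("tree", DecisionTreeClassifier())]:
        acc = cross_val_score(clf, Z, y_channel, cv=5).mean()
        print(f"K={k:2d}  mapper={name:<4s}  selection accuracy={acc:.3f}")
```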

Experimental results

Following the experimental setup described in Section 4, we performed an experimental evaluation of the context-adaptive speech enhancement scheme proposed in Section 2. First, in Section 5.1, we present the evaluation of the performance of the individual speech enhancement methods; afterwards, in Section 5.2, we investigate the performance of the proposed context-adaptive scheme for different settings of the GMM-based clustering and for different implementations of the mapping function. In…

Conclusion

Speech interaction between a motorcycle driver and a spoken dialog system is often required for the needs of professional information support (as in police force operations) or for entertainment (web access, control of music players or other personal devices, etc.). The main difficulties in guaranteeing robust spoken dialog interaction in these conditions are due to both the fast-varying noise environment and the changes in the properties of the speech signal because of the body stress, the…

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments, which significantly improved the quality of this article. The research that led to the results reported here was financially supported by the MoveOn project (IST-2005-034753), co-funded by the European Community under the Sixth Framework Programme (FP6).

References (44)

  • A. Moreno, B. Linderberg, C. Draxler, G. Richard, K. Choukri, S. Euler, J. Allen, SPEECHDAT-CAR: a large speech...
  • J.H.L. Hansen, X. Zhang, M. Akbacak, U. Yapanel, B. Pellom, W. Ward, CU-Move: advances in in-vehicle speech systems for...
  • B. Lee, M. Hasegawa-Johnson, C. Goudeseune, AVICAR: audio–visual speech corpus in a car environment, in: Proceedings of...
  • M. Kaiser, H. Mogele, F. Shiel, Bikers accessing the web: the SmartWeb motorbike corpus, in: Proceedings of the LREC...
  • T. Winkler, T. Kostoulas, R. Adderley, C. Bonkowski, T. Ganchev, J. Kohler, N. Fakotakis, The MoveOn motorcycle speech...
  • J.H.L. Hansen et al., Constrained iterative speech enhancement with application to speech recognition, IEEE Transactions on Audio, Speech and Signal Processing (1991)
  • M. Berouti, R. Schwartz, J. Makhoul, Enhancement of speech corrupted by acoustic noise, in: Proceedings of the IEEE...
  • R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Transactions on Speech and Audio Processing (2001)
  • S. Kamath, P. Loizou, A multi-band spectral subtraction method for enhancing speech corrupted by colored noise, in:...
  • Y. Ephraim et al., Speech enhancement using a minimum mean square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, Signal Processing (1985)
  • P. Loizou, Speech enhancement based on perceptually motivated Bayesian estimators of the speech magnitude spectrum, IEEE Transactions on Speech and Audio Processing (2005)
  • Y. Hu et al., Speech enhancement by wavelet thresholding the multitaper spectrum, IEEE Transactions on Speech and Audio Processing (2004)