Information Sciences

Volume 156, Issues 1–2, 1 November 2003, Pages 55–69
Noise-tolerant speech recognition: the SNN-TA approach

https://doi.org/10.1016/S0020-0255(03)00164-6

Abstract

Neural network learning theory draws a relationship between “learning with noise” and applying a regularization term to the cost function that is minimized during training on clean (non-noisy) data. Regularizers and other robust training techniques are aimed at improving the generalization capabilities of connectionist models by reducing overfitting. In spite of that, the generalization problem is usually overlooked by automatic speech recognition (ASR) practitioners who use hidden Markov models (HMMs) or other standard ASR paradigms. Nonetheless, it is reasonable to expect that an adequate neural network model (due to its universal approximation property and generalization capability), along with a suitable regularizer, can exhibit good recognition performance when noise is added to the test data, even though training is accomplished on clean data. This paper presents applications of a variant of the so-called segmental neural network (SNN), introduced at BBN by Zavaliagkos et al. for rescoring the N-best hypotheses yielded by a standard continuous density HMM (CDHMM). An enhanced connectionist model, called SNN with trainable amplitude of activation functions (SNN-TA), is first used in this paper instead of the CDHMM to perform the recognition of isolated words. A Viterbi-based segmentation, relying on the level-building algorithm, is then introduced; combined with the SNN-TA, it yields a hybrid framework for continuous speech recognition. The proposed paradigm is applied to the recognition of isolated and connected Italian digits under several noisy conditions, outperforming the CDHMMs.

Introduction

Increasing robustness to noise in an automatic speech recognition (ASR) system can be described as a generalization problem [15]: the recognizer is trained on a given corpus (possibly collected in a laboratory, under clean acoustic conditions) and then applied to different, noisy signals, featuring only partially predictable environmental conditions. Techniques that allow for good recognition performance in spite of differences between training and test conditions are sought. State-of-the-art ASR systems usually rely on hidden Markov models (HMMs) [6], [9]. HMMs offer good performance in laboratory tests, but their robustness to noise and to changing acoustic conditions remains a problem far from being solved. In particular, no regularization theory [4] has been developed for HMMs so far. On the contrary, the generalization properties of artificial neural networks (ANNs) are much better understood, and are exploited in regularized ANN training algorithms.
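To fix ideas, regularized ANN training minimizes a cost of the general form (a standard formulation, following e.g. [4]; the regularizer actually used in this paper is the amplitude-based scheme of Section 2):

$$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \nu\,\Omega(\mathbf{w})$$

where E(w) is the data term (e.g., the squared error over the training set), Ω(w) is a complexity penalty (Ω(w) = ‖w‖² for weight decay), and ν ≥ 0 controls the trade-off between the two. Training with noise added to the inputs is known to be approximately equivalent to minimizing such a cost with a Tikhonov-type penalty, which is the “learning with noise” relationship recalled in the Abstract.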

In [11], [13] we emphasized that ANNs were intensively applied to ASR throughout a whole decade, but they basically failed as a general paradigm for ASR, especially with long sequences of acoustic observations (e.g., whole words from a dictionary, or whole sentences), mostly because learning long-term time dependencies with “conventional” connectionist architectures (including recurrent nets) is difficult [3]. A variety of hybrid ANN/HMM systems [2], [5], [13], [14] were indeed introduced in recent years to tackle this problem, attempting to combine the desirable properties of both connectionist and Markovian models.

The research described in the present paper starts by considering one such hybrid system, modifying the scope of the ANN within the paradigm, and applying a novel regularization technique to the connectionist model in order to increase its robustness to a significant extent. The hybrid under consideration is presented in [16]. An ANN called segmental neural network (SNN) is used therein for rescoring the N-best hypotheses of an HMM. The network computes scores on whole segments (sub-sequences) of frames, corresponding to phonemes, according to the segmentation provided by the underlying HMM. In so doing, correlations between nearby frames belonging to the same phoneme are exploited, thus overcoming the usual limitations that follow from the frame-independence assumption made in standard HMMs. In addition, segmental information is expected to reduce noise sensitivity. A sketch of the resulting rescoring scheme is given below.
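To make the rescoring scheme concrete, the following is a minimal sketch. All names here (rescore_nbest, snn_score, the interpolation weight alpha) are illustrative assumptions of this sketch, not of [16], which defines the actual score combination used at BBN.

```python
import math

def rescore_nbest(nbest, snn_score, alpha=0.5):
    """Pick the best hypothesis by mixing HMM and SNN log-scores.

    nbest:     list of (hypothesis, hmm_logscore, segments) triples,
               where 'segments' is the HMM's phone-level segmentation
               of the utterance, given as (segment, phone) pairs.
    snn_score: callable returning a log score for (segment, phone).
    alpha:     interpolation weight (an assumption of this sketch).
    """
    best_hyp, best_score = None, -math.inf
    for hyp, hmm_logscore, segments in nbest:
        # Segment-level SNN scores accumulate over the whole hypothesis.
        snn_logscore = sum(snn_score(seg, phone) for seg, phone in segments)
        combined = alpha * hmm_logscore + (1.0 - alpha) * snn_logscore
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp
```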

When an acoustic segment is fed into the SNN, the latter produces an estimate of the posterior probabilities of the phonemes given that segment. Since segments are made up of a variable number of frames, a “normalization” of the length of the input window is needed in order to feed the fixed-size input layer of the network. A discrete cosine transform (DCT) is applied to each segment, retaining as many coefficients as are needed to fill the input layer; a sketch of this step follows.
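A minimal sketch of this normalization step, assuming a type-II DCT is taken per feature dimension (the function name and the zero-padding of very short segments are choices of this sketch, not specified in the paper):

```python
import numpy as np
from scipy.fftpack import dct

def normalize_segment(segment, n_coeffs):
    """Map a variable-length segment onto a fixed-size input vector.

    segment:  array of shape (n_frames, n_features); n_frames varies.
    n_coeffs: DCT coefficients kept per feature dimension, chosen so
              that n_coeffs * n_features matches the SNN input size.
    """
    # Type-II DCT along the time axis, independently for each feature.
    coeffs = dct(segment, type=2, axis=0, norm='ortho')
    # Retain the first n_coeffs coefficients; zero-pad when the segment
    # has fewer than n_coeffs frames.
    kept = np.zeros((n_coeffs, segment.shape[1]))
    k = min(n_coeffs, coeffs.shape[0])
    kept[:k] = coeffs[:k]
    return kept.flatten()
```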

In this paper the SNN is enhanced to provide robustness against noise by improving its generalization ability via a soft “self-regularization” technique based on the introduction of trainable amplitudes of activation functions (see Section 2). In addition, it is used as a speech recognizer in itself, instead of as a mere rescoring tool for an HMM. We call such a model SNN-TA (segmental neural network with trainable amplitudes). To summarize, the differences between the standard SNN and the SNN-TA are the following: (a) the SNN yields scores for individual phonemes over acoustic segments provided by the HMM, whereas the SNN-TA provides a posterior probability estimate for each “word” of the dictionary to be recognized over the whole input acoustic sequence; (b) in [16] SNNs are trained on a relative entropy criterion, whereas backpropagation of squared errors between target and actual outputs is used herein, along with a gradient-descent algorithm to train the amplitudes; (c) the SNN-TA is said to be “regularized”, since the amplitude training algorithm induces an increase in its noise-tolerance (i.e., in its generalization capabilities).

Section snippets

The regularized SNN-TA

Fig. 1 shows a schematic representation of the architecture of the SNN-TA. In [10] we introduced a novel algorithm to learn the amplitude λ of non-linear activation functions in layered networks, without any assumption on their analytical form f(x); i.e., transfer functions of the form y = λf(x) are considered. The algorithm is applied herein to train the SNN-TA, increasing its learning capabilities and providing a regularization effect that improves its generalization properties and, as a direct…
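As an illustration of the transfer function y = λf(x), here is a minimal single-layer sketch in Python/NumPy with one trainable amplitude per unit, updated by plain gradient descent on the squared error. The class, the sigmoid choice for f, and the learning rate are assumptions of this sketch; the actual algorithm is specified in [10].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class AmplitudeLayer:
    """A layer with transfer function y = lam * f(x), lam trainable."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.lam = np.ones(n_out)        # one amplitude per unit

    def forward(self, x):
        self.x = x
        self.f = sigmoid(x @ self.W + self.b)
        return self.lam * self.f         # y = lam * f(x)

    def backward(self, dE_dy, lr=0.01):
        # y depends linearly on lam, so dE/dlam is the error signal
        # times f; weights and biases follow ordinary backpropagation.
        dE_dlam = dE_dy * self.f
        dE_dnet = dE_dy * self.lam * self.f * (1.0 - self.f)
        dE_dx = dE_dnet @ self.W.T       # error for the layer below
        self.lam -= lr * dE_dlam
        self.W -= lr * np.outer(self.x, dE_dnet)
        self.b -= lr * dE_dnet
        return dE_dx
```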

Experiments

Three experimental setups were designed to evaluate the approach, as well as to compare it with standard acoustic models. In the first setup, clean speech signals are considered, and (real) noise is introduced via a simulator, relying on an additive–convolutive model that allows the signal-to-noise ratio (SNR) to be controlled. Then, isolated digit strings recorded in a real car environment under a variety of conditions are used for the test. Finally, digits extracted via a segmentation…
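For the first setup, the additive part of the noise-injection step can be sketched as follows (the convolutive, channel-filtering part of the simulator is omitted, and the function name is an assumption of this sketch):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise recording into a clean signal at a target SNR (dB)."""
    # Tile or trim the noise to the length of the clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```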

Conclusions

Increasing noise-tolerance in ASR systems means ensuring improved generalization abilities, possibly relying on a proper regularization technique. Standard HMMs, although effective in laboratory tests, completely lack such a regularization theory, and they are actually far from solving the problem of ASR in noisy conditions. ANNs appear to be a promising alternative. Unfortunately, between the end of the eighties and the beginning of the nineties, several attempts to apply ANNs to…

References (16)

  • E. Trentin, Networks with trainable amplitude of activation functions, Neural Networks (2001)
  • E. Trentin et al., A survey of hybrid ANN/HMM models for automatic speech recognition, Neurocomputing (2001)
  • B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, M. Omologo, Speaker independent continuous speech...
  • Y. Bengio, Neural Networks for Speech and Sequence Recognition (1996)
  • Y. Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks (1994)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • H. Bourlard et al., Connectionist Speech Recognition (1994)
  • R. De Mori, Spoken Dialogues with Computers (1998)