Information Sciences

Volume 156, Issues 1–2, 1 November 2003, Pages 55–69
Noise-tolerant speech recognition: the SNN-TA approach

https://doi.org/10.1016/S0020-0255(03)00164-6

Abstract

Neural network learning theory draws a relationship between “learning with noise” and applying a regularization term to the cost function that is minimized during training on clean (non-noisy) data. Regularizers and other robust training techniques are aimed at improving the generalization capabilities of connectionist models by reducing overfitting. In spite of that, the generalization problem is usually overlooked by automatic speech recognition (ASR) practitioners who use hidden Markov models (HMMs) or other standard ASR paradigms. Nonetheless, it is reasonable to expect that an adequate neural network model (due to its universal approximation property and generalization capability), along with a suitable regularizer, can exhibit good recognition performance when noise is added to the test data, even though training is accomplished on clean data. This paper presents applications of a variant of the so-called segmental neural network (SNN), introduced at BBN by Zavaliagkos et al. for rescoring the N-best hypotheses yielded by a standard continuous density HMM (CDHMM). An enhanced connectionist model, called SNN with trainable amplitude of activation functions (SNN-TA), is first used in this paper instead of the CDHMM to perform the recognition of isolated words. A Viterbi-based segmentation, relying on the level-building algorithm, is then introduced; combined with the SNN-TA, it yields a hybrid framework for continuous speech recognition. The proposed paradigm is applied to the recognition of isolated and connected Italian digits under several noisy conditions, outperforming the CDHMMs.

Introduction

Increasing robustness to noise in an automatic speech recognition (ASR) system can be described as a generalization problem [15]: the recognizer is trained on a given corpus (possibly collected in a laboratory, under clean acoustic conditions) and then applied to different, noisy signals, featuring only partially predictable environmental conditions. Techniques that allow for good recognition performance in spite of differences between training and test conditions are sought. State-of-the-art ASR systems usually rely on hidden Markov models (HMMs) [6], [9]. HMMs offer good performance in laboratory tests, but their robustness to noise and to changing acoustic conditions remains a problem far from being solved. In particular, no regularization theory [4] has been developed for HMMs so far. On the contrary, the generalization properties of artificial neural networks (ANNs) are much better understood, and are exploited in regularized ANN training algorithms.
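To fix ideas, regularized ANN training minimizes a cost of the general form (a standard formulation, following e.g. [4]; the regularizer actually used in this paper is the amplitude-based scheme of Section 2):

$$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \nu\,\Omega(\mathbf{w})$$

where E(w) is the data term (e.g., the squared error over the training set), Ω(w) is a complexity penalty (Ω(w) = ‖w‖² for weight decay), and ν ≥ 0 controls the trade-off between the two. Training with noise added to the inputs is known to be approximately equivalent to minimizing such a cost with a Tikhonov-type penalty, which is the “learning with noise” relationship recalled in the Abstract.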

In [11], [13] we emphasized that ANNs were intensively applied to ASR throughout a whole decade, but they basically failed as a general paradigm for ASR, especially with long sequences of acoustic observations (e.g., whole words from a dictionary, or whole sentences), mostly because learning long-term time dependencies with “conventional” connectionist architectures (including recurrent nets) is difficult [3]. A variety of hybrid ANN/HMM systems [2], [5], [13], [14] were indeed introduced in recent years to tackle this problem, attempting to combine the desirable properties of both connectionist and Markovian models.

The research described in the present paper starts by considering one such hybrid system, modifying the scope of the ANN within the paradigm, and applying a novel regularization technique to the connectionist model in order to increase its robustness to a significant extent. The hybrid under consideration is presented in [16]. An ANN called segmental neural network (SNN) is used therein for rescoring the N-best hypotheses of an HMM. The network computes scores on whole segments (sub-sequences) of frames, corresponding to phonemes, according to the segmentation provided by the underlying HMM. In so doing, correlations between nearby frames belonging to the same phoneme are exploited, thus overcoming the usual limitations that follow from the frame-independence assumption made in standard HMMs. In addition, segmental information is expected to reduce noise sensitivity. A sketch of the resulting rescoring scheme is given below.
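To make the rescoring scheme concrete, the following is a minimal sketch. All names here (rescore_nbest, snn_score, the interpolation weight alpha) are illustrative assumptions of this sketch, not of [16], which defines the actual score combination used at BBN.

```python
import math

def rescore_nbest(nbest, snn_score, alpha=0.5):
    """Pick the best hypothesis by mixing HMM and SNN log-scores.

    nbest:     list of (hypothesis, hmm_logscore, segments) triples,
               where 'segments' is the HMM's phone-level segmentation
               of the utterance, given as (segment, phone) pairs.
    snn_score: callable returning a log score for (segment, phone).
    alpha:     interpolation weight (an assumption of this sketch).
    """
    best_hyp, best_score = None, -math.inf
    for hyp, hmm_logscore, segments in nbest:
        # Segment-level SNN scores accumulate over the whole hypothesis.
        snn_logscore = sum(snn_score(seg, phone) for seg, phone in segments)
        combined = alpha * hmm_logscore + (1.0 - alpha) * snn_logscore
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp
```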

When an acoustic segment is fed into the SNN, the latter produces an estimate of the posterior probabilities of the phonemes given that segment. Since segments are made up of a variable number of frames, a “normalization” of the length of the input window is needed in order to feed the fixed-size input layer of the network. A discrete cosine transform (DCT) is applied to each segment, retaining as many coefficients as are needed to fill the input layer; a sketch of this step follows.
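A minimal sketch of this normalization step, assuming a type-II DCT is taken per feature dimension (the function name and the zero-padding of very short segments are choices of this sketch, not specified in the paper):

```python
import numpy as np
from scipy.fftpack import dct

def normalize_segment(segment, n_coeffs):
    """Map a variable-length segment onto a fixed-size input vector.

    segment:  array of shape (n_frames, n_features); n_frames varies.
    n_coeffs: DCT coefficients kept per feature dimension, chosen so
              that n_coeffs * n_features matches the SNN input size.
    """
    # Type-II DCT along the time axis, independently for each feature.
    coeffs = dct(segment, type=2, axis=0, norm='ortho')
    # Retain the first n_coeffs coefficients; zero-pad when the segment
    # has fewer than n_coeffs frames.
    kept = np.zeros((n_coeffs, segment.shape[1]))
    k = min(n_coeffs, coeffs.shape[0])
    kept[:k] = coeffs[:k]
    return kept.flatten()
```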

In this paper the SNN is enhanced to provide robustness against noise by improving its generalization ability via a soft “self-regularization” technique based on the introduction of trainable amplitudes of activation functions (see Section 2). In addition, it is used as a speech recognizer in itself, instead of as a mere rescoring tool for an HMM. We call such a model SNN-TA (segmental neural network with trainable amplitudes). To summarize, the differences between the standard SNN and the SNN-TA are the following: (a) the SNN yields scores for individual phonemes over acoustic segments provided by the HMM, whereas the SNN-TA provides a posterior probability estimate for each “word” of the dictionary to be recognized over the whole input acoustic sequence; (b) in [16] SNNs are trained on a relative entropy criterion, whereas backpropagation of squared errors between target and actual outputs is used herein, along with a gradient-descent algorithm to train the amplitudes; (c) the SNN-TA is said to be “regularized”, since the amplitude training algorithm induces an increase in its noise-tolerance (i.e., in its generalization capabilities).

Section snippets

The regularized SNN-TA

Fig. 1 shows a schematic representation of the architecture of the SNN-TA. In [10] we introduced a novel algorithm to learn the amplitude λ of non-linear activation functions in layered networks, without any assumption on their analytical form f(x); i.e., transfer functions of the form y = λf(x) are considered. The algorithm is applied herein to train the SNN-TA, increasing its learning capabilities and providing a regularization effect that improves its generalization properties and, as a direct…
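As an illustration of the transfer function y = λf(x), here is a minimal single-layer sketch in Python/NumPy with one trainable amplitude per unit, updated by plain gradient descent on the squared error. The class, the sigmoid choice for f, and the learning rate are assumptions of this sketch; the actual algorithm is specified in [10].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class AmplitudeLayer:
    """A layer with transfer function y = lam * f(x), lam trainable."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
        self.lam = np.ones(n_out)        # one amplitude per unit

    def forward(self, x):
        self.x = x
        self.f = sigmoid(x @ self.W + self.b)
        return self.lam * self.f         # y = lam * f(x)

    def backward(self, dE_dy, lr=0.01):
        # y depends linearly on lam, so dE/dlam is the error signal
        # times f; weights and biases follow ordinary backpropagation.
        dE_dlam = dE_dy * self.f
        dE_dnet = dE_dy * self.lam * self.f * (1.0 - self.f)
        dE_dx = dE_dnet @ self.W.T       # error for the layer below
        self.lam -= lr * dE_dlam
        self.W -= lr * np.outer(self.x, dE_dnet)
        self.b -= lr * dE_dnet
        return dE_dx
```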

Experiments

Three experimental setups were designed to evaluate the approach, as well as to compare it with standard acoustic models. In the first setup, clean speech signals are considered, and (real) noise is introduced via a simulator, relying on an additive–convolutive model that allows the signal-to-noise ratio (SNR) to be controlled. Then, isolated digit strings recorded in a real car environment under a variety of conditions are used for the test. Finally, digits extracted via a segmentation…
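For the first setup, the additive part of the noise-injection step can be sketched as follows (the convolutive, channel-filtering part of the simulator is omitted, and the function name is an assumption of this sketch):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a noise recording into a clean signal at a target SNR (dB)."""
    # Tile or trim the noise to the length of the clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```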

Conclusions

Increasing noise-tolerance in ASR systems means ensuring improved generalization abilities, possibly relying on a proper regularization technique. Standard HMMs, although effective in laboratory tests, completely lack such a regularization theory, and they are actually far from solving the problem of ASR in noisy conditions. ANNs appear to be a promising alternative. Unfortunately, between the end of the eighties and the beginning of the nineties, several attempts to apply ANNs to…

References (16)

  • E. Trentin, Networks with trainable amplitude of activation functions, Neural Networks (2001)
  • E. Trentin et al., A survey of hybrid ANN/HMM models for automatic speech recognition, Neurocomputing (2001)
  • B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, M. Omologo, Speaker independent continuous speech...
  • Y. Bengio, Neural Networks for Speech and Sequence Recognition (1996)
  • Y. Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks (1994)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • H. Bourlard et al., Connectionist Speech Recognition (1994)
  • R. De Mori, Spoken Dialogues with Computers (1998)