Noise-tolerant speech recognition: the SNN-TA approach
Introduction
Increasing robustness to noise in an automatic speech recognition (ASR) system can be described as a generalization problem [15]: the recognizer is trained on a given training corpus (possibly collected in a laboratory, under clean acoustic conditions) and then applied to different, noisy signals, whose environmental conditions are only partially predictable. Techniques are sought that allow for good recognition performance in spite of the mismatch between training and test conditions. State-of-the-art ASR systems usually rely on hidden Markov models (HMMs) [6], [9]. HMMs offer good performance in laboratory tests, but their robustness to noise and to changes in acoustic conditions are problems far from solved. In particular, no regularization theory [4] has been developed for HMMs so far. By contrast, the generalization properties of artificial neural networks (ANNs) are much better understood, and they are exploited in regularized ANN training algorithms.
In [11], [13] we emphasized that ANNs were intensively applied to ASR throughout a whole decade, but they basically failed as a general paradigm for ASR, especially with long sequences of acoustic observations (e.g. whole words from a dictionary, or whole sentences), mostly because learning long-term time dependencies with "conventional" connectionist architectures (including recurrent nets) is difficult [3]. A variety of hybrid ANN/HMM systems [2], [5], [13], [14] were indeed introduced in recent years to tackle this problem, attempting to combine the desirable properties of both connectionist and Markovian models.
The research described in the present paper starts by considering one such hybrid system, modifying the scope of the ANN within the paradigm, and applying a novel regularization technique to the connectionist model in order to increase its robustness to a significant extent. The hybrid under consideration is presented in [16]. An ANN called segmental neural network (SNN) is used therein for rescoring the N-best hypotheses of an HMM. The network computes scores on whole segments (sub-sequences) of frames, corresponding to phonemes, according to the segmentation provided by the underlying HMM. In so doing, correlations between nearby frames (belonging to the same phoneme) are exploited, thus overcoming the limitations that follow from the frame-independence assumption made in standard HMMs. In addition, segmental information is expected to reduce sensitivity to noise.
When an acoustic segment is fed into the SNN, the latter produces an estimate of the posterior probabilities of the phonemes given that segment. Since segments are made up of a variable number of frames, a "normalization" of the length of the input window is needed in order to feed the fixed-size input layer of the network. A discrete cosine transform (DCT) is applied to each segment, retaining as many coefficients as are needed to fill the input layer.
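This length-normalization step can be sketched as follows. The sketch below is our own minimal illustration, not the paper's code: it assumes a segment is a matrix of feature frames (e.g. MFCCs), applies a type-II DCT along the time axis, and keeps a fixed number of low-order coefficients per feature, so that segments of any length map to the same input dimensionality.

```python
import numpy as np

def dct_normalize(segment, n_coeffs):
    """Map a variable-length segment (n_frames x n_features) to a
    fixed-size vector: take a type-II DCT along the time axis and
    keep the first n_coeffs coefficients per feature."""
    n_frames, n_feat = segment.shape
    # Type-II DCT basis evaluated over the segment's time axis
    t = (np.arange(n_frames) + 0.5) * np.pi / n_frames
    basis = np.cos(np.outer(np.arange(n_coeffs), t))  # (n_coeffs, n_frames)
    coeffs = basis @ segment                          # (n_coeffs, n_feat)
    return coeffs.flatten()

# Segments of different lengths yield vectors of identical size:
rng = np.random.default_rng(0)
short = rng.normal(size=(7, 13))    # 7 frames of 13 features
long_ = rng.normal(size=(42, 13))   # 42 frames of 13 features
assert dct_normalize(short, 4).shape == dct_normalize(long_, 4).shape == (52,)
```

The resulting fixed-size vector is what feeds the network's input layer, regardless of how many frames the underlying HMM segmentation assigned to the phoneme.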
In this paper the SNN is enhanced to provide robustness against noise by improving its generalization ability via a soft "self-regularization" technique based on the introduction of trainable amplitudes of the activation functions (see Section 2). In addition, it is used as a speech recognizer in its own right, instead of as a mere rescoring tool for an HMM. We call this model SNN-TA (segmental neural network with trainable amplitudes). To summarize, the differences between the standard SNN and the SNN-TA are the following: (a) the SNN yields scores for individual phonemes over acoustic segments provided by the HMM, whereas the SNN-TA provides a posterior probability estimate for each "word" of the dictionary over the whole input acoustic sequence; (b) in [16] SNNs are trained on a relative entropy criterion, whereas backpropagation of squared errors between target and actual outputs is used herein, along with a gradient descent algorithm to train the amplitudes; (c) the SNN-TA is said to be "regularized", since the amplitude training algorithm induces a regularization effect that increases its noise tolerance (i.e., its generalization capability).
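Point (b) can be illustrated with a toy numpy sketch of squared-error backpropagation in which the per-unit amplitudes are updated by gradient descent alongside the weights. This is our own single-layer illustration under simplifying assumptions (one training sample, tanh activations), not the actual SNN-TA; all names are ours. For an output y_j = lam_j * tanh(a_j) and error E = 0.5 * ||y - t||^2, the amplitude gradient is dE/dlam_j = (y_j - t_j) * tanh(a_j).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-layer net with trainable per-unit amplitudes: y_j = lam_j * tanh(a_j)
n_in, n_out = 13, 4
W = rng.normal(scale=0.1, size=(n_out, n_in))
b = np.zeros(n_out)
lam = np.ones(n_out)                 # amplitudes, updated like any other weight

def forward(x):
    a = W @ x + b
    return a, lam * np.tanh(a)

x = rng.normal(size=n_in)            # a (length-normalized) input vector
t = np.array([1.0, 0.0, 0.0, 0.0])   # target output
eta = 0.1

a, y = forward(x)
loss_before = 0.5 * np.sum((y - t) ** 2)

for _ in range(200):
    a, y = forward(x)
    err = y - t                                   # dE/dy
    ta = np.tanh(a)
    grad_lam = err * ta                           # dE/dlam_j = err_j * tanh(a_j)
    delta = err * lam * (1.0 - ta ** 2)           # backprop through lam * tanh
    lam -= eta * grad_lam
    W -= eta * np.outer(delta, x)
    b -= eta * delta

a, y = forward(x)
loss_after = 0.5 * np.sum((y - t) ** 2)
assert loss_after < loss_before
```

The regularization effect described in the paper comes from training the amplitudes themselves; this sketch only shows the mechanics of the joint weight/amplitude update.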
Section snippets
The regularized SNN-TA
Fig. 1 shows a schematic representation of the architecture of the SNN-TA. In [10] we introduced a novel algorithm to learn the amplitude λ of non-linear activation functions in layered networks, without any assumptions on their analytical form f(x), i.e. transfer functions in the form y=λf(x) are considered. The algorithm is applied herein to train the SNN-TA, increasing its learning capabilities and providing a regularization effect that improves its generalization properties and, as a direct
Experiments
Three experimental setups were designed to evaluate the approach, as well as to compare it with standard acoustic models. In the first setup, clean speech signals are considered, and (real) noise is introduced via a simulator relying on an additive-convolutive model that allows control of the signal-to-noise ratio (SNR). Then, isolated digit strings recorded in a real car environment under a variety of conditions are used for the test. Finally, digits extracted via a segmentation
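The additive part of such a noise model amounts to scaling a noise recording so that the resulting SNR matches a chosen value before adding it to the clean signal. The sketch below is our own illustration of that scaling (the convolutive part, i.e. channel filtering, is omitted, and the function name is ours, not the simulator's):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    snr_db, then add it to the clean signal (the additive part of an
    additive-convolutive noise model)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Verify the achieved SNR matches the requested 10 dB:
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
assert abs(achieved - 10.0) < 1e-6
```

Sweeping `snr_db` over a range of values is what lets such a simulator produce controlled test conditions from a single clean corpus.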
Conclusions
Increasing noise tolerance in ASR systems means ensuring improved generalization abilities, possibly by relying on a proper regularization technique. Standard HMMs, although effective in laboratory tests, completely lack such a regularization theory, and they are actually far from solving the problem of ASR in noisy conditions. ANNs appear to be a promising alternative. Unfortunately, between the end of the eighties and the beginning of the nineties, several attempts to apply ANNs to
References (16)
Networks with trainable amplitude of activation functions, Neural Networks (2001)
A survey of hybrid ANN/HMM models for automatic speech recognition, Neurocomputing (2001)
B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, M. Omologo, Speaker independent continuous speech...
Neural Networks for Speech and Sequence Recognition (1996)
Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks (1994)
Neural Networks for Pattern Recognition (1995)
Connectionist speech recognition (1994)
Spoken Dialogues with Computers (1998)
Cited by (5)
Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process, Engineering Applications of Artificial Intelligence (2021)
A Large-Scale Depth-Based Multimodal Audio-Visual Corpus in Mandarin, Proceedings - HPCC/SmartCity/DSS 2018 (2019)
An audio-visual corpus for multimodal automatic speech recognition, Journal of Intelligent Information Systems (2017)
Noise-tolerant inverse analysis models for nondestructive evaluation of transportation infrastructure systems using neural networks, Nondestructive Testing and Evaluation (2013)
From bio-inspired computing to e-Biology, 7th International Conference on Creating, Connecting and Collaborating through Computing - C5 2009 (2009)