Data augmentation and language model adaptation using singular value decomposition

https://doi.org/10.1016/j.patrec.2003.08.008

Abstract

A new augmentation method for counts to be used in language modeling is presented. It is based on word representations in a reduced space obtained with Singular Value Decomposition. A contribution to a count for a linguistic event x is obtained from the counts of observed events smoothed with a function of their distance from x. Experimental results on a spoken dialogue corpus show the performance of the proposed method, combined with maximum a posteriori probability adaptation, in terms of word error rate reduction.

Introduction

Most existing automatic speech recognition (ASR) systems generate word hypotheses with a Language Model (LM), which computes the probability of a sequence of words $W_1^N = w_1, w_2, \ldots, w_n, \ldots, w_N$ as follows:

$$P(W_1^N) = P(w_1)\prod_{n=2}^{N} P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \tag{1}$$

where the sequence $w_1, w_2, \ldots, w_{n-1}$ is called the history $h_n$ of word $w_n$. A word can have many histories, so the generic $j$th history of word $w_n$ will be indicated as $h_n^j$.

Usually, the probabilities appearing on the right-hand side of (1) are estimated from a training corpus. For long histories, it is practically impossible to find enough data even in a very large corpus, so approximations are introduced by clustering all the histories that share the same last one or two words, resulting in the well-known bigram and trigram LMs.
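Explicitly, this clustering amounts to the standard approximations

$$P(w_n \mid h_n) \approx P(w_n \mid w_{n-1}) \quad \text{(bigram)}, \qquad P(w_n \mid h_n) \approx P(w_n \mid w_{n-2}, w_{n-1}) \quad \text{(trigram)}.$$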

When a new application domain is considered, many bigram and trigram probabilities change, and new corpora are required to obtain appropriate LMs. If large corpora are not available for this purpose, an LM can be obtained by adapting existing LMs trained on a large corpus from a more general domain. Various methods for LM adaptation have been proposed; an overview can be found in (Bellegarda, 2001). In general, even if a large training corpus is available, a number of linguistic events modeled by the LM are likely to be absent from the corpus.

Instead of adapting LM parameters, it is possible to perform data augmentation: counts for training are inferred from the available adaptation data, so that LM probabilities are estimated only from the adaptation-data counts, augmented with counts generated by a suitable smoothing/generalization criterion. Data augmentation is intended here as the computation of counts for unseen linguistic events from the available counts of events observed in a limited, application-dependent corpus.

The approach proposed in this letter is based on the conjecture that if a word has been observed in a given context, then semantically similar words are likely to appear in the same context even if this event was not observed in the adaptation corpus.

Semantic similarity between words can be defined through a numerical distance between vectors representing words in a suitable space. Following an approach from Information Retrieval (Berry, 1992; Bellegarda, 1998; Deerwester et al., 1990), such a space can be defined using Singular Value Decomposition (SVD). In this way, the counts of the general-purpose corpus and the counts obtained with adaptation have the same representation.
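As a rough illustration of this representation (a minimal sketch, not the authors' implementation: the toy counts, the reduced dimensionality R, and the U·S scaling convention are all assumptions), a word-by-history count matrix can be factored with SVD and words compared by cosine similarity in the reduced space:

```python
import numpy as np

# Toy word-by-history count matrix P (I words x J histories).
# The paper builds P from counts in a training corpus; the values
# below are invented purely for illustration.
P = np.array([
    [4.0, 0.0, 2.0, 1.0],   # counts of word 0 in histories 0..3
    [3.0, 1.0, 2.0, 0.0],   # counts of word 1
    [0.0, 5.0, 0.0, 3.0],   # counts of word 2
])

# Truncated SVD: P ~ U_R S_R V_R^T with R << min(I, J).
U, s, Vt = np.linalg.svd(P, full_matrices=False)
R = 2                              # assumed reduced dimensionality
word_vecs = U[:, :R] * s[:R]       # row i represents word w_i in the reduced space

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Words with similar count profiles across histories end up close
# in the reduced space.
print(cosine(word_vecs[0], word_vecs[1]))   # similar profiles: high
print(cosine(word_vecs[0], word_vecs[2]))   # dissimilar profiles: lower
```

Scaling the left singular vectors by the singular values is one common convention in LSA-style models; the distance function actually used by the authors is not given in the available text.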

Further improvements can be obtained by performing maximum a posteriori (MAP) probability adaptation on the LM obtained with data augmentation.

Data augmentation

Let $P = \{p_{ij}\}$ be an $I \times J$ matrix whose generic element $p_{ij}$ represents the probability or, simply, the count of observations of word $w_i$ in the context of history $h_j$. Empirical evidence has shown that using counts provides better results than using probabilities, indicating that the normalization introduced by computing probabilities from counts is not effective. Thus, $P$ was built as a matrix of counts. The $i$th row of matrix $P$ is a vector whose $J$ elements are the probabilities or the counts of $w_i$ in each of the $J$ histories.
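The available text breaks off before the augmentation formula; the following sketch is one plausible reading of the abstract's description, in which the count of an unseen event receives contributions from observed events weighted by a function of their distance. The cosine weighting, the similarity threshold, and the restriction to zero cells are assumptions of this sketch:

```python
import numpy as np

def augment_counts(P, word_vecs, sim_threshold=0.8):
    """Infer counts for unseen (word, history) events from observed ones.

    P          -- I x J matrix of observed counts c(w_i, h_j)
    word_vecs  -- I x R matrix of reduced-space word representations
    Weighting a neighbour's counts by cosine similarity is an assumed
    instance of 'smoothing with a function of the distance'; the paper's
    exact smoothing function is not given in the available text.
    """
    unit = word_vecs / (np.linalg.norm(word_vecs, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                  # I x I cosine similarities
    np.fill_diagonal(sim, 0.0)           # a word does not augment itself
    sim[sim < sim_threshold] = 0.0       # keep only close neighbours

    P_aug = P.astype(float)
    unseen = P == 0
    # Each unseen cell (i, j) receives the similarity-weighted sum of the
    # counts of w_i's neighbours observed with the same history h_j.
    P_aug[unseen] = (sim @ P)[unseen]
    return P_aug
```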

Adaptation with maximum a posteriori probability

In (De Mori and Federico, 1999), it is shown that MAP adaptation of LM probabilities can be performed by linear interpolation of the a priori probabilities provided by the general LM and the probabilities obtained from the adaptation corpus. The same idea can be applied to bigram counts. Let $c_g(w_j, w_i)$ and $c_d(w_j, w_i)$ be, respectively, the bigram counts in the general corpus and in the domain adaptation corpus, and let $N_g$ and $N_d$ be, respectively, the sizes of the general corpus and of the domain adaptation corpus.
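The section is truncated before the interpolation formula is stated. Under the scheme described, an adapted bigram count would take a form such as

$$\hat{c}(w_j, w_i) = \lambda\, c_d(w_j, w_i) + (1-\lambda)\,\frac{N_d}{N_g}\, c_g(w_j, w_i), \qquad \lambda \in [0, 1],$$

where the rescaling by $N_d/N_g$ (bringing general-corpus counts to the scale of the domain corpus) and the choice of the interpolation weight $\lambda$ are assumptions of this sketch, not the exact expression of (De Mori and Federico, 1999).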

Conclusions

A simple method has been proposed for obtaining bigram counts of unseen linguistic events by inference from counts of observed events, weighted with a function of similarity between words. When these counts are used to estimate the probabilities of a bigram LM, the approach provides estimates that avoid the use of back-off. Experimental results support the advantage of this approach, showing a tangible word error rate (WER) reduction. Further benefits can be obtained by performing MAP adaptation of the LM obtained with data augmentation.

References (6)

  • Berry, M.W., 1992. Large-scale sparse singular value computations. Int. J. Supercomput. Appl.
  • Bellegarda, J., 1998. Multi-span statistical language modeling for large vocabulary speech recognition. IEEE Trans. Speech Audio Process.
  • Bellegarda, J., 2001. An overview of statistical language model adaptation. In: Proc. ISCA-ITR Workshop on Adaptation...
There are more references available in the full text version of this article.
