Speech Communication

Volume 42, Issue 1, January 2004, Pages 93-108

Statistical language model adaptation: review and perspectives

https://doi.org/10.1016/j.specom.2003.08.002

Abstract

Speech recognition performance is severely affected when the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks differ. The aim of language model adaptation is to exploit specific, albeit limited, knowledge about the recognition task to compensate for this mismatch. More generally, an adaptive language model seeks to maintain an adequate representation of the current task domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This paper presents an overview of the major approaches proposed to address this issue, and offers some perspectives regarding their comparative merits and associated trade-offs.

Introduction

Language modeling plays a pivotal role in automatic speech recognition. It is variously used to constrain the acoustic analysis, guide the search through multiple (partial) text hypotheses, and/or contribute to the determination of the final transcription (Bahl et al., 1983; Jelinek, 1985; Rabiner et al., 1996). Fundamentally, its function is to encapsulate as much as possible of the syntactic, semantic, and pragmatic characteristics of the task considered.

In the search, the successful capture of this information is critical to help determine the most likely sequence of words spoken, because it quantifies which word sequences are acceptable in a given language for a given task, and which are not. In that sense, language modeling can be thought of as a way to impose a collection of constraints on word sequences. Since, generally, many different such sequences can be used to convey the same information, these constraints tend to be statistical in nature (Gorin, 1995). Thus, regularities in natural language are governed by an underlying (unknown) probability distribution on word sequences. The ideal outcome of language modeling, then, would be to derive a good estimate of this distribution.

In the unrestricted case, however, carrying out this task is not feasible: some simplifications are necessary to render the problem tractable. The standard approach is to constrain allowable word sequences to those that can be parsed under the control of a probabilistic context-free grammar (PCFG), a somewhat crude yet well-understood model of natural language (Church, 1987). Unfortunately, because general context-free parsing is cubic in sentence length, at the present time it is simply not practical for any but the most rudimentary applications. The problem is then restricted to a subclass of PCFGs, strongly regular grammars, which can be efficiently mapped into equivalent (weighted) finite state automata with much more attractive computational properties. This has led to a large body of literature exploiting such properties in finite state transducers (Mohri, 2000).

Situations where such stochastic automata are especially easy to deploy include relatively self-contained, constrained vocabulary tasks (Pereira and Riley, 1997). This is often the case, for example, for a typical dialog state in a dialog system. At that point, given one or more input strings, the goal is to reestimate state transition probabilities pertaining only to the input set, so input string matching on a finite automaton is a convenient solution. In dictation and other large vocabulary applications, however, the size and complexity of the task complicates the issue of coverage. Generic stochastic automata indiscriminately accepting variable length sequences become unsuitable. To maintain tractability, attention is further restricted to a subclass of probabilistic regular grammars, stochastic n-gram models. Such models have been most prominently used with n=2 and 3, corresponding to classical statistical bigrams and trigrams (Jelinek, 1985).
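
As a concrete illustration of such counting-based estimation, the sketch below builds maximum-likelihood trigram conditionals from a toy whitespace-tokenized corpus. The sentence markers and corpus are assumptions of this sketch; a real system would add smoothing (e.g. Katz backoff or Kneser-Ney) to handle unseen n-grams.

```python
# Minimal maximum-likelihood trigram estimation from a toy corpus.
# Real systems smooth these counts; this sketch uses raw relative frequencies.
from collections import defaultdict

def train_trigram(corpus):
    """Count trigrams and their bigram contexts; return P(w3 | w1, w2)."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for sentence in corpus:
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(toks) - 2):
            tri[(toks[i], toks[i + 1], toks[i + 2])] += 1
            bi[(toks[i], toks[i + 1])] += 1
    def prob(w1, w2, w3):
        denom = bi.get((w1, w2), 0)
        return tri.get((w1, w2, w3), 0) / denom if denom else 0.0
    return prob

p = train_trigram(["the cat sat", "the cat ran"])
print(p("<s>", "the", "cat"))  # 1.0: "cat" always follows "<s> the"
print(p("the", "cat", "sat"))  # 0.5
```

Note that an unseen trigram here receives probability zero, which is precisely the sparsity problem that smoothing, and ultimately adaptation, must address.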

This leads to the focus of this paper. Statistical n-grams, of course, can also be represented by equivalent (n-gram) finite state automata. In practice, the difference between the stochastic automaton and the original n-gram representation is largely a matter of implementation. Many systems in use today, especially for complex dialog applications, are based on the former (cf., for example, Riccardi and Gorin, 2000; Zue et al., 2000), while the latter is more prevalent amongst transcription systems (see e.g., Adda et al., 1999; Ohtsuki et al., 1999). In what follows, since the discussion is essentially unaffected by implementation details, the terminology “statistical language model” (SLM) will refer to the general concept of a stochastic n-gram.

Natural language is highly variable in several aspects.

First, language evolves as does the world it seeks to describe: contrast the recent surge of the word “proteomics” to the utter demise of “ague” (a burning fever, from Leviticus 26:16, King James translation of the Bible). The effective underlying vocabulary changes constantly over time.

Second, different domains tend to involve relatively disjoint concepts with markedly different word sequence statistics: consider the relevance of “interest rate” to a banking application, versus a general conversation on gaming platforms. A heterogeneous subject matter drastically affects the underlying semantic characteristics of the discourse at topic boundaries.

Third, people naturally adjust their use of the language based on the task at hand: compare the typical syntax employed in formal technical papers to that of casual e-mails, for example. While the overall grammatical infrastructure may remain invariant, syntactic cues generally differ from one task to the next.

And finally, people’s style of discourse may independently vary due to a variety of factors such as socio-economic status, emotional state, etc. This last effect, of course, is even more pronounced in spoken natural language.

As a result of this inherent variability, the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks are quite likely to differ. This is bad news for n-gram modeling, as the performance of any statistical approach always suffers from such mismatch. SLMs have indeed been found to be extremely brittle across domains (Rosenfeld, 2000), and even within domain when training and recognition involve moderately disjoint time periods (Rosenfeld, 1995). The unfortunate outcome is a severe degradation in speech recognition performance compared to the ideal matched situation.

It turns out, for example, that to model casual phone conversation, one is much better off using two million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow–Jones newswire text sees its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period (Rosenfeld, 1996, Rosenfeld, 2000).
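
To make the metric concrete: perplexity is the exponentiated average negative log-probability per word, so “perplexity doubled” means the model is on average as uncertain as if it faced twice as many equally likely word choices at each position. A minimal illustration of the metric itself (not of the cited experiments):

```python
import math

def perplexity(logprobs):
    """Perplexity from per-word log2 probabilities: 2 ** cross-entropy."""
    return 2 ** (-sum(logprobs) / len(logprobs))

# A model assigning every word probability 1/8 has perplexity 8;
# halving every probability (to 1/16) doubles the perplexity to 16.
print(perplexity([math.log2(1 / 8)] * 10))   # 8.0
print(perplexity([math.log2(1 / 16)] * 10))  # 16.0
```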

In addition, linguistic mismatch is known to affect cross-task recognition accuracy much more than acoustic mismatch. For instance, in a cross-task experiment using Broadcast News models to recognize TI-digits, reported by Lefevre et al. (2001), only about 8% of the word error rate increase was due to the acoustic modeling mismatch, while 92% was attributable to the language model mismatch. In a similar experiment involving ATIS, these figures were approximately 2% and 98%, respectively (Lefevre et al., 2001). Analogous trends were observed by Bertoldi et al. (2001) for different tasks in a different language.

The above discussion makes a strong case for SLM adaptation, as a means to reduce the degradation in speech recognition performance observed with a new set of operating conditions (Federico and de Mori, 1999). The various techniques that have been proposed to carry out the adaptation procedure can be broadly classified into three major categories. Where a particular technique falls depends on whether its underlying philosophy is based on: (i) model interpolation, (ii) constraint specification, or (iii) meta-information extraction. The last category refers to knowledge about the recognition task that may not be explicitly observable in the word sequence itself. This includes the underlying discourse topic, general semantic and syntactic information, as well as combinations thereof.

The paper is accordingly organized as follows. The next section poses the adaptation problem and reviews the various ways to gather suitable adaptation data. Section 3 covers interpolation-based approaches, including dynamic cache models. In Section 4, we describe the use of constraints, as typically specified within the maximum entropy framework. Section 5 gives an overview of topic-centered techniques, starting with adaptive mixture n-grams. Alternative integration of semantic knowledge, i.e., triggers and latent semantic analysis, is discussed in Section 6. Section 7 addresses the use of syntactic infrastructure, as implemented in the structured language model, and Section 8 considers the integration of multiple knowledge sources to further increase performance. Finally, in Section 9 we offer some concluding remarks and perspectives on the various trade-offs involved.

Adaptation framework

The general SLM adaptation framework is depicted in Fig. 1. Two text corpora are considered: a (small) adaptation corpus A, relevant to the current recognition task, and a (large) background corpus B, associated with a presumably related but perhaps dated and/or somewhat different task, as discussed above.

Model interpolation

In interpolation-based approaches, the corpus A is used to derive a task-specific (dynamic) SLM, which is then combined with the background (static) SLM. This appealingly simple concept provides fertile ground for experimentation, depending on the level at which the combination is implemented.
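
A minimal sketch of this combination at the probability level, assuming both models expose the same conditional interface; the interpolation weight `lam`, like the toy component models below, is purely illustrative and would in practice be tuned on held-out adaptation data (e.g. by EM):

```python
# Linear interpolation of a task-specific (dynamic) model with the
# background (static) model; lam weights the adaptation corpus A.
def interpolate(p_dynamic, p_static, lam=0.3):
    """P(w|h) = lam * P_A(w|h) + (1 - lam) * P_B(w|h)."""
    def prob(word, context):
        return lam * p_dynamic(word, context) + (1 - lam) * p_static(word, context)
    return prob

# hypothetical component models with fixed conditionals, for illustration
p_task = lambda w, h: {"rate": 0.4, "game": 0.1}.get(w, 0.01)
p_back = lambda w, h: {"rate": 0.1, "game": 0.2}.get(w, 0.01)
p = interpolate(p_task, p_back, lam=0.5)
```

With equal weights, the adapted probability of "rate" is simply the average of the two component estimates (0.25 here).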

Constraint specification

In approaches based on constraint specification, the corpus A is used to extract features that the adapted SLM is constrained to satisfy. This is arguably more powerful than model interpolation, since in this framework a different weight can presumably be assigned to each feature.
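
The canonical fitting procedure in the maximum entropy framework is generalized iterative scaling (Darroch and Ratcliff, 1972). The toy sketch below fits a maximum-entropy unigram distribution to a single binary feature constraint; the vocabulary, feature, and target expectation are invented for illustration, and the 1/C step scaling of full GIS is omitted since each word activates at most one feature here:

```python
import math

def gis(vocab, feats, targets, iters=500):
    """feats: {name: f(word) -> 0/1}; targets: {name: desired expectation}.
    Returns the fitted exponential-family distribution {word: prob}."""
    lam = {f: 0.0 for f in feats}
    p = {}
    for _ in range(iters):
        # current model: p(w) proportional to exp(sum_f lam_f * f(w))
        score = {w: math.exp(sum(lam[f] * feats[f](w) for f in feats))
                 for w in vocab}
        z = sum(score.values())
        p = {w: s / z for w, s in score.items()}
        # scale each weight so the model expectation moves toward the target
        for f in feats:
            model = sum(p[w] * feats[f](w) for w in vocab)
            lam[f] += math.log(targets[f] / model)
    return p

# hypothetical constraint: finance-related words must carry 0.8 of the mass
p = gis(["interest", "rate", "game"],
        {"finance": lambda w: 1.0 if w in ("interest", "rate") else 0.0},
        {"finance": 0.8})
```

The fitted distribution spreads the constrained mass uniformly within the feature (0.4 each for "interest" and "rate") and leaves the rest (0.2) to "game", the maximum-entropy solution under this single constraint.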

Topic information

In approaches exploiting the general topic of the discourse, the corpus A is used to extract information about the underlying subject matter. This information is then used in various ways to improve upon the background model based on semantic classification.
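
One common instantiation is an adaptive mixture of topic-specific n-grams, where the mixture weights are re-estimated from recently observed text. A toy sketch, with invented topic models and an EM-flavored one-step weight update:

```python
# Adaptive topic mixture: P(w|h) is a weighted sum over topic models,
# and the weights track the topic posteriors of the observed words.
def mixture_prob(topic_models, gammas, word, context):
    """P(w|h) = sum_k gamma_k * P_k(w|h)."""
    return sum(g * m(word, context) for g, m in zip(gammas, topic_models))

def update_gammas(topic_models, gammas, word, context):
    """Replace the mixture weights by the posterior topic responsibilities
    of the newly observed word (one EM-flavored adaptation step)."""
    post = [g * m(word, context) for g, m in zip(gammas, topic_models)]
    z = sum(post)
    return [x / z for x in post]

# two hypothetical topics: finance favors "rate", gaming favors "console"
finance = lambda w, h: {"rate": 0.3, "console": 0.01}.get(w, 0.05)
gaming = lambda w, h: {"rate": 0.01, "console": 0.3}.get(w, 0.05)
g = update_gammas([finance, gaming], [0.5, 0.5], "rate", ())
```

After observing "rate", the finance weight dominates, so subsequent predictions lean toward the finance topic model. In practice the update would accumulate posteriors over a window of history rather than a single word.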

Semantic knowledge

Approaches taking advantage of semantic knowledge purport to exploit not just topic information as above, but the entire semantic fabric of the corpus A, so they usually involve a finer level of granularity and/or some sort of dimensionality reduction.
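
As one example of such dimensionality reduction, latent semantic analysis (Deerwester et al., 1990) factors a word-document co-occurrence matrix by truncated SVD and compares words in the resulting low-dimensional space. A toy sketch (the corpus, rank, and raw counts are illustrative; real systems weight the counts, e.g. by entropy or tf-idf):

```python
import numpy as np

# Toy LSA: rank-R SVD of a word-by-document count matrix, then cosine
# similarity between words in the latent space.
docs = ["stock market interest rate", "interest rate bank", "game console game"]
vocab = sorted({w for d in docs for w in d.split()})
W = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
R = 2                          # latent dimensionality, task-dependent
word_vecs = U[:, :R] * S[:R]   # scaled word coordinates

def cos(a, b):
    u, v = word_vecs[vocab.index(a)], word_vecs[vocab.index(b)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "interest" should land closer to "rate" than to "game"
print(cos("interest", "rate") > cos("interest", "game"))  # True
```

Words that never co-occur in a document can still end up close in the latent space when they share co-occurrence patterns, which is precisely the finer-grained semantic generalization sought here.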

Syntactic infrastructure

Approaches leveraging syntactic knowledge make the implicit assumption that the background and recognition tasks share a common grammatical infrastructure, so that grammatical constraints are largely portable from corpus B to corpus A. The background SLM is then used for initial syntactic modeling, and the corpus A to re-estimate the associated parameters.

Multiple sources

In approaches exploiting multiple knowledge sources, the corpus A is used to extract information about different aspects of the mismatch between training and recognition conditions. It stands to reason that, if it is helpful to address a particular type of linguistic mismatch in isolation, performance should be even better with an integrated approach to SLM adaptation.
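
A common integration device is a log-linear combination of the component models, renormalized over the vocabulary. The sketch below is illustrative, with invented component models and weights; in practice the exponents would be trained discriminatively or tuned on held-out data:

```python
import math

def loglinear(models, lams, vocab):
    """P(w|h) proportional to prod_k P_k(w|h) ** lam_k, renormalized
    over the vocabulary so the result is a proper distribution."""
    def prob(word, context):
        unnorm = lambda w: math.exp(sum(l * math.log(m(w, context))
                                        for l, m in zip(lams, models)))
        z = sum(unnorm(w) for w in vocab)
        return unnorm(word) / z
    return prob

# two hypothetical knowledge sources over a two-word vocabulary
m1 = lambda w, h: {"a": 0.5, "b": 0.5}[w]
m2 = lambda w, h: {"a": 0.8, "b": 0.2}[w]
p = loglinear([m1, m2], [1.0, 1.0], ["a", "b"])
```

Unlike linear interpolation, a log-linear combination lets a single confident source veto a hypothesis (a near-zero factor drives the product toward zero), which is often the desired behavior when sources capture complementary constraints.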

Summary

Language model adaptation refers to the process of exploiting specific, albeit limited, knowledge about the recognition task to compensate for any mismatch between training and recognition. More generally, an adaptive language model seeks to maintain an adequate representation of the domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This involves gathering up-to-date information about the current recognition task, whether a priori or

References (75)

  • C. Chelba et al., Structured language modeling, Computer Speech and Language (2000)
  • M. Mohri, Minimization algorithms for sequential transducers, Theor. Comp. Sci. (2000)
  • R. Rosenfeld, A maximum entropy approach to adaptive statistical language modeling
  • Adda, G., Jardino, M., Gauvain, J.L., 1999. Language modeling for broadcast news transcription. In: Proc. 1999 Euro....
  • L.R. Bahl et al., A maximum likelihood approach to continuous speech recognition, IEEE Trans. Pattern Anal. Mach. Intel. (1983)
  • Bellegarda, J.R., 1998a. Exploiting both local and global constraints for multi-span statistical language modeling. In:...
  • J.R. Bellegarda, A multi-span language modeling framework for large vocabulary speech recognition, IEEE Trans. Speech Audio Proc. (1998)
  • J.R. Bellegarda, Large vocabulary speech recognition with multi-span statistical language models, IEEE Trans. Speech Audio Proc. (2000)
  • J.R. Bellegarda, Exploiting latent semantic information in statistical language modeling, Proc. IEEE (2000)
  • Bellegarda, J.R., 2001. A novel approach to the adaptation of latent semantic information. In: Proc. 2001 ISCA Workshop...
  • J.R. Bellegarda et al., Tied mixture continuous parameter modeling for speech recognition, IEEE Trans. Acoust. Speech Signal Process. (1990)
  • Berger, A., Miller, R., 1998. Just-in-time language modelling. In: Proc. 1998 Internat. Conf. Acoust. Speech Signal...
  • Bertoldi, N., Brugnara, F., Cettolo, M., Federico, M., Giuliani, D., 2001. From broadcast news to spontaneous dialogue...
  • Besling, S., Meier, H.G., 1995. Language model speaker adaptation. In: Proc. 1995 Euro. Conf. Speech Comm. Technol.,...
  • Chelba, C., 2001. Portability of syntactic structure for language modeling. In: Proc. 2001 Internat. Conf. Acoust....
  • Chelba, C., Engle, D., Jelinek, F., Jimenez, V., Khudanpur, S., Mangu, L., Printz, H., Ristad, E.S., Rosenfeld, R.,...
  • Chen, L., Huang, T., 1999. An improved MAP method for language model adaptation. In: Proc. 1999 Euro. Conf. Speech...
  • S.F. Chen et al., A survey of smoothing techniques for ME models, IEEE Trans. Speech Audio Proc. (2000)
  • Chen, S.F., Seymore, K., Rosenfeld, R., 1998. Topic adaptation for language modeling using unnormalized exponential...
  • K.W. Church, Phonological Parsing in Speech Recognition (1987)
  • Clarkson, P.R., Robinson, A.J., 1997. Language model adaptation using mixtures and an exponentially decaying cache. In:...
  • Coccaro, N., Jurafsky, D., 1998. Towards better integration of semantic predictors in statistical language modeling....
  • J.N. Darroch et al., Generalized iterative scaling for log-linear models, Ann. Math. Statist. (1972)
  • S. Deerwester et al., Indexing by latent semantic analysis, J. Amer. Soc. Inform. Sci. (1990)
  • S. Della Pietra et al., Inducing features of random fields, IEEE Trans. Pattern Anal. Mach. Intel. (1997)
  • Della Pietra, S., Della Pietra, V., Mercer, R., Roukos, S., 1992. Adaptive language model estimation using minimum...
  • Donnelly, P.G., Smith, F.J., Sicilia, E., Ming, J., 1999. Language modelling with hierarchical domains. In: Proc. 1999...
  • Federico, M., 1996. Bayesian estimation methods for N-gram language model adaptation. In: Proc. 1996 Internat. Conf....
  • Federico, M., 1999. Efficient language model adaptation through MDI estimation. In: Proc. 1999 Euro. Conf. Speech Comm....
  • M. Federico et al.
  • Galescu, L., Allen, J., 2000. Hierarchical statistical language models: experiments on in-domain adaptation. In: Proc....
  • Gildea, D., Hoffman, T., 1999. Topic-based language modeling using EM. In: Proc. 1999 Euro. Conf. Speech Comm....
  • A.L. Gorin, On automated language acquisition, J. Acoust. Soc. Amer. (1995)
  • Gretter, R., Riccardi, G., 2001. On-line learning of language models with word error probability distributions. In:...
  • Hofmann, T., 1999a. Probabilistic latent semantic analysis. In: Proc. Fifteenth Conf. Uncertainty in AI, Stockholm,...
  • T. Hofmann, Probabilistic topic maps: navigating through large text collections
  • R. Iyer et al., Modeling long distance dependencies in language: Topic mixtures versus dynamic cache models, IEEE Trans. Speech Audio Process. (1999)