Statistical language model adaptation: review and perspectives
Introduction
Language modeling plays a pivotal role in automatic speech recognition. It is variously used to constrain the acoustic analysis, guide the search through multiple (partial) text hypotheses, and/or contribute to the determination of the final transcription (Bahl et al., 1983; Jelinek, 1985; Rabiner et al., 1996). Fundamentally, its function is to encapsulate as much as possible of the syntactic, semantic, and pragmatic characteristics of the task considered.
In the search, the successful capture of this information is critical to help determine the most likely sequence of words spoken, because it quantifies which word sequences are acceptable in a given language for a given task, and which are not. In that sense, language modeling can be thought of as a way to impose a collection of constraints on word sequences. Since, generally, many different such sequences can be used to convey the same information, these constraints tend to be statistical in nature (Gorin, 1995). Thus, regularities in natural language are governed by an underlying (unknown) probability distribution on word sequences. The ideal outcome of language modeling, then, would be to derive a good estimate of this distribution.
In the unrestricted case, however, carrying out this task is not feasible: some simplifications are necessary to render the problem tractable. The standard approach is to constrain allowable word sequences to those that can be parsed under the control of a probabilistic context-free grammar (PCFG), a somewhat crude yet well-understood model of natural language (Church, 1987). Unfortunately, because parsing complexity grows nonlinearly with sentence length, at the present time context-free parsing is simply not practical for any but the most rudimentary applications. Attention is therefore restricted to a subclass of PCFGs, the strongly regular grammars, which can be efficiently mapped onto equivalent (weighted) finite state automata with much more attractive computational properties. This has led to a large body of literature exploiting such properties in finite state transducers (Mohri, 2000).
Situations where such stochastic automata are especially easy to deploy include relatively self-contained, constrained-vocabulary tasks (Pereira and Riley, 1997). This is often the case, for example, for a typical dialog state in a dialog system. At that point, given one or more input strings, the goal is to reestimate state transition probabilities pertaining only to the input set, so input string matching on a finite automaton is a convenient solution. In dictation and other large vocabulary applications, however, the size and complexity of the task complicates the issue of coverage, and generic stochastic automata indiscriminately accepting variable-length sequences become unsuitable. To maintain tractability, attention is further restricted to a subclass of probabilistic regular grammars: stochastic n-gram models. Such models have been most prominently used with n=2 and 3, corresponding to classical statistical bigrams and trigrams (Jelinek, 1985).
This leads to the focus of this paper. Statistical n-grams, of course, can also be represented by equivalent (n-gram) finite state automata. In practice, the difference between the stochastic automaton and the original n-gram representation is largely a matter of implementation. Many systems in use today, especially for complex dialog applications, are based on the former (cf., for example, Riccardi and Gorin, 2000; Zue et al., 2000), while the latter is more prevalent amongst transcription systems (see e.g., Adda et al., 1999; Ohtsuki et al., 1999). In what follows, since the discussion is essentially unaffected by implementation details, the terminology “statistical language model” (SLM) will refer to the general concept of a stochastic n-gram.
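To make the concept concrete: a stochastic n-gram assigns each word a probability conditioned on the n−1 preceding words, and for n=2 the maximum-likelihood estimate is simply the ratio of bigram to unigram counts, P(w2 | w1) = c(w1, w2) / c(w1). The sketch below (plain Python, with a toy corpus invented purely for illustration) trains such a bigram model:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w2 | w1) = c(w1, w2) / c(w1)."""
    unigram = defaultdict(int)   # counts of the conditioning word w1
    bigram = defaultdict(int)    # counts of the pair (w1, w2)
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]      # sentence boundary markers
        for w1, w2 in zip(padded, padded[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    def prob(w1, w2):
        return bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0
    return prob

# Toy corpus: "cat" follows "the" in one of its two occurrences,
# so P(cat | the) = 0.5, while every sentence starts with "the".
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = train_bigram(corpus)
```

In any real system these raw relative frequencies are smoothed (e.g., by back-off or discounting), since unseen n-grams would otherwise receive zero probability.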
Natural language is highly variable in several aspects.
First, language evolves as does the world it seeks to describe: contrast the recent surge of the word “proteomics” with the utter demise of “ague” (a burning fever, from Leviticus 26:16, King James translation of the Bible). The effective underlying vocabulary thus changes constantly over time.
Second, different domains tend to involve relatively disjoint concepts with markedly different word sequence statistics: consider the relevance of “interest rate” to a banking application, versus a general conversation on gaming platforms. A heterogeneous subject matter drastically affects the underlying semantic characteristics of the discourse at topic boundaries.
Third, people naturally adjust their use of the language based on the task at hand: compare the typical syntax employed in formal technical papers with that of casual e-mails, for example. While the overall grammatical infrastructure may remain invariant, syntactic cues generally differ from one task to the next.
And finally, people’s style of discourse may independently vary due to a variety of factors such as socio-economic status, emotional state, etc. This last effect, of course, is even more pronounced in spoken natural language.
As a result of this inherent variability, the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks are quite likely to differ. This is bad news for n-gram modeling, as the performance of any statistical approach always suffers from such mismatch. SLMs have indeed been found to be extremely brittle across domains (Rosenfeld, 2000), and even within domain when training and recognition involve moderately disjoint time periods (Rosenfeld, 1995). The unfortunate outcome is a severe degradation in speech recognition performance compared to the ideal matched situation.
It turns out, for example, that to model casual phone conversation, one is much better off using two million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow–Jones newswire text sees its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period (Rosenfeld, 1996; Rosenfeld, 2000).
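Perplexity, the figure of merit behind such comparisons, is the exponential of the average negative log-probability the model assigns to held-out text; a doubling of perplexity means the model is, per word, effectively choosing among twice as many equally likely alternatives. A minimal sketch (the uniform model here is a stand-in for illustration, not taken from the cited experiments):

```python
import math

def perplexity(model, text):
    """Perplexity = exp of the average negative log-probability per token."""
    logs = [math.log(model(w)) for w in text]
    return math.exp(-sum(logs) / len(logs))

# A model that assigns every token probability 1/8 has perplexity 8:
# it is as uncertain as a uniform choice among 8 alternatives per word.
uniform8 = lambda w: 1.0 / 8.0
ppl = perplexity(uniform8, ["any", "four", "word", "text"])
```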
In addition, linguistic mismatch is known to affect cross-task recognition accuracy much more than acoustic mismatch. For instance, results of a cross-task experiment using Broadcast News models to recognize TI-digits, recently reported by Lefevre et al. (2001), show that only about 8% of the word error rate increase was due to the acoustic modeling mismatch, while 92% was attributable to the language model mismatch. In a similar experiment involving ATIS, these figures were approximately 2% and 98%, respectively (Lefevre et al., 2001). Analogous trends were observed by Bertoldi et al. (2001) for different tasks in a different language.
The above discussion makes a strong case for SLM adaptation, as a means to reduce the degradation in speech recognition performance observed with a new set of operating conditions (Federico and de Mori, 1999). The various techniques that have been proposed to carry out the adaptation procedure can be broadly classified into three major categories. Where a particular technique falls depends on whether its underlying philosophy is based on: (i) model interpolation, (ii) constraint specification, or (iii) meta-information extraction. The latter category refers to knowledge about the recognition task which may not be explicitly observable in the word sequence itself. This includes the underlying discourse topic, general semantic and syntactic information, as well as a combination thereof.
The paper is accordingly organized as follows. The next section poses the adaptation problem and reviews the various ways to gather suitable adaptation data. Section 3 covers interpolation-based approaches, including dynamic cache models. In Section 4, we describe the use of constraints, as typically specified within the maximum entropy framework. Section 5 gives an overview of topic-centered techniques, starting with adaptive mixture n-grams. Alternative integration of semantic knowledge, i.e., triggers and latent semantic analysis, is discussed in Section 6. Section 7 addresses the use of syntactic infrastructure, as implemented in the structured language model, and Section 8 considers the integration of multiple knowledge sources to further increase performance. Finally, in Section 9 we offer some concluding remarks and perspectives on the various trade-offs involved.
Adaptation framework
The general SLM adaptation framework is depicted in Fig. 1. Two text corpora are considered: a (small) adaptation corpus A, relevant to the current recognition task, and a (large) background corpus B, associated with a presumably related but perhaps dated and/or somewhat different task, as discussed above.
Model interpolation
In interpolation-based approaches, the corpus A is used to derive a task-specific (dynamic) SLM, which is then combined with the background (static) SLM. This appealingly simple concept provides fertile grounds for experimentation, depending on the level at which the combination is implemented.
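The canonical instance of this idea is linear interpolation, P(w|h) = λ P_A(w|h) + (1−λ) P_B(w|h), where the weight λ is typically estimated on held-out adaptation text by EM. A minimal sketch, with the two component models abstracted as callables and toy probabilities invented purely for illustration:

```python
def em_lambda(p_dyn, p_stat, held_out, iters=50):
    """Estimate the weight lam of lam*p_dyn(w) + (1-lam)*p_stat(w)
    by EM, maximizing the likelihood of held-out tokens."""
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior that each token was generated by the dynamic model
        posts = []
        for w in held_out:
            num = lam * p_dyn(w)
            den = num + (1.0 - lam) * p_stat(w)
            posts.append(num / den if den > 0 else 0.0)
        # M-step: the new weight is the average posterior
        lam = sum(posts) / len(posts)
    return lam

# Hypothetical toy models: the task-specific model fits two of the three
# held-out tokens better, so EM settles on an intermediate weight.
p_dyn = lambda w: {"rate": 0.30, "loan": 0.25}.get(w, 0.01)
p_stat = lambda w: 0.05
lam = em_lambda(p_dyn, p_stat, ["rate", "loan", "the"])
```

In a full system the interpolation weight is usually bucketed, e.g. by history frequency, rather than estimated as a single global constant.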
Constraint specification
In approaches based on constraint specification, the corpus A is used to extract features that the adapted SLM is constrained to satisfy. This is arguably more powerful than model interpolation, since in this framework a different weight can presumably be assigned separately to each feature.
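Under the maximum entropy framework, the adapted model takes the exponential form p(x) ∝ exp(Σ_i λ_i f_i(x)), and the weights λ_i are fitted so that model feature expectations match those observed in the adaptation data, classically via Generalized Iterative Scaling (Darroch and Ratcliff). The sketch below runs GIS on a tiny explicit outcome space; a real SLM would instead define features over words and n-gram histories, so everything here is a toy for illustration:

```python
import math

def gis(outcomes, features, targets, iters=500):
    """Generalized Iterative Scaling: fit p(x) ~ exp(sum_i lam_i*f_i(x))
    so that model feature expectations match the target expectations."""
    # GIS requires a constant total feature mass per outcome;
    # add the standard slack ("correction") feature to ensure it.
    C = max(sum(f(x) for f in features) for x in outcomes)
    feats = features + [lambda x: C - sum(f(x) for f in features)]
    targs = targets + [C - sum(targets)]
    lam = [0.0] * len(feats)
    for _ in range(iters):
        weights = [math.exp(sum(l * f(x) for l, f in zip(lam, feats)))
                   for x in outcomes]
        z = sum(weights)
        p = {x: w / z for x, w in zip(outcomes, weights)}
        for i, f in enumerate(feats):
            e_model = sum(p[x] * f(x) for x in outcomes)
            lam[i] += math.log(targs[i] / e_model) / C  # damped log-ratio update
    weights = [math.exp(sum(l * f(x) for l, f in zip(lam, feats)))
               for x in outcomes]
    z = sum(weights)
    return {x: w / z for x, w in zip(outcomes, weights)}

# Two binary constraints on three outcomes: E[1_a] = 0.5 and E[1_{a,b}] = 0.8
# uniquely determine p(a)=0.5, p(b)=0.3, p(c)=0.2; GIS recovers them.
adapted = gis(["a", "b", "c"],
              [lambda x: 1.0 if x == "a" else 0.0,
               lambda x: 1.0 if x in ("a", "b") else 0.0],
              [0.5, 0.8])
```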
Topic information
In approaches exploiting the general topic of the discourse, the corpus A is used to extract information about the underlying subject matter. This information is then used in various ways to improve upon the background model based on semantic classification.
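A typical mechanism is an adaptive mixture, P(w) = Σ_t P(t) P_t(w), whose topic weights P(t) are re-estimated from recently observed text. The sketch below performs one EM re-estimation step over hypothetical topic unigram models (the vocabularies and probabilities are invented for illustration):

```python
def update_topic_weights(words, topic_models, weights, floor=1e-6):
    """One EM step: re-estimate P(topic) from recent words, under the
    mixture P(w) = sum_t P(t) * P_t(w). `floor` handles unseen words."""
    new = [0.0] * len(weights)
    for w in words:
        # joint probability of (topic, word) under current weights
        joint = [wt * tm.get(w, floor) for wt, tm in zip(weights, topic_models)]
        z = sum(joint)
        for t, j in enumerate(joint):
            new[t] += j / z          # accumulate topic posteriors
    return [n / len(words) for n in new]

# Hypothetical topic unigrams: after a few finance-flavored words,
# the finance weight dominates the mixture.
finance = {"interest": 0.3, "rate": 0.3, "bank": 0.2}
sports = {"game": 0.3, "score": 0.3, "team": 0.2}
w = update_topic_weights(["interest", "rate", "bank"],
                         [finance, sports], [0.5, 0.5])
```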
Semantic knowledge
Approaches taking advantage of semantic knowledge purport to exploit not just topic information as above, but the entire semantic fabric of the corpus A, so they usually involve a finer level of granularity and/or some sort of dimensionality reduction.
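In latent semantic analysis, for instance, the dimensionality reduction is a truncated singular value decomposition of a word-document co-occurrence matrix, which places words occurring in similar documents close together in a low-dimensional space. A sketch using NumPy, with a tiny count matrix invented for illustration:

```python
import numpy as np

# Toy word-document count matrix (rows: words, columns: documents).
W = np.array([[3.0, 2.0, 0.0],   # "bank"
              [2.0, 3.0, 0.0],   # "rate"
              [0.0, 0.0, 3.0],   # "game"
              [0.0, 1.0, 2.0]])  # "score"

# Truncated SVD: keep the R largest singular values, W ~ U_R S_R V_R^T.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
word_vecs = U[:, :R] * S[:R]     # word representations in the latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "bank" and "rate" share document contexts, so their latent vectors
# end up much closer than those of "bank" and "game".
sim_bank_rate = cos(word_vecs[0], word_vecs[1])
sim_bank_game = cos(word_vecs[0], word_vecs[2])
```

In practice the counts are weighted (e.g., by entropy or tf-idf style normalization) before the SVD, and the latent similarity is folded back into the n-gram probability.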
Syntactic infrastructure
Approaches leveraging syntactic knowledge make the implicit assumption that the background and recognition tasks share a common grammatical infrastructure, so that grammatical constraints are largely portable from corpus B to corpus A. The background SLM is then used for initial syntactic modeling, and the corpus A to re-estimate the associated parameters.
Multiple sources
In approaches exploiting multiple knowledge sources, the corpus A is used to extract information about different aspects of the mismatch between training and recognition conditions. It stands to reason that, if it is helpful to address a particular type of linguistic mismatch in isolation, performance should be even better with an integrated approach to SLM adaptation.
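One common integration device is a log-linear combination, P(w|h) ∝ Π_k P_k(w|h)^λ_k, renormalized over the vocabulary, which lets each knowledge source contribute with its own exponent. A minimal sketch with two hypothetical sources (an n-gram estimate and a semantic estimate, both invented for illustration):

```python
import math

def loglinear(models, lams, vocab, history=None):
    """Combine sources: P(w|h) proportional to prod_k P_k(w|h)**lam_k,
    renormalized over the given vocabulary."""
    score = {w: math.exp(sum(l * math.log(m(w, history))
                             for l, m in zip(lams, models)))
             for w in vocab}
    z = sum(score.values())
    return {w: s / z for w, s in score.items()}

# Toy sources: the n-gram model favors "rate"; the semantic model is flat,
# so with unit exponents the combination reproduces the n-gram preference.
p_ngram = lambda w, h: 0.8 if w == "rate" else 0.2
p_sem = lambda w, h: 0.5
combined = loglinear([p_ngram, p_sem], [1.0, 1.0], ["rate", "other"])
```

The exponents λ_k are themselves usually tuned on held-out adaptation data.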
Summary
Language model adaptation refers to the process of exploiting specific, albeit limited, knowledge about the recognition task to compensate for any mismatch between training and recognition. More generally, an adaptive language model seeks to maintain an adequate representation of the domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This involves gathering up-to-date information about the current recognition task, whether a priori or …
References
- Adda, G., Jardino, M., Gauvain, J.L., 1999. Language modeling for broadcast news transcription. In: Proc. 1999 Euro....
- Bahl, L.R., Jelinek, F., Mercer, R.L., 1983. A maximum likelihood approach to continuous speech recognition. IEEE Trans. Pattern Anal. Mach. Intel.
- Bellegarda, J.R., 1998a. Exploiting both local and global constraints for multi-span statistical language modeling. In:...
- Bellegarda, J.R., 1998b. A multi-span language modeling framework for large vocabulary speech recognition. IEEE Trans. Speech Audio Proc.
- Bellegarda, J.R., 2000a. Large vocabulary speech recognition with multi-span statistical language models. IEEE Trans. Speech Audio Proc.
- Bellegarda, J.R., 2000b. Exploiting latent semantic information in statistical language modeling. Proc. IEEE.
- Bellegarda, J.R., 2001. A novel approach to the adaptation of latent semantic information. In: Proc. 2001 ISCA Workshop...
- Bellegarda, J.R., Nahamoo, D., 1990. Tied mixture continuous parameter modeling for speech recognition. IEEE Trans. Acoust. Speech Signal Process.
- Chelba, C., Jelinek, F., 2000. Structured language modeling. Comput. Speech Lang.
- Chen, S.F., Rosenfeld, R., 2000. A survey of smoothing techniques for ME models. IEEE Trans. Speech Audio Proc.
- Church, K.W., 1987. Phonological Parsing in Speech Recognition.
- Darroch, J.N., Ratcliff, D., 1972. Generalized iterative scaling for log-linear models. Ann. Math. Statist.
- Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci.
- Della Pietra, S., Della Pietra, V., Lafferty, J., 1997. Inducing features of random fields. IEEE Trans. Pattern Anal. Mach. Intel.
- Gorin, A., 1995. On automated language acquisition. J. Acoust. Soc. Amer.
- Iyer, R.M., Ostendorf, M., 1999. Modeling long distance dependencies in language: topic mixtures versus dynamic cache models. IEEE Trans. Speech Audio Process.
- Mohri, M., 2000. Minimization algorithms for sequential transducers. Theor. Comput. Sci.
- Probabilistic topic maps: navigating through large text collections.
- Rosenfeld, R., 1996. A maximum entropy approach to adaptive statistical language modeling. Comput. Speech Lang.