Arc minimization in finite-state decoding graphs with cross-word acoustic context☆
Introduction
In the past, there has been a significant division between the decoding processes used for highly constrained, small vocabulary speech recognition tasks, and those used for large vocabulary unconstrained tasks. In the small vocabulary arena, and in domains where a relatively compact grammar is appropriate, it is common to pre-compile a static state-graph. Given such a graph, a simple and efficient implementation of the Viterbi algorithm can be used for subsequent decoding (Viterbi, 1967). For large vocabulary tasks with n-gram language models (LMs), however, it has traditionally been common to avoid a static search space, and to instead dynamically expand the language model as needed (Jelinek et al., 1975; Odell, 1995; Ney and Ortmanns, 1999). While the latter approach has the advantage of never touching potentially large portions of the search space, it has the important disadvantage that dynamic expansion is significantly more complex, and incurs a run-time overhead of its own.
Remarkably, over the course of the past several years, algorithmic and computational advances have made it possible to handle large vocabulary recognition in essentially the same way as grammar-based tasks. In a recent series of papers (Mohri et al., 1998; Mohri et al., 2000; Willett et al., 2001), it has been shown that it is in fact possible to statically compile a state graph that encodes the constraints of both a state-of-the-art language model, and cross-word acoustic context. One of the main algorithmic methods that is used in the process is that of determinization and minimization of the resulting weighted finite-state transducer.
While this previous work (Mohri et al., 1998; Mohri et al., 2000) has established Viterbi decoding on statically compiled graphs to be an effective method for large vocabulary decoding, its use with very long-span acoustic models presents problems that have not been previously solved. As an initial step in this process, a cross-word acoustic-context model (typically triphone or quinphone) is encoded as a finite-state transducer, and used in subsequent operations. As the amount of acoustic context increases, this transducer grows dramatically in size. Quoting Mohri et al. (2000): “More generally, when there are n context independent phones, this triphonic construction gives a transducer with O(n²) states and O(n³) transitions. A tetraphonic construction would give a transducer with O(n³) states and O(n⁴) transitions”. Further evidence of this increase in complexity is given in Chen (2003).
The motivation of this paper arises from a desire to utilize acoustic models with very long-span context sensitivity, where simply writing down a reasonably sized transducer that encodes the context sensitivity becomes a significant challenge. In particular, we are interested in utilizing a full word of cross-word acoustic context, and in this case the number of arcs required to encode the context is very large (proportional to the square of the vocabulary size), and must be minimized.
To this end, we explore a graph building strategy which introduces supplementary states into selected portions of the decoding graph, in return for a large reduction in the number of arcs. The key difference from previous work is that we present a method for introducing non-determinism and extra states in return for reducing the number of arcs. Classical minimization operates on deterministic graphs.
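The arc/state trade-off described above can be illustrated with a toy sketch (the state names and set sizes below are illustrative, not drawn from the paper): connecting a set of predecessor states directly to a set of successor states costs |L|·|R| arcs, whereas routing every connection through a single auxiliary state costs only |L|+|R| arcs, at the price of one extra state and some non-determinism.

```python
# Sketch of factoring a complete bipartite connection through one
# auxiliary state. Arcs are stored as (source, destination) pairs.

def direct_arcs(left, right):
    """Full cross-product connection: |L| x |R| arcs."""
    return [(l, r) for l in left for r in right]

def factored_arcs(left, right, aux):
    """Route every connection through one auxiliary state:
    |L| + |R| arcs, plus one extra (non-deterministic) state."""
    return [(l, aux) for l in left] + [(aux, r) for r in right]

left = [f"w{i}" for i in range(100)]   # 100 predecessor states
right = [f"v{j}" for j in range(80)]   # 80 successor states

print(len(direct_arcs(left, right)))           # 8000 arcs
print(len(factored_arcs(left, right, "aux")))  # 180 arcs
```

The saving grows quadratically with the sizes of the two sets, which is why the factorization matters most in the densely connected unigram portion of the graph.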
A first question we study concerns the existence of tractable algorithms for building a graph with a minimal number of arcs. While it is already known that the related problem of minimizing non-deterministic finite-state automata (NFA) is NP-hard (see, e.g., Jiang and Ravikumar, 1993), it is important to note that on the face of it, our problem is significantly more constrained, and therefore perhaps not in the same complexity class. Nonetheless, through a reduction from the known NP-complete optimization problem of Clique Bipartitioning (Feder and Motwani, 1991), we demonstrate that in fact the problem we are faced with is also NP-complete.
We then propose several simple heuristics to reduce the number of arcs in the graph. The key to our factorization methods is the fact that it is relatively straightforward to enumerate – for each word – the sets of predecessor words that give rise to distinct context dependent acoustic realizations. We refer to such sets of predecessor words as context sets. By carefully identifying subsets of words that occur in multiple context sets, we will show that it is possible to factor them in such a way as to produce highly compact graphs, with a full word of acoustic context, even for large-vocabulary systems.
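As a hypothetical illustration of the simplest case of this idea, words whose context sets are identical can share a single auxiliary state. The sketch below (toy vocabulary and context sets, not from the paper) groups words by their context set and counts the arcs before and after factoring:

```python
from collections import defaultdict

# Toy data: each word maps to its context set, i.e. the set of
# predecessor words inducing distinct context-dependent realizations.
context_sets = {
    "rock": frozenset({"a", "the", "hard"}),
    "road": frozenset({"a", "the", "hard"}),
    "rain": frozenset({"a", "the"}),
}

def factor_identical(context_sets):
    """Factor words with identical context sets through one aux state."""
    groups = defaultdict(list)
    for word, ctx in context_sets.items():
        groups[ctx].append(word)
    direct = sum(len(ctx) for ctx in context_sets.values())
    factored = 0
    for ctx, words in groups.items():
        if len(words) > 1:
            factored += len(ctx) + len(words)  # arcs via one aux state
        else:
            factored += len(ctx) * len(words)  # no gain, keep direct arcs
    return direct, factored

print(factor_identical(context_sets))  # (8, 7)
```

The heuristics in Section 3 go further by factoring shared *subsets* of context sets rather than only exact matches, but the accounting is the same.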
This paper is organized as follows. In Section 2.2, we present the basic structure of the decoding graphs, and in Sections 2.3 (left-context graphs) and 2.4 (right-context graphs) we illustrate the problem of arc minimization in the case of a unigram language model with various kinds of cross-word contextual dependencies. When larger n-grams are used, this unigram portion occurs as a subgraph, and accounts for the majority of the context-induced arcs.
In Section 3, we cast the problem formally, show that it is NP-complete, and present two simple heuristics for generating compact graphs. In Section 4, we apply these techniques, and present results on graph size, runtime, and word-error rate with various acoustic and linguistic models on the Switchboard and EARS Rich Transcription evaluations.
Section snippets
Motivations
Over the past few years, the technology of LVCSR systems has significantly evolved, making them able to accommodate increasingly large vocabularies and richer knowledge sources, while keeping computing costs at a reasonable level. In this section, we discuss the benefits of precompiling the search network into a finite-state graph in the context of single pass search strategies.
Problem definition
Before presenting our graph minimization strategies, we introduce some definitions. G=(X=(L,R),E) is a bipartite graph if the vertices in X can be partitioned as X=L∪R, and every edge in E links a vertex in L with a vertex in R. We denote by n(G) the order of G (the total number of vertices). A biclique B=(X′=(L′,R′),E′) in G is a complete partial subgraph of G, meaning that E′ includes every possible edge from L′ to R′. An edge cover of G into bicliques is provided by subsets E1,…,Ek of E such
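A minimal sketch of these definitions on toy data (the graph below is illustrative, not from the paper): a biclique (L′, R′) contributes every L′×R′ edge, and a family of bicliques is an edge cover of G when their edges jointly include all of E.

```python
# Toy bipartite graph G = ((L, R), E).
L = {1, 2, 3}
R = {"a", "b"}
E = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")}

def biclique_edges(Lp, Rp):
    """All edges of the complete bipartite graph on (L', R')."""
    return {(l, r) for l in Lp for r in Rp}

def is_biclique(Lp, Rp, E):
    """(L', R') is a biclique of G iff every L' x R' edge is in E."""
    return biclique_edges(Lp, Rp) <= E

# A cover of E by two bicliques.
cover = [({1, 2}, {"a", "b"}), ({3}, {"a"})]
assert all(is_biclique(Lp, Rp, E) for Lp, Rp in cover)
assert set().union(*(biclique_edges(Lp, Rp) for Lp, Rp in cover)) == E
print("valid biclique edge cover of size", len(cover))
```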
System description
For the experiments reported in Section 4.2, we used a Switchboard system based on an 18K vocabulary, with more than 300K left-context variants. Speech features are derived from 24-dimensional MFCCs, further transformed into a canonical space through the application of normalization techniques (VTLN and FMLLR), and then projected onto a discriminative 60-dimensional space using heteroscedastic discriminant analysis (HDA) (Saon et al., 2000). Acoustic modeling uses cross-word context-dependent
Conclusion and perspectives
In this paper, we have proposed a new methodology for building static decoding graphs for LVCSR systems which can accommodate acoustic models exhibiting patterns of long-distance contextual dependencies (up to one word on the left). We have shown that the same minimization problem occurs for both left and right cross-word dependencies, and that this problem is in fact NP-hard. By developing heuristics for decomposing the connectivity in the unigram portion of the graph, we are now able to
Acknowledgements
The authors thank Brian Kingsbury, Lidia Mangu, and Stanley Chen for useful comments, insights, and portions of the language and acoustic models. The authors also thank the two anonymous reviewers for helping to sharpen the presentation.
References (33)
- Aubert, X., 2000. A brief overview of decoding techniques for large vocabulary continuous speech recognition. In:...
- Chen, S.F., 2003. Compiling large context phonetic decision trees into finite-state transducers. In: Proceedings of...
- Chen, S.F., Goodman, J., 1996. An empirical study of smoothing techniques for language modeling. In: Proceedings of the...
- Dolfing, H.J., Hetherington, I.L., 2001. Incremental language models for speech recognition using finite-state...
- Feder, T., Motwani, R., 1991. Clique partitions, graph compression and speeding-up algorithms. In: Proceedings of the...
- The NP-completeness of some edge partition problems. SIAM Journal of Computer Science (1981)
- Introduction to Automata Theory, Languages and Computation (1979)
- Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory (1975)
- Minimal NFA problems are hard. SIAM Journal on Computing (1993)
- Kanthak, S., Ney, H., Riley, M., Mohri, M., 2002. A comparison of two LVR search optimization techniques. In:...
- Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing
- Minimization of sequential transducers. Lecture Notes in Computer Science
- Network optimisation for large vocabulary speech recognition. Speech Communication
☆ This work was performed while F. Yvon was visiting the IBM T.J. Watson Research Center.