Arc minimization in finite-state decoding graphs with cross-word acoustic context

https://doi.org/10.1016/j.csl.2003.09.006

Abstract

Recent approaches to large-vocabulary decoding with weighted finite-state transducers have focused on the use of determinization and minimization algorithms to produce compact decoding graphs. This paper addresses the problem of compiling decoding graphs with long-span cross-word context dependency between acoustic models. To this end, we extend the finite-state approach by developing complementary arc-factorization techniques that operate on non-deterministic graphs. These techniques allow us to statically compile decoding graphs in which the acoustic models utilize a full word of cross-word context, in marked contrast to typical systems, which use only a single phone. We show that the particular arc-minimization problem that arises is an NP-complete combinatorial optimization problem. Heuristics for this problem are then presented and used in experiments on a Switchboard task, illustrating the moderate sizes and runtimes of the graphs we build.

Introduction

Historically, there has been a significant division between the decoding processes used for highly constrained, small-vocabulary speech recognition tasks and those used for large-vocabulary, unconstrained tasks. In the small-vocabulary arena, and in domains where a relatively compact grammar is appropriate, it is common to pre-compile a static state graph. Given such a graph, a simple and efficient implementation of the Viterbi algorithm can be used for subsequent decoding (Viterbi, 1967). For large-vocabulary tasks with n-gram language models (LMs), however, it has traditionally been common to avoid a static search space and instead to dynamically expand the language model as needed (Jelinek et al., 1975; Odell, 1995; Ney and Ortmanns, 1999). While the latter approach has the advantage of never touching potentially large portions of the search space, it has the important disadvantage that dynamic expansion is significantly more complex and incurs a run-time overhead of its own.

Remarkably, over the course of the past several years, algorithmic and computational advances have made it possible to handle large-vocabulary recognition in essentially the same way as grammar-based tasks. A recent series of papers (Mohri et al., 1998; Mohri et al., 2000; Willett et al., 2001) has shown that it is in fact possible to statically compile a state graph that encodes the constraints of both a state-of-the-art language model and cross-word acoustic context. Chief among the algorithmic methods used in this process are determinization and minimization of the resulting weighted finite-state transducer.

While this previous work (Mohri et al., 1998; Mohri et al., 2000) has established Viterbi decoding on statically compiled graphs to be an effective method for large-vocabulary decoding, its use with very long-span acoustic models presents problems that have not previously been solved. As an initial step in the compilation process, a cross-word acoustic-context model (typically triphone or quinphone) is encoded as a finite-state transducer and used in subsequent operations. As the amount of acoustic context increases, this transducer grows dramatically in size. Quoting Mohri et al. (2000): “More generally, when there are n context-independent phones, this triphonic construction gives a transducer with O(n²) states and O(n³) transitions. A tetraphonic construction would give a transducer with O(n³) states and O(n⁴) transitions”. Further evidence of this increase in complexity is given in Chen (2003).
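
To make the quoted growth concrete, the short sketch below (our own illustration, with an assumed inventory of n = 45 context-independent phones) tabulates the state and transition counts implied by these bounds:

    # Illustrative only: state/transition counts implied by the O(n^k)
    # bounds quoted above, for an assumed inventory of n = 45 phones.
    n = 45
    for name, states, arcs in [
        ("triphone", n**2, n**3),     # O(n^2) states, O(n^3) transitions
        ("tetraphone", n**3, n**4),   # O(n^3) states, O(n^4) transitions
    ]:
        print(f"{name}: {states:,} states, {arcs:,} transitions")
    # triphone: 2,025 states, 91,125 transitions
    # tetraphone: 91,125 states, 4,100,625 transitions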

The motivation for this paper arises from a desire to utilize acoustic models with very long-span context sensitivity, where simply writing down a reasonably sized transducer that encodes the context sensitivity becomes a significant challenge. In particular, we are interested in utilizing a full word of cross-word acoustic context; in this case, the number of arcs required to encode the context is very large (proportional to the square of the vocabulary size) and must be minimized.

To this end, we explore a graph-building strategy that introduces supplementary states into selected portions of the decoding graph in return for a large reduction in the number of arcs. The key difference from previous work is that we present a method for introducing non-determinism and extra states in return for a reduced arc count, whereas classical minimization operates on deterministic graphs.
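
The arithmetic behind this trade-off is simple: if m predecessor states must each connect to all of n successor states, a direct connection costs m × n arcs, whereas routing through a single auxiliary state costs only m + n arcs, at the price of one extra state and some non-determinism. A minimal sketch of the count (the fan-in/fan-out sizes here are hypothetical, chosen only to match the vocabulary scale discussed above):

    # Arc counts for connecting m predecessors to n successors directly,
    # versus through one shared auxiliary state.
    def direct_arcs(m: int, n: int) -> int:
        return m * n      # complete bipartite connection

    def factored_arcs(m: int, n: int) -> int:
        return m + n      # one arc into and one arc out of the hub state

    m = n = 18_000        # hypothetical full-vocabulary fan-in/fan-out
    print(direct_arcs(m, n))    # 324,000,000 arcs
    print(factored_arcs(m, n))  # 36,000 arcs, plus 1 extra state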

A first question we study concerns the existence of tractable algorithms for building a graph with a minimal number of arcs. While it is already known that the related problem of minimizing non-deterministic finite-state automata (NFAs) is NP-hard (see, e.g., Jiang and Ravikumar, 1993), our problem is, on the face of it, significantly more constrained, and therefore perhaps not in the same complexity class. Nonetheless, through a reduction from the known NP-complete problem of clique bipartitioning (Feder and Motwani, 1991), we demonstrate that the problem we face is in fact also NP-complete.

We then propose several simple heuristics to reduce the number of arcs in the graph. The key to our factorization methods is that it is relatively straightforward to enumerate, for each word, the sets of predecessor words that give rise to distinct context-dependent acoustic realizations. We refer to such sets of predecessor words as context sets. By carefully identifying subsets of words that occur in multiple context sets, we will show that it is possible to factor them so as to produce highly compact graphs with a full word of acoustic context, even for large-vocabulary systems.
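
As a toy illustration of this idea (our own sketch with invented data, not the heuristics developed in Section 3), context sets can be represented as sets of predecessor words; predecessor subsets that recur across many context sets are candidates for factorization through a single auxiliary state:

    from collections import Counter
    from itertools import combinations

    # Hypothetical context sets: word -> predecessor-word sets, one per
    # distinct context-dependent acoustic realization of the word.
    context_sets = {
        "rock": [frozenset({"a", "the", "this"}), frozenset({"punk", "hard"})],
        "rain": [frozenset({"a", "the", "this"}), frozenset({"heavy"})],
        "road": [frozenset({"a", "the", "this", "my"})],
    }

    # Count pairwise intersections: a predecessor subset recurring across
    # context sets can be routed through one shared auxiliary state.
    all_sets = [s for sets in context_sets.values() for s in sets]
    shared = Counter()
    for s1, s2 in combinations(all_sets, 2):
        common = s1 & s2
        if len(common) > 1:   # factoring only pays off for subsets of size > 1
            shared[common] += 1

    for subset, pairs in shared.most_common():
        print(sorted(subset), "is shared by", pairs, "pairs of context sets")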

This paper is organized as follows. In Section 2.2, we present the basic structure of the decoding graphs, and in Sections 2.3 (left-context graphs) and 2.4 (right-context graphs) we illustrate the problem of arc minimization in the case of a unigram language model with various kinds of cross-word contextual dependencies. When larger n-grams are used, this unigram portion occurs as a subgraph and accounts for the majority of the context-induced arcs.

In Section 3, we cast the problem formally, show that it is NP-complete, and present two simple heuristics for generating compact graphs. In Section 4, we apply these techniques and present results on graph size, runtime, and word error rate with various acoustic and linguistic models on the Switchboard and EARS Rich Transcription evaluations.


Motivations

Over the past few years, the technology of LVCSR systems has evolved significantly, enabling them to accommodate increasingly large vocabularies and richer knowledge sources while keeping computing costs at a reasonable level. In this section, we discuss the benefits of precompiling the search network into a finite-state graph in the context of single-pass search strategies.

Problem definition

Before presenting our graph minimization strategies, we introduce some definitions. G = (X = (L, R), E) is a bipartite graph if the vertices in X can be partitioned as X = L ∪ R, and all edges in E link a vertex in L with a vertex in R. We denote by n(G) the order of G (the total number of vertices). A biclique B = (X′ = (L′, R′), E′) in G is a complete bipartite subgraph of G, meaning that E′ includes every possible edge between L′ and R′. An edge cover of G into bicliques is provided by subsets E1, …, Ek of E such that each Ei is the edge set of a biclique of G and every edge of E belongs to at least one Ei.
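
To make these definitions concrete, the following naive greedy heuristic (our own sketch, not one of the heuristics proposed in Section 3) covers the edges of a bipartite graph with bicliques; each biclique (L′, R′) can then be realized with |L′| + |R′| arcs through one auxiliary state instead of |L′| × |R′| direct arcs:

    # Naive greedy biclique edge cover of a bipartite graph with edge set E.
    def greedy_biclique_cover(edges):
        """edges: set of (l, r) pairs. Returns bicliques (Lp, Rp) whose
        union covers every edge; an edge may be covered more than once."""
        left = {l for (l, r) in edges}
        uncovered = set(edges)
        cover = []
        while uncovered:
            l0, _ = next(iter(uncovered))              # seed from any uncovered edge
            Rp = {r for (l, r) in edges if l == l0}    # all successors of l0
            Lp = {l for l in left                      # predecessors linked to all of Rp
                  if all((l, r) in edges for r in Rp)}
            cover.append((Lp, Rp))
            uncovered -= {(l, r) for l in Lp for r in Rp}
        return cover

    E = {("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "y")}
    for Lp, Rp in greedy_biclique_cover(E):
        # Two bicliques suffice here, e.g. {a,b} x {x,y} and {a,b,c} x {y}.
        print(sorted(Lp), "x", sorted(Rp))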

System description

For the experiments reported in Section 4.2, we used a Switchboard system based on an 18K-word vocabulary with more than 300K left-context variants. Speech features are derived from 24-dimensional MFCCs, further transformed into a canonical space through the application of normalization techniques (VTLN and FMLLR), and then projected onto a discriminative 60-dimensional space using heteroscedastic discriminant analysis (HDA) (Saon et al., 2000). Acoustic modeling uses cross-word context-dependent

Conclusion and perspectives

In this paper, we have proposed a new methodology for building static decoding graphs for LVCSR systems that can accommodate acoustic models exhibiting patterns of long-distance contextual dependency (up to one word on the left). We have shown that the same minimization problem occurs for both left and right cross-word dependencies, and that this problem is in fact NP-hard. By developing heuristics for decomposing the connectivity in the unigram portion of the graph, we are now able to

Acknowledgements

The authors thank Brian Kingsbury, Lidia Mangu, and Stanley Chen for useful comments, insights, and portions of the language and acoustic models. The authors also thank the two anonymous reviewers for helping to sharpen the presentation.

References

  • Aubert, X., 2000. A brief overview of decoding techniques for large vocabulary continuous speech recognition. In:...
  • Chen, S.F., 2003. Compiling large context phonetic decision trees into finite-state transducers. In: Proceedings of...
  • Chen, S.F., Goodman, J., 1996. An empirical study of smoothing techniques for language modeling. In: Proceedings of the...
  • Dolfing, H.J., Hetherington, I.L., 2001. Incremental language models for speech recognition using finite-state...
  • Feder, T., Motwani, R., 1991. Clique partitions, graph compression and speeding-up algorithms. In: Proceedings of the...
  • Holyer, I., 1981. The NP-completeness of some edge partition problems. SIAM Journal on Computing.
  • Hopcroft, J.E., et al., 1979. Introduction to Automata Theory, Languages and Computation.
  • Jelinek, F., et al., 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory.
  • Jiang, T., Ravikumar, B., 1993. Minimal NFA problems are hard. SIAM Journal on Computing.
  • Kanthak, S., Ney, H., Riley, M., Mohri, M., 2002. A comparison of two LVR search optimization techniques. In:...
  • Katz, S.M., 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing.
  • Le, A., 2003. Rich transcription 2003: Spring STT evaluation results. Presentation at the RT 2003 Spring Workshop....
  • Mohri, M., 1994. Minimization of sequential transducers. Lecture Notes in Computer Science.
  • Mohri, M., et al., 1998. Network optimisation for large vocabulary speech recognition. Speech Communication.
  • Mohri, M., Riley, M., Hindle, D., Ljolje, A., Pereira, F., 1998. Full expansion of context-dependent networks in large...
  • Mohri, M., Riley, M., Pereira, F.C.N., 2000. Weighted finite-state transducers in speech recognition. In: Proceedings...
This work was performed while F. Yvon was visiting the IBM T.J. Watson Research Center.
