Unlimited vocabulary speech recognition with morph language models applied to Finnish

https://doi.org/10.1016/j.csl.2005.07.002

Abstract

In speech recognition for highly inflecting or compounding languages, traditional word-based language modeling is problematic. As the number of distinct word forms can grow very large, it becomes difficult to train language models that are both effective and cover the words of the language well. In the literature, several methods have been proposed for basing language modeling on sub-word units instead of whole words. However, to our knowledge, considerable improvements in speech recognition performance have not been reported.

In this article, we present a language-independent algorithm for discovering word fragments in an unsupervised manner from text. The algorithm uses the Minimum Description Length principle to find an inventory of word fragments that is compact but models the training text effectively. Language modeling and speech recognition experiments show that n-gram models built over these fragments perform better than n-gram models based on words. In two Finnish recognition tasks, relative error rate reductions between 12% and 31% are obtained. In addition, our experiments suggest that word fragments obtained using grammatical rules do not outperform the fragments discovered from text. We also present our recognition system and discuss how utilizing fragments instead of words affects the decoding process.

Introduction

Certain natural languages present interesting key problems that have received little attention in the development of large vocabulary continuous speech recognition (LVCSR) for English. One major problem is the number of distinct word forms that appear in everyday use. The conventional way of building statistical language models has been to collect co-occurrence statistics on words, such as n-grams. If the language can be covered well by a lexicon of reasonable size, it is possible to train statistical models using available toolkits, given enough training data and computational resources.

In many languages, however, the word-based approach has clear disadvantages. In highly inflecting languages, such as Finnish and Hungarian, there may be thousands of different word forms of the same root, which makes the construction of a fixed lexicon with reasonable coverage hardly feasible. Likewise, in compounding languages, such as German, Swedish, Greek and Finnish, complex concepts can be expressed in a single word, which considerably increases the number of possible word forms. This leads to data sparsity problems in n-gram language modeling.
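To give a rough sense of the scale for Finnish: a noun stem combines with roughly fifteen cases, two numbers, possessive suffixes and clitic particles. The back-of-the-envelope sketch below illustrates the combinatorics; the counts are deliberate simplifications (real paradigms also involve stem alternations and clitic combinations), so it only indicates the order of magnitude.

    # Rough combinatorics of Finnish nominal inflection.
    # All counts below are approximate and for illustration only.
    cases = 15        # nominative, genitive, partitive, inessive, ...
    numbers = 2       # singular, plural
    possessives = 7   # six possessive suffixes, or none
    clitics = 4       # e.g. -kin, -ko, -pa, or none (simplified)

    print(cases * numbers * possessives * clitics)  # => 840 forms per noun stem

Allowing clitic combinations and derivational suffixes pushes the count per root well into the thousands, consistent with the figure cited above.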

In recent years, several approaches have been proposed for dealing with vocabulary growth in large vocabulary speech recognition for different languages. Geutner et al. (1998) presented a two-pass recognition approach for increasing the vocabulary adaptively: in the first pass, a traditional word lexicon was used to create a word lattice for each speech segment, and before the second pass, the inflectional forms of the words were added to the lattice. In a Serbo-Croatian task, they reported a word accuracy improvement from 64.0% to 69.8%. McTait and Adda-Decker (2003), on the other hand, reported that recognition performance in a German task could be improved by increasing the lexicon size: using a lexicon of 300,000 words instead of 60,000 words lowered the word error rate from 20.4% to 18.5%.

Factored language models (Bilmes and Kirchhoff, 2003; Kirchhoff et al., 2003) have recently been proposed for incorporating morphological knowledge into the modeling of inflecting languages. Instead of conditioning probabilities on a few preceding words, the probabilities are conditioned on sets of features derived from words. These features (or factors) can include, for example, morphological, syntactic and semantic information. Vergyri et al. (2004) presented experiments on Arabic speech recognition and reported minor word error rate reductions.

Another promising direction has been to abandon words as the basic units of language modeling and speech recognition. As prefixes, suffixes and compound words cause the vocabulary growth in many languages, a logical idea is to split words into shorter units, and to base the language modeling and recognition on these word fragments. Several approaches have been proposed for different languages, and perplexity reductions have been achieved, but few works have reported clear recognition improvements. Byrne et al. (2000) used a morphological analyzer for Czech to split words into stems and endings. A language model based on a vocabulary of 9600 morphemes gave better results than a model based on a vocabulary of 20,000 words. However, with larger vocabularies (61,000 words and 25,000 morphemes), the word-based models performed better (Byrne et al., 2001). Kwon and Park (2003) also used a morphological analyzer to obtain morphemes for a Korean recognition task, and reported that merging short morphemes together improved results. Szarvas and Furui (2003) used an analyzer to obtain morphemes for a Hungarian task; additionally, morphosyntactic rules were incorporated into the model, allowing only grammatical morpheme combinations. Relative morpheme error rate reductions between 1.7% and 7.2% were obtained.

In contrast to using a morphological analyzer, data-driven algorithms for splitting words into smaller units have also been investigated in speech recognition. Whittaker and Woodland (2000) proposed an algorithm for segmenting a text corpus into fragments that maximize the 2-gram likelihood of the segmented corpus. Small improvements in error rates (2.2% relative) were obtained in an English recognition task when the sub-word model was interpolated with a traditional word-based 3-gram model. Ordelman et al. (2003) presented a method for decomposing Dutch compound words automatically, and reported minor improvements in error rates.

To our knowledge, there is little previous work on basing language modeling and recognition on sub-word units for Finnish LVCSR. Kneissler and Klakow (2001) segmented a corpus into word fragments that maximize the 1-gram likelihood of the corpus. Four different segmentation strategies, requiring various amounts of input from an expert in the Finnish language, were compared in a Finnish dictation task. However, no comparisons to traditional word models were performed.

There are a number of works that aim at learning the morphology of a natural language in a fully unsupervised manner from data. Often words are assumed to consist of one stem typically followed by one suffix. Sometimes prefixes are possible. The work by Goldsmith (2001) exemplifies such an approach and gives a survey of the field. The morphologies discovered by these algorithms have not been applied in speech recognition. It seems that this kind of method is not suitable for agglutinative languages, such as Finnish, where words may consist of lengthy sequences of concatenated morphemes.

Morpheme-like units have also been discovered by algorithms for word segmentation, i.e., algorithms that discover word boundaries in text without blanks. Deligne and Bimbot (1997) derive a model structure that can be used both for word segmentation and for detecting variable-length acoustic units in speech data. Their data-driven units do not, however, produce as good results as conventional word models in recognizing the speech of French weather forecasts. Brent (1999) is mainly interested in the acquisition of a lexicon in an incremental fashion and applies his probabilistic model to the segmentation of transcripts of child-directed speech.

In this work, we make use of word fragments in language modeling and speech recognition. To avoid using a huge word vocabulary consisting of hundreds of thousands of distinct word forms, we split the words into frequently occurring sub-word units. We present an algorithm that discovers such word fragments from a text corpus in a fully unsupervised manner. The fragment inventory, or lexicon, is optimized for the given corpus according to a model based on the information-theoretic Minimum Description Length (MDL) principle (Rissanen, 1989). The resulting fragments are here referred to as statistical morphs, as the boundaries of the fragments often coincide with grammatical morpheme boundaries.
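To make the MDL objective concrete, the two-part cost minimized by this family of models can be written as follows (the notation here is generic, not necessarily that of the article):

    C(\mathcal{L}, \text{corpus}) = \sum_{m \in \mathcal{L}} l(m) \;-\; \sum_{i=1}^{N} \log p(m_i)

where \mathcal{L} is the fragment lexicon, l(m) is the code length needed to spell out fragment m, and the corpus is a sequence of N fragment tokens m_1, ..., m_N coded under a unigram model. A compact lexicon keeps the first term small; a lexicon that models the training text effectively keeps the second term small, which is exactly the trade-off described above.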

The algorithm is motivated by the following features. The resulting model can cover the whole language, achieving a 0% out-of-vocabulary (OOV) rate with a reasonably sized but still apparently meaningful set of word fragments. The degree of word splitting is influenced by the size of the training corpus, and foreign words are split as well, because no language-dependent assumptions are involved. A word can be split into a long sequence of fragments, which makes the model suitable for agglutinative languages. An earlier version of the method has already given good results in Finnish and Turkish recognition tasks (Siivola et al., 2003; Hacioglu et al., 2003).

In this article, we give a detailed description of the algorithm for segmenting a text corpus into statistical morphs, and compare the resulting language models with models based on two alternative methods. The other models are also capable of generating the whole language, albeit in a more simplistic manner: words augmented with phonemes, and fragments based on automatic grammatical analysis augmented with phonemes. The language modeling and recognition performance of n-gram models built using these units are evaluated in two Finnish tasks. We also discuss how the use of fragments affects the decoder of our speech recognition system.
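As a concrete, hedged illustration of the fragment-based setup: the training text is segmented into fragments, an explicit word-boundary symbol is inserted between words so that word boundaries can be restored after decoding (a common arrangement in this line of work), and ordinary n-gram statistics are collected over the fragment stream. The helper names, the boundary symbol "<w>" and the hand-made segmentation below are ours, not the article's.

    from collections import Counter

    def to_fragment_stream(words, segment, boundary="<w>"):
        """Turn a word sequence into a fragment sequence with explicit
        word-boundary symbols, so words can be restored after decoding."""
        stream = [boundary]
        for w in words:
            stream.extend(segment(w))
            stream.append(boundary)
        return stream

    def ngram_counts(stream, order=3):
        """Plain n-gram counts over the fragment stream; a real model
        would add smoothing on top of these counts."""
        counts = Counter()
        for i in range(len(stream) - order + 1):
            counts[tuple(stream[i:i + order])] += 1
        return counts

    # Hand-made segmentation for illustration (not the algorithm's output):
    seg = {"puhelimessa": ["puhelime", "ssa"]}
    stream = to_fragment_stream(["puhelimessa", "on"],
                                lambda w: seg.get(w, [w]))
    # stream == ['<w>', 'puhelime', 'ssa', '<w>', 'on', '<w>']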

In Section 2, we present the two Finnish tasks and the performance measures used in the experiments. Section 3, the central section of the paper, describes the statistical model and algorithm for segmenting a text corpus into word fragments; the alternative approaches for producing complete-coverage vocabularies are also presented, together with comparative cross-entropy experiments. Section 4 describes the recognition system: the acoustic models are presented briefly, with emphasis on the duration models for Finnish phonemes, followed by a description of the decoder. The results of the experiments are given and discussed in Section 5, and Section 6 concludes the work.


The Finnish evaluation task

This section describes the LVCSR task that we propose for evaluating the new language models for Finnish. The language research community for Finnish is rather small, and extensive text and speech corpora for language modeling and speech recognition research do not exist yet. However, the Finnish IT Center for Science has an ongoing project, Kielipankki (Language Bank), which collects Finnish and Swedish text and speech data that can be obtained for research purposes.

Language modeling with data-driven units

In this section, we propose to solve the problem of large word vocabularies by producing a lexicon of word fragments and estimating n-gram language models over these fragments instead of entire words. Our algorithm learns a set of word fragments from a large text corpus or a corpus vocabulary in an unsupervised manner, and utilizes a model that is based on the MDL principle. The algorithm has obvious merits as it is not language-dependent and it relies on a model with a principled formulation, …
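As a minimal sketch of this kind of MDL-driven segmentation, consider the simplified greedy procedure below. The cost function, the fixed per-letter code length and the search strategy are illustrative simplifications of the two-part MDL cost given in the introduction, not the article's exact algorithm.

    import math
    from collections import Counter

    ALPHABET_BITS = math.log2(28)  # crude per-letter code length (a-z, '-', end marker)

    def total_cost(counts):
        """Two-part MDL cost: bits for the corpus as a unigram fragment
        sequence, plus bits to spell out every distinct fragment."""
        n = sum(counts.values())
        corpus = -sum(c * math.log2(c / n) for c in counts.values())
        lexicon = sum((len(m) + 1) * ALPHABET_BITS for m in counts)
        return corpus + lexicon

    def split_once(counts, frag, parts, freq):
        """Counts after replacing every occurrence of `frag` by `parts`."""
        new = counts.copy()
        del new[frag]
        for p in parts:
            new[p] += freq
        return new

    def train(word_freqs, passes=5):
        """Greedily split lexicon entries in two whenever that lowers the
        total description length; iterate until no split helps."""
        counts = Counter(word_freqs)
        for _ in range(passes):
            improved = False
            for frag in list(counts):
                freq = counts.get(frag, 0)
                if freq == 0 or len(frag) < 4:
                    continue
                base = total_cost(counts)
                for i in range(2, len(frag) - 1):
                    trial = split_once(counts, frag, (frag[:i], frag[i:]), freq)
                    cost = total_cost(trial)
                    if cost < base:
                        counts, base, improved = trial, cost, True
                        break  # accept the first improving split, move on
            if not improved:
                break
        return counts

For example, train({"autossa": 8, "talossa": 10, "talossani": 4}) may isolate shared pieces such as "talo" and "ssa" when the counts support it; real training corpora are of course far larger than this toy input.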

Acoustic modeling

In this section, we describe the acoustic models used in the experiments. Since our emphasis is on the language modeling, the acoustic part is not discussed in great detail. The main difference from corresponding English LVCSR systems, in addition to the Finnish phonemes, is the use of monophone models combined with explicit models of phone duration.
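Finnish contrasts short and long phonemes (e.g., "tuli" vs. "tuuli" vs. "tulli"), which is why explicit duration modeling is worthwhile. The sketch below shows one way an explicit duration score can be combined with an HMM path score; the gamma distribution family and the weighting scheme are our assumptions for illustration, not necessarily the article's choices.

    import math

    def gamma_log_pdf(d, shape, scale):
        """Log-density of a gamma distribution, a common parametric
        choice for phone durations measured in frames."""
        return ((shape - 1.0) * math.log(d) - d / scale
                - math.lgamma(shape) - shape * math.log(scale))

    def rescore_path(acoustic_logprob, phone_durations, duration_params, weight=1.0):
        """Add an explicit duration score to a decoded path.

        phone_durations: list of (phone, duration_in_frames) pairs
        duration_params: per-phone (shape, scale) parameters
        weight:          balance against the acoustic score, tuned on
                         development data like the other model weights
        """
        dur_score = sum(gamma_log_pdf(d, *duration_params[ph])
                        for ph, d in phone_durations)
        return acoustic_logprob + weight * dur_score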

Setup

The recognition system was evaluated in both the Book and the News tasks (see Section 2). For each of the lexicon types (statistical morphs, grammatical morphs, words, and words-OOV) and for each n-gram model (orders 3–5), development data were used to tune the weight between the acoustic and language models so as to minimize the phoneme error rate. We did not study n-gram models of order 6 or 7 in the recognition experiments because of their high memory requirements.
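The tuning step itself is a one-dimensional search; the sketch below shows the idea. Here decode and phoneme_error_rate are stand-ins for the real recognizer and scoring routine, and the grid values are illustrative, not those used in the experiments.

    def tune_lm_weight(dev_set, decode, phoneme_error_rate,
                       grid=(6, 8, 10, 12, 14, 16)):
        """Pick the language model weight minimizing the mean phoneme error
        rate on development data; `dev_set` holds (audio, reference) pairs."""
        def mean_per(weight):
            errors = [phoneme_error_rate(decode(audio, weight), reference)
                      for audio, reference in dev_set]
            return sum(errors) / len(errors)
        return min(grid, key=mean_per)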

Conclusions

In this article, we have described and evaluated language models based on the segmentation of text corpora into suitable word fragments by an unsupervised machine learning algorithm. We have shown how such an automatically derived lexicon can be used effectively in language modeling and speech recognition for Finnish, which is a good example of an agglutinative, highly inflecting and compounding language. Due to the huge number of distinct word forms, the traditional methods based on full words …

Acknowledgments

This work was supported by the Academy of Finland in the projects New information processing principles and New adaptive and learning methods in speech recognition. Funding was also provided by the Finnish National Technology Agency (TEKES) and the Graduate School of Language Technology in Finland. We thank the Finnish Federation of the Visually Impaired, the Departments of Speech Science and General Linguistics of the University of Helsinki, and Inger Ekman from the Department of Information …

References (43)

  • Chen, S.F. et al., 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language.
  • Creutz, M., 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency. In: ...
  • Creutz, M., Lagus, K., 2002. Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and ...
  • Creutz, M., Lagus, K., 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using ...
  • Creutz, M., Lindén, K., 2004. Morpheme segmentation gold standards for Finnish and English. Tech. Rep. A77, ...
  • Gales, M.J.F., 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing.
  • Geutner, P., Finke, M., Scheytt, P., 1998. Adaptive vocabularies for transcribing multilingual broadcast news. In: ...
  • Goldsmith, J., 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics.
  • Goodman, J.T., 2001. A bit of progress in language modeling, extended version. Tech. Rep. MSR-TR-2001-72, Microsoft ...
  • Hacioglu, K., Pellom, B., Ciloglu, T., Ozturk, O., Kurimo, M., Creutz, M., 2003. On lexicon creation for Turkish LVCSR. ...
  • Hakulinen, L., 1979. Suomen kielen rakenne ja kehitys (The structure and development of the Finnish language), 4th Ed., ...