Unlimited vocabulary speech recognition with morph language models applied to Finnish

https://doi.org/10.1016/j.csl.2005.07.002

Abstract

In speech recognition for highly inflecting or compounding languages, traditional word-based language modeling is problematic. As the number of distinct word forms can grow very large, it becomes difficult to train language models that are both effective and cover the words of the language well. In the literature, several methods have been proposed for basing language modeling on sub-word units instead of whole words. However, to our knowledge, considerable improvements in speech recognition performance have not been reported.

In this article, we present a language-independent algorithm for discovering word fragments in an unsupervised manner from text. The algorithm uses the Minimum Description Length principle to find an inventory of word fragments that is compact but models the training text effectively. Language modeling and speech recognition experiments show that n-gram models built over these fragments perform better than n-gram models based on words. In two Finnish recognition tasks, relative error rate reductions between 12% and 31% are obtained. In addition, our experiments suggest that word fragments obtained using grammatical rules do not outperform the fragments discovered from text. We also present our recognition system and discuss how utilizing fragments instead of words affects the decoding process.

Introduction

Certain natural languages present interesting key problems that have received little attention in the development of large vocabulary continuous speech recognition (LVCSR) for English. One major problem is the number of distinct word forms that appear in everyday use. The conventional way of building statistical language models has been to collect co-occurrence statistics on words, such as n-grams. If the language can be covered well by a lexicon of reasonable size, it is possible to train statistical models using available toolkits, given enough training data and computational resources.

In many languages, however, the word-based approach has clear disadvantages. In highly inflecting languages, such as Finnish and Hungarian, there may be thousands of different word forms of the same root, which makes the construction of a fixed lexicon with reasonable coverage hardly feasible. Likewise, in compounding languages, such as German, Swedish, Greek and Finnish, complex concepts can be expressed in a single word, which considerably increases the number of possible word forms. This leads to data sparsity problems in n-gram language modeling.
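To give a rough sense of the scale for Finnish: a noun stem combines with roughly fifteen cases, two numbers, possessive suffixes and clitic particles. The back-of-the-envelope sketch below illustrates the combinatorics; the counts are deliberate simplifications (real paradigms also involve stem alternations and clitic combinations), so it only indicates the order of magnitude.

    # Rough combinatorics of Finnish nominal inflection.
    # All counts below are approximate and for illustration only.
    cases = 15        # nominative, genitive, partitive, inessive, ...
    numbers = 2       # singular, plural
    possessives = 7   # six possessive suffixes, or none
    clitics = 4       # e.g. -kin, -ko, -pa, or none (simplified)

    print(cases * numbers * possessives * clitics)  # => 840 forms per noun stem

Allowing clitic combinations and derivational suffixes pushes the count per root well into the thousands, consistent with the figure cited above.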

In recent years, several approaches have been proposed for dealing with vocabulary growth in large vocabulary speech recognition for different languages. Geutner et al. (1998) presented a two-pass recognition approach for increasing the vocabulary adaptively: in the first pass, a traditional word lexicon was used to create a word lattice for each speech segment, and before the second pass, the inflectional forms of the words were added to the lattice. In a Serbo-Croatian task, they reported a word accuracy improvement from 64.0% to 69.8%. McTait and Adda-Decker (2003), on the other hand, reported that recognition performance in a German task could be improved by increasing the lexicon size: using a lexicon of 300,000 words instead of 60,000 words lowered the word error rate from 20.4% to 18.5%.

Factored language models (Bilmes and Kirchhoff, 2003; Kirchhoff et al., 2003) have recently been proposed for incorporating morphological knowledge into the modeling of inflecting languages. Instead of conditioning probabilities on a few preceding words, the probabilities are conditioned on sets of features derived from words. These features (or factors) can include, for example, morphological, syntactic and semantic information. Vergyri et al. (2004) presented experiments on Arabic speech recognition and reported minor word error rate reductions.

Another promising direction has been to abandon words as the basic units of language modeling and speech recognition. As prefixes, suffixes and compound words cause the vocabulary growth in many languages, a logical idea is to split words into shorter units, and to base the language modeling and recognition on these word fragments. Several approaches have been proposed for different languages, and perplexity reductions have been achieved, but few works have reported clear recognition improvements. Byrne et al. (2000) used a morphological analyzer for Czech to split words into stems and endings. A language model based on a vocabulary of 9600 morphemes gave better results than a model based on a vocabulary of 20,000 words. However, with larger vocabularies (61,000 words and 25,000 morphemes), the word-based models performed better (Byrne et al., 2001). Kwon and Park (2003) also used a morphological analyzer to obtain morphemes for a Korean recognition task, and reported that merging short morphemes together improved results. Szarvas and Furui (2003) used an analyzer to obtain morphemes for a Hungarian task; additionally, morphosyntactic rules were incorporated into the model, allowing only grammatical morpheme combinations. Relative morpheme error rate reductions between 1.7% and 7.2% were obtained.

In contrast to using a morphological analyzer, data-driven algorithms for splitting words into smaller units have also been investigated in speech recognition. Whittaker and Woodland (2000) proposed an algorithm for segmenting a text corpus into fragments that maximize the 2-gram likelihood of the segmented corpus. Small improvements in error rates (2.2% relative) were obtained in an English recognition task when the sub-word model was interpolated with a traditional word-based 3-gram model. Ordelman et al. (2003) presented a method for decomposing Dutch compound words automatically, and reported minor improvements in error rates.

To our knowledge, there is little previous work on basing language modeling and recognition on sub-word units for Finnish LVCSR. Kneissler and Klakow (2001) segmented a corpus into word fragments that maximize the 1-gram likelihood of the corpus. Four different segmentation strategies, requiring various amounts of input from an expert in the Finnish language, were compared in a Finnish dictation task. However, no comparisons to traditional word models were performed.

There are a number of works that aim at learning the morphology of a natural language in a fully unsupervised manner from data. Often words are assumed to consist of one stem typically followed by one suffix. Sometimes prefixes are possible. The work by Goldsmith (2001) exemplifies such an approach and gives a survey of the field. The morphologies discovered by these algorithms have not been applied in speech recognition. It seems that this kind of method is not suitable for agglutinative languages, such as Finnish, where words may consist of lengthy sequences of concatenated morphemes.

Morpheme-like units have also been discovered by algorithms for word segmentation, i.e., algorithms that discover word boundaries in text without blanks. Deligne and Bimbot (1997) derive a model structure that can be used both for word segmentation and for detecting variable-length acoustic units in speech data. Their data-driven units do not, however, produce as good results as conventional word models in recognizing the speech of French weather forecasts. Brent (1999) is mainly interested in the acquisition of a lexicon in an incremental fashion and applies his probabilistic model to the segmentation of transcripts of child-directed speech.

In this work, we make use of word fragments in language modeling and speech recognition. To avoid using a huge word vocabulary consisting of hundreds of thousands of distinct word forms, we split the words into frequently occurring sub-word units. We present an algorithm that discovers such word fragments from a text corpus in a fully unsupervised manner. The fragment inventory, or lexicon, is optimized for the given corpus according to a model based on the information-theoretic Minimum Description Length (MDL) principle (Rissanen, 1989). The resulting fragments are here referred to as statistical morphs, as the boundaries of the fragments often coincide with grammatical morpheme boundaries.
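To make the MDL objective concrete, the two-part cost minimized by this family of models can be written as follows (the notation here is generic, not necessarily that of the article):

    C(\mathcal{L}, \text{corpus}) = \sum_{m \in \mathcal{L}} l(m) \;-\; \sum_{i=1}^{N} \log p(m_i)

where \mathcal{L} is the fragment lexicon, l(m) is the code length needed to spell out fragment m, and the corpus is a sequence of N fragment tokens m_1, ..., m_N coded under a unigram model. A compact lexicon keeps the first term small; a lexicon that models the training text effectively keeps the second term small, which is exactly the trade-off described above.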

The algorithm is motivated by the following features. The resulting model can cover the whole language, achieving a 0% out-of-vocabulary (OOV) rate with a reasonably sized but still apparently meaningful set of word fragments. The degree of word splitting is influenced by the size of the training corpus, and foreign words are split as well, because no language-dependent assumptions are involved. A word can be split into a long sequence of fragments, which makes the model suitable for agglutinative languages. An earlier version of the method has already given good results in Finnish and Turkish recognition tasks (Siivola et al., 2003; Hacioglu et al., 2003).

In this article, we give a detailed description of the algorithm for segmenting a text corpus into statistical morphs, and compare the resulting language models with models based on two alternative methods. The other models are also capable of generating the whole language, albeit in a more simplistic manner: words augmented with phonemes, and fragments based on automatic grammatical analysis augmented with phonemes. The language modeling and recognition performance of n-gram models built using these units are evaluated in two Finnish tasks. We also discuss how the use of fragments affects the decoder of our speech recognition system.
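As a concrete, hedged illustration of the fragment-based setup: the training text is segmented into fragments, an explicit word-boundary symbol is inserted between words so that word boundaries can be restored after decoding (a common arrangement in this line of work), and ordinary n-gram statistics are collected over the fragment stream. The helper names, the boundary symbol "<w>" and the hand-made segmentation below are ours, not the article's.

    from collections import Counter

    def to_fragment_stream(words, segment, boundary="<w>"):
        """Turn a word sequence into a fragment sequence with explicit
        word-boundary symbols, so words can be restored after decoding."""
        stream = [boundary]
        for w in words:
            stream.extend(segment(w))
            stream.append(boundary)
        return stream

    def ngram_counts(stream, order=3):
        """Plain n-gram counts over the fragment stream; a real model
        would add smoothing on top of these counts."""
        counts = Counter()
        for i in range(len(stream) - order + 1):
            counts[tuple(stream[i:i + order])] += 1
        return counts

    # Hand-made segmentation for illustration (not the algorithm's output):
    seg = {"puhelimessa": ["puhelime", "ssa"]}
    stream = to_fragment_stream(["puhelimessa", "on"],
                                lambda w: seg.get(w, [w]))
    # stream == ['<w>', 'puhelime', 'ssa', '<w>', 'on', '<w>']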

In Section 2, we present the two Finnish tasks and the performance measures used in the experiments. Section 3, the central section of the paper, describes the statistical model and algorithm for segmenting a text corpus into word fragments; the alternative approaches for producing complete-coverage vocabularies are also presented, together with comparative cross-entropy experiments. Section 4 describes the recognition system: the acoustic models are presented briefly, with emphasis on the duration models for Finnish phonemes, followed by a description of the decoder. The results of the experiments are given and discussed in Section 5, and Section 6 concludes the work.


The Finnish evaluation task

This section describes the LVCSR task that we propose for evaluating the new language models for Finnish. The language research community for Finnish is rather small, and extensive text and speech corpora for language modeling and speech recognition research do not exist yet. However, the Finnish IT Center for Science has an ongoing project, Kielipankki (Language Bank), which collects Finnish and Swedish text and speech data that can be obtained for research purposes.

Language modeling with data-driven units

In this section, we propose to solve the problem of large word vocabularies by producing a lexicon of word fragments and estimating n-gram language models over these fragments instead of entire words. Our algorithm learns a set of word fragments from a large text corpus or a corpus vocabulary in an unsupervised manner, and utilizes a model that is based on the MDL principle. The algorithm has obvious merits as it is not language-dependent and it relies on a model with a principled formulation, …
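As a minimal sketch of this kind of MDL-driven segmentation, consider the simplified greedy procedure below. The cost function, the fixed per-letter code length and the search strategy are illustrative simplifications of the two-part MDL cost given in the introduction, not the article's exact algorithm.

    import math
    from collections import Counter

    ALPHABET_BITS = math.log2(28)  # crude per-letter code length (a-z, '-', end marker)

    def total_cost(counts):
        """Two-part MDL cost: bits for the corpus as a unigram fragment
        sequence, plus bits to spell out every distinct fragment."""
        n = sum(counts.values())
        corpus = -sum(c * math.log2(c / n) for c in counts.values())
        lexicon = sum((len(m) + 1) * ALPHABET_BITS for m in counts)
        return corpus + lexicon

    def split_once(counts, frag, parts, freq):
        """Counts after replacing every occurrence of `frag` by `parts`."""
        new = counts.copy()
        del new[frag]
        for p in parts:
            new[p] += freq
        return new

    def train(word_freqs, passes=5):
        """Greedily split lexicon entries in two whenever that lowers the
        total description length; iterate until no split helps."""
        counts = Counter(word_freqs)
        for _ in range(passes):
            improved = False
            for frag in list(counts):
                freq = counts.get(frag, 0)
                if freq == 0 or len(frag) < 4:
                    continue
                base = total_cost(counts)
                for i in range(2, len(frag) - 1):
                    trial = split_once(counts, frag, (frag[:i], frag[i:]), freq)
                    cost = total_cost(trial)
                    if cost < base:
                        counts, base, improved = trial, cost, True
                        break  # accept the first improving split, move on
            if not improved:
                break
        return counts

For example, train({"autossa": 8, "talossa": 10, "talossani": 4}) may isolate shared pieces such as "talo" and "ssa" when the counts support it; real training corpora are of course far larger than this toy input.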

Acoustic modeling

In this section, we describe the acoustic models used in the experiments. Since our emphasis is on the language modeling, the acoustic part is not discussed in great detail. The main difference from corresponding English LVCSR systems, in addition to the Finnish phonemes, is the use of monophone models combined with explicit models of phone duration.
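Finnish contrasts short and long phonemes (e.g., "tuli" vs. "tuuli" vs. "tulli"), which is why explicit duration modeling is worthwhile. The sketch below shows one way an explicit duration score can be combined with an HMM path score; the gamma distribution family and the weighting scheme are our assumptions for illustration, not necessarily the article's choices.

    import math

    def gamma_log_pdf(d, shape, scale):
        """Log-density of a gamma distribution, a common parametric
        choice for phone durations measured in frames."""
        return ((shape - 1.0) * math.log(d) - d / scale
                - math.lgamma(shape) - shape * math.log(scale))

    def rescore_path(acoustic_logprob, phone_durations, duration_params, weight=1.0):
        """Add an explicit duration score to a decoded path.

        phone_durations: list of (phone, duration_in_frames) pairs
        duration_params: per-phone (shape, scale) parameters
        weight:          balance against the acoustic score, tuned on
                         development data like the other model weights
        """
        dur_score = sum(gamma_log_pdf(d, *duration_params[ph])
                        for ph, d in phone_durations)
        return acoustic_logprob + weight * dur_score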

Setup

The recognition system was evaluated in both the Book and the News tasks (see Section 2). For each of the lexicon types (statistical morphs, grammatical morphs, words, and words-OOV) and for each n-gram model (orders 3–5), development data were used to tune the weight between the acoustic and language models so as to minimize the phoneme error rate. We did not study n-gram models of order 6 or 7 in the recognition experiments because of their high memory requirements.
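The tuning step itself is a one-dimensional search; the sketch below shows the idea. Here decode and phoneme_error_rate are stand-ins for the real recognizer and scoring routine, and the grid values are illustrative, not those used in the experiments.

    def tune_lm_weight(dev_set, decode, phoneme_error_rate,
                       grid=(6, 8, 10, 12, 14, 16)):
        """Pick the language model weight minimizing the mean phoneme error
        rate on development data; `dev_set` holds (audio, reference) pairs."""
        def mean_per(weight):
            errors = [phoneme_error_rate(decode(audio, weight), reference)
                      for audio, reference in dev_set]
            return sum(errors) / len(errors)
        return min(grid, key=mean_per)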

Conclusions

In this article, we have described and evaluated language models based on the segmentation of text corpora into suitable word fragments by an unsupervised machine learning algorithm. We have shown how such an automatically derived lexicon can be used effectively in language modeling and speech recognition for Finnish, which is a good example of an agglutinative, highly inflecting and compounding language. Due to the huge number of distinct word forms, the traditional methods based on full words …

Acknowledgments

This work was supported by the Academy of Finland in the projects New information processing principles and New adaptive and learning methods in speech recognition. Funding was also provided by the Finnish National Technology Agency (TEKES) and the Graduate School of Language Technology in Finland. We thank the Finnish Federation of the Visually Impaired, the Departments of Speech Science and General Linguistics of the University of Helsinki, and Inger Ekman from the Department of Information …

References (43)

  • Chen, S.F. et al., 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language.
  • Creutz, M., 2003. Unsupervised segmentation of words using prior distributions of morph length and frequency. In: ...
  • Creutz, M., Lagus, K., 2002. Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and ...
  • Creutz, M., Lagus, K., 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using ...
  • Creutz, M., Lindén, K., 2004. Morpheme segmentation gold standards for Finnish and English. Tech. Rep. A77, ...
  • Gales, M.J.F., 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing.
  • Geutner, P., Finke, M., Scheytt, P., 1998. Adaptive vocabularies for transcribing multilingual broadcast news. In: ...
  • Goldsmith, J., 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics.
  • Goodman, J.T., 2001. A bit of progress in language modeling, extended version. Tech. Rep. MSR-TR-2001-72, Microsoft ...
  • Hacioglu, K., Pellom, B., Ciloglu, T., Ozturk, O., Kurimo, M., Creutz, M., 2003. On lexicon creation for Turkish LVCSR. ...
  • Hakulinen, L., 1979. Suomen kielen rakenne ja kehitys (The structure and development of the Finnish language), 4th Ed., ...