Abstract
Language models predict the next word in a spoken sentence and thereby improve speech recognition accuracy, among other applications. However, spoken-language domains are numerous, and developers often lack a corpus of sufficient size for the domain at hand. This paper proposes a method of combining two n-gram language models, one built from a very small corpus in the target domain and the other built from a large but less suitable corpus, to obtain a significantly improved language model. The method is based on the observation that a small in-domain corpus yields high-quality n-grams but suffers from severe sparseness, whereas a large out-of-domain corpus provides richer n-gram statistics that are incorrectly biased. Our approach combines the two sets of n-gram statistics by extending Katz's backoff and is therefore called dual-source backoff. We ran experiments with 3-gram language models built from newspaper corpora of several million to tens of millions of words, together with models from smaller broadcast-news corpora; the target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus roughly one thirtieth the size of the newspaper corpus.
Area: Natural Language Processing
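The abstract describes dual-source backoff only at a high level. The sketch below illustrates one plausible reading in Python: a Katz-style trigram backoff that consults a second, out-of-domain model for trigrams unseen in the small in-domain corpus. The class name, the absolute-discounting scheme, and all helper names are illustrative assumptions, not the paper's actual formulation, which is not given in the abstract.

```python
from collections import defaultdict

class DualSourceBackoffTrigram:
    """Sketch of dual-source backoff (assumptions as noted above): trigrams
    seen in the small in-domain corpus use a discounted in-domain estimate;
    the reserved probability mass is redistributed over the large
    out-of-domain model's estimates for trigrams unseen in-domain."""

    def __init__(self, in_counts, out_counts, discount=0.5):
        self.in_counts = in_counts    # {(w1, w2, w3): count}, small in-domain corpus
        self.out_counts = out_counts  # {(w1, w2, w3): count}, large out-of-domain corpus
        self.d = discount             # absolute discount; the paper may discount differently
        self.in_ctx = self._totals(in_counts)
        self.out_ctx = self._totals(out_counts)

    @staticmethod
    def _totals(counts):
        totals = defaultdict(int)
        for gram, c in counts.items():
            totals[gram[:-1]] += c
        return totals

    def prob(self, w, context):
        """P(w | context) for a bigram history context = (w1, w2)."""
        tri = context + (w,)
        c_in = self.in_counts.get(tri, 0)
        if c_in > 0:
            # discounted in-domain estimate; discounting reserves mass for backoff
            return (c_in - self.d) / self.in_ctx[context]
        return self._alpha(context) * self._out_prob(w, context)

    def _out_prob(self, w, context):
        total = self.out_ctx.get(context, 0)
        if total == 0:
            return 0.0  # a full model would recurse to bigram/unigram backoff here
        return self.out_counts.get(context + (w,), 0) / total

    def _alpha(self, context):
        """Backoff weight: in-domain mass reserved for unseen trigrams,
        renormalized over the out-of-domain mass of those same trigrams."""
        if context not in self.in_ctx:
            return 1.0  # history never seen in-domain: trust the large model fully
        seen = [g[-1] for g in self.in_counts if g[:-1] == context]  # O(V) scan, fine for a sketch
        reserved = self.d * len(seen) / self.in_ctx[context]
        out_seen = sum(self._out_prob(w, context) for w in seen)
        return reserved / (1.0 - out_seen) if out_seen < 1.0 else 0.0

# toy usage: tiny in-domain counts plus larger out-of-domain counts
in_c = {("the", "evening", "news"): 3, ("the", "evening", "bulletin"): 1}
out_c = {("the", "evening", "news"): 10, ("the", "evening", "sun"): 40,
         ("the", "evening", "bulletin"): 5}
lm = DualSourceBackoffTrigram(in_c, out_c)
print(lm.prob("news", ("the", "evening")))  # in-domain, discounted: 0.625
print(lm.prob("sun", ("the", "evening")))   # unseen in-domain, backed off: 0.25
```

The distribution sums to one for each history seen in-domain, since the discounted in-domain mass plus the renormalized out-of-domain mass equals the reserved amount. A full 3-gram model, as in the paper's experiments, would also recurse to bigram and unigram levels when a trigram is unseen in both corpora.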
References
Good, I.J.: The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4), 237–264 (1953)
Witten, I.H., Bell, T.C.: The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37-4, 1085–1094 (1991)
Katz, S.M.: Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-35, 400–401 (1987)
Goodman, J.T.: A Bit of Progress in Language Modeling. Computer Speech and Language 15, 403–434 (2001)
Jelinek, F., et al.: Perplexity – A Measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America 62, suppl. 1, S63 (1977)
Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice-Hall, Englewood Cliffs (2000)
Rosenfeld, R.: Adaptive Statistical Language Modeling: A Maximum Entropy Approach, Ph.D. dissertation, Carnegie Mellon University (April 1994)
Akiba, T., Itou, K., Fujii, A., Ishikawa, T.: Selective Backoff smoothing for incorporating grammatical constraints into the n-gram language model. In: Proc. International Conference on Spoken Language Processing, pp. 881–884 (September 2002)
Chen, S.F., et al.: Topic Adaptation for Language Modeling Using Unnormalized Exponential Models. In: Proc. ICASSP 1998, May 12-15, vol. 2, pp. 681–684 (1998)
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cho, S. (2004). Improvement of Language Models Using Dual-Source Backoff. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds) PRICAI 2004: Trends in Artificial Intelligence. Lecture Notes in Computer Science, vol. 3157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28633-2_94
DOI: https://doi.org/10.1007/978-3-540-28633-2_94
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22817-2
Online ISBN: 978-3-540-28633-2