Abstract
This paper proposes a method for combining two n-gram language models: one constructed from a very small corpus in the target domain of interest, the other constructed from a large but less adequate corpus, yielding a significantly improved language model. The method is based on the observation that a small corpus from the right domain provides high-quality n-grams but suffers from a serious sparseness problem, while a large corpus from a different domain provides richer n-gram statistics but is inadequately biased. The two n-gram models are combined by extending the idea of Katz's backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words, together with models from much smaller broadcast-news corpora. The target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus about one-thirtieth the size of the newspaper corpus.
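The combination scheme described in the abstract can be illustrated with a small backoff model. Below is a minimal sketch in Python, not the authors' exact formulation: it uses bigrams instead of trigrams and a fixed absolute discount in place of Good-Turing discounting, and the class name DualSourceBackoffLM and the toy corpora are invented for illustration. When an n-gram has been seen in the small in-domain corpus, its discounted in-domain estimate is used; otherwise the model backs off to the large out-of-domain model, scaled by the probability mass the discount set aside.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

class DualSourceBackoffLM:
    """Bigram model combining a small in-domain corpus with a large
    out-of-domain corpus in the style of Katz's backoff.

    Simplifications (assumptions, not the paper's formulation):
    bigrams instead of trigrams, and a fixed absolute discount `d`
    instead of Good-Turing discounting."""

    def __init__(self, in_domain_tokens, out_domain_tokens, d=0.5):
        self.d = d
        self.in_bi = ngram_counts(in_domain_tokens, 2)
        self.in_uni = ngram_counts(in_domain_tokens, 1)
        self.out_bi = ngram_counts(out_domain_tokens, 2)
        self.out_uni = ngram_counts(out_domain_tokens, 1)

    def _estimate(self, bi, uni, w1, w2, discount=0.0):
        """(Discounted) relative-frequency estimate of P(w2 | w1)."""
        c_hist = uni[(w1,)]
        if c_hist == 0:
            return 0.0
        return max(bi[(w1, w2)] - discount, 0.0) / c_hist

    def _reserved_mass(self, w1):
        """Probability mass set aside by discounting the in-domain
        bigrams with history w1; this mass funds the backoff."""
        c_hist = self.in_uni[(w1,)]
        if c_hist == 0:
            return 1.0  # history unseen in-domain: rely on the large model
        n_types = sum(1 for (a, _) in self.in_bi if a == w1)
        return self.d * n_types / c_hist

    def prob(self, w1, w2):
        """P(w2 | w1): in-domain estimate if the bigram was seen
        in-domain, else a scaled out-of-domain estimate."""
        if self.in_bi[(w1, w2)] > 0:
            return self._estimate(self.in_bi, self.in_uni, w1, w2, self.d)
        return self._reserved_mass(w1) * self._estimate(
            self.out_bi, self.out_uni, w1, w2)

# Toy corpora (hypothetical): a tiny broadcast-news sample and a
# larger newspaper-style sample.
small = "the anchor reports breaking news tonight".split()
large = "the newspaper reports on politics and the economy today".split()
lm = DualSourceBackoffLM(small, large)
print(lm.prob("reports", "breaking"))  # seen in-domain: discounted estimate
print(lm.prob("reports", "on"))        # unseen in-domain: backoff to large model
```

Note that a full Katz formulation would also renormalize the backoff distribution over only those continuations unseen in-domain, so that probabilities sum exactly to one; that step is omitted here to keep the sketch short.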
References
Rosenfeld, R.: Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. dissertation, Carnegie Mellon University (April 1994)
Katz, S.M.: Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-35, 400–401 (1987)
Good, I.J.: The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 40(3–4), 237–264 (1953)
Goodman, J.T.: A Bit of Progress in Language Modeling. Computer Speech and Language 15, 403–434 (2001)
Bahl, L., Brown, P., de Souza, P., Mercer, R.: A Tree-Based Statistical Language Model for Natural Language Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 37, 1001–1008 (1989)
Akiba, T., Itou, K., Fujii, A., Ishikawa, T.: Selective Backoff Smoothing for Incorporating Grammatical Constraints into the N-gram Language Model. In: Proc. International Conference on Spoken Language Processing, September 2002, pp. 881–884 (2002)
Chen, S.F., et al.: Topic Adaptation for Language Modeling Using Unnormalized Exponential Models. In: Proc. ICASSP 1998, May 12-15, vol. 2, pp. 681–684 (1998)
Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice-Hall, Englewood Cliffs (2000)
Jelinek, F., et al.: Perplexity – A Measure of the Difficulty of Speech Recognition Tasks. Journal of the Acoustical Society of America 62, S63 (suppl. 1) (1977)
Cho, S.: http://nlp.mju.ac.kr/dualsource/
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cho, S., Kim, S., Park, J., Lee, Y. (2004). Overcoming the Sparseness Problem of Spoken Language Corpora Using Other Large Corpora of Distinct Characteristics. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_48
DOI: https://doi.org/10.1007/978-3-540-24630-5_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive