A Similarity-Based Approach to Data Sparseness Problem of Chinese Language Modeling

Xiao, Jinghui; Liu, Bingquan; Wang, Xiaolong; Li, Bing

doi:10.1007/11579427_77

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3789)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Data sparseness is an inherent and severe problem in language modeling. Smoothing techniques are widely used to address it. However, traditional smoothing techniques rest solely on statistical hypotheses and take no account of linguistic knowledge. This paper introduces semantic information into smoothing and proposes a similarity-based smoothing method built on both a statistical hypothesis and a linguistic hypothesis. An empirical iterative algorithm is presented to optimize the system parameters. Experimental results show that, compared with traditional smoothing techniques, the proposed method greatly improves language model performance.
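The abstract does not give the method's formulas, so the following is only a generic illustration of the idea of similarity-based smoothing, not the authors' algorithm. It sketches a bigram model in which the probability of an unseen word pair borrows mass from contexts judged similar to the observed one. The function names, the hand-written similarity table `similar` (standing in for whatever semantic resource is used), the toy corpus, and the fixed interpolation weight `lam` are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count bigrams (w1 -> w2) from a list of tokenized sentences."""
    bigram = defaultdict(Counter)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigram[w1][w2] += 1
    return bigram


def p_mle(bigram, w1, w2):
    """Maximum-likelihood P(w2 | w1); 0.0 if w1 was never seen as a context."""
    total = sum(bigram[w1].values()) if w1 in bigram else 0
    return bigram[w1][w2] / total if total else 0.0


def p_sim(bigram, similar, w1, w2):
    """Similarity-weighted average of P(w2 | v) over the neighbours v of w1."""
    neighbours = similar.get(w1, {})
    norm = sum(neighbours.values())
    if norm == 0:
        return 0.0
    return sum(s / norm * p_mle(bigram, v, w2) for v, s in neighbours.items())


def p_smoothed(bigram, similar, w1, w2, lam=0.7):
    """Interpolate the MLE estimate with the similarity-based estimate.

    lam is a hypothetical fixed weight; in practice it would be tuned
    on held-out data.
    """
    return lam * p_mle(bigram, w1, w2) + (1 - lam) * p_sim(bigram, similar, w1, w2)


# Toy data (hypothetical): "sip tea" never occurs, but "drink tea" does.
corpus = [["drink", "tea"], ["drink", "water"], ["sip", "coffee"]]
similar = {"sip": {"drink": 1.0}}  # assumed similarity scores
bigram = train_bigrams(corpus)

print(p_mle(bigram, "sip", "tea"))                          # unsmoothed estimate
print(round(p_smoothed(bigram, similar, "sip", "tea"), 2))  # smoothed estimate
```

In this toy setting the unseen pair ("sip", "tea") has zero maximum-likelihood probability but receives a non-zero smoothed probability because "drink", listed as similar to "sip", has been observed before "tea". The sketch makes no attempt to reproduce the paper's parameter optimization.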

Author information

Authors and Affiliations

School of Computer Science and Techniques, Harbin Institute of Technology, Harbin, 150001, China
Jinghui Xiao, Bingquan Liu, Xiaolong Wang & Bing Li

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh
Tecnológico de Monterrey (ITESM), Campus Ciudad de México (CCM), Calle del Puente 222, Col. Ejidos de Huipulco, 14360 DF, Tlalpan, Mexico
Álvaro de Albornoz
Center for Intelligent Systems, Tecnológico de Monterrey, Campus Monterrey, 64849, Monterrey, N.L., Mexico
Hugo Terashima-Marín

Cite this paper

Xiao, J., Liu, B., Wang, X., Li, B. (2005). A Similarity-Based Approach to Data Sparseness Problem of Chinese Language Modeling. In: Gelbukh, A., de Albornoz, Á., Terashima-Marín, H. (eds) MICAI 2005: Advances in Artificial Intelligence. MICAI 2005. Lecture Notes in Computer Science, vol 3789. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11579427_77

DOI: https://doi.org/10.1007/11579427_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29896-0
Online ISBN: 978-3-540-31653-4
eBook Packages: Computer Science, Computer Science (R0)
