A Trigram Statistical Language Model Algorithm for Chinese Word Segmentation

Mao, Jun; Cheng, Gang; He, Yanxiang; Xing, Zehuan

doi:10.1007/978-3-540-73814-5_26

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4613)

Included in the following conference series:

International Workshop on Frontiers in Algorithmics

Abstract

We address the problem of segmenting Chinese text into words and propose a segmentation algorithm based on a trigram statistical language model. We discuss why statistical language models are well suited to Chinese word segmentation. In particular, we address the search problem that a trigram model introduces, which often leads to low performance. Finally, we discuss the identification of out-of-vocabulary (OOV) words and merge it into the trigram-based method to improve segmentation accuracy.
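
The paper's own algorithm is not reproduced on this page, so the following is only a minimal, generic sketch of trigram-based segmentation, not the authors' method. It scores each candidate segmentation by the product of P(w_i | w_{i-2}, w_{i-1}) and searches with dynamic programming over the last two words. The toy corpus, the add-one smoothing, the `max_len` cap, the single-character fallback for unknown words, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

# Hypothetical pre-segmented training sentences, for illustration only.
CORPUS = [
    ["我", "喜欢", "北京"],
    ["我", "喜欢", "研究", "生命"],
    ["研究生", "喜欢", "北京"],
]


def train(corpus):
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sent in corpus:
        words = [BOS, BOS] + sent + [EOS]
        vocab.update(words)
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            ctx[tuple(words[i - 2:i])] += 1
    return tri, ctx, vocab


def logp(w, u, v, tri, ctx, vocab):
    """Add-one smoothed log P(w | u, v) (smoothing choice is an assumption)."""
    return math.log((tri[(u, v, w)] + 1) / (ctx[(u, v)] + len(vocab)))


def segment(text, tri, ctx, vocab, max_len=4):
    """Return the highest-scoring word sequence for text under the trigram model."""
    lexicon = vocab - {BOS, EOS}
    n = len(text)
    # best[i][(u, v)] = (log score, backpointer) for the best segmentation
    # of text[:i] whose last two words are u and v.
    best = [{} for _ in range(n + 1)]
    best[0][(BOS, BOS)] = (0.0, None)
    for i in range(n):
        for (u, v), (score, _) in list(best[i].items()):
            for j in range(i + 1, min(n, i + max_len) + 1):
                w = text[i:j]
                # Multi-character candidates must be in the lexicon; single
                # characters are always allowed so every input has a path.
                if j - i > 1 and w not in lexicon:
                    continue
                s = score + logp(w, u, v, tri, ctx, vocab)
                if (v, w) not in best[j] or s > best[j][(v, w)][0]:
                    best[j][(v, w)] = (s, (i, (u, v)))
    # Pick the final state that is best once the end-of-sentence term is added.
    state = max(
        best[n],
        key=lambda st: best[n][st][0] + logp(EOS, st[0], st[1], tri, ctx, vocab),
    )
    # Follow backpointers from the end of the text to recover the words.
    words, i = [], n
    while i > 0:
        words.append(state[1])
        i, state = best[i][state][1]
    return words[::-1]


if __name__ == "__main__":
    tri, ctx, vocab = train(CORPUS)
    print(segment("我喜欢北京", tri, ctx, vocab))
```

Keeping only the best score per (position, last two words) state is what keeps the search polynomial rather than enumerating every segmentation. The single-character fallback here is a crude stand-in and should not be read as the OOV identification the abstract refers to.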

Author information

Authors and Affiliations

Computer School, Wuhan University, Wuhan 430072, P. R. China
Jun Mao, Gang Cheng & Yanxiang He
Department of Linguistics, Central China Normal University, Wuhan 430079, P. R. China
Zehuan Xing

Editor information

Franco P. Preparata, Qizhi Fang

About this paper

Cite this paper

Mao, J., Cheng, G., He, Y., Xing, Z. (2007). A Trigram Statistical Language Model Algorithm for Chinese Word Segmentation. In: Preparata, F.P., Fang, Q. (eds) Frontiers in Algorithmics. FAW 2007. Lecture Notes in Computer Science, vol 4613. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73814-5_26

DOI: https://doi.org/10.1007/978-3-540-73814-5_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73813-8
Online ISBN: 978-3-540-73814-5
eBook Packages: Computer Science, Computer Science (R0)
