An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation

Le Trung, Hieu; Le Anh, Vu; Le Trung, Kien

doi:10.1007/978-3-642-12101-2_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5991)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

This paper addresses two main topics: (i) recognizing Vietnamese words and segmenting sentences into words using probabilistic models; and (ii) constructing the optimum probabilistic model through an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves the recognition of new words. Hence, the probabilistic model converges to the optimum one.
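The feedback loop described above (link syllables, re-estimate statistics, recognize more words) can be illustrated with a minimal sketch. The abstract does not specify the statistical functions, so this sketch assumes pointwise mutual information (PMI) over adjacent tokens as a stand-in score; the names `link_pass` and `segment_corpus`, the `threshold` and `min_count` parameters, and the underscore used to join syllables are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def link_pass(sentences, threshold, min_count):
    """One round: count tokens and adjacent pairs, then merge strongly associated pairs."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    def pmi(a, b):
        # PMI of an adjacent pair; an assumed score, not the paper's statistical function.
        p_ab = bigrams[(a, b)] / total_pairs
        return math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    # Pairs seen often enough and scoring above the threshold are treated as linked.
    linked = {
        pair for pair, n in bigrams.items()
        if n >= min_count and pmi(*pair) >= threshold
    }

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            # Greedy left-to-right merge; overlapping candidates favor the leftmost pair.
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in linked:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged, linked


def segment_corpus(sentences, threshold=1.0, min_count=2, max_rounds=5):
    """Repeat link_pass until no new pair is linked or max_rounds is reached."""
    for _ in range(max_rounds):
        sentences, linked = link_pass(sentences, threshold, min_count)
        if not linked:
            break
    return sentences
```

Each round recomputes the counts over the already-linked tokens, so a two-syllable unit found in one round can become part of a longer unit in the next. This mirrors, in a simplified way, the idea that linking syllables changes the statistics used to recognize further words; it makes no claim about the convergence behavior reported in the paper.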

Our experimental corpus is generated from about 250,000 online news articles, which comprise about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 entries.



Author information

Authors and Affiliations

St. Petersburg State University, Saint Petersburg, Russia
Hieu Le Trung
Hoa Sen University, 8. Nguyen Van Trang, Q1, Ho Chi Minh City, Vietnam
Vu Le Anh
Institute of Mathematics, Arndt University, Germany
Kien Le Trung

Editor information

Editors and Affiliations

Institute of Informatics, Wroclaw University of Technology, Str. Wyb. Wyspianskiego 27, 50-370, Poland
Ngoc Thanh Nguyen
Hue University, Str. Le Loi 3, Hue City, Vietnam
Manh Thanh Le
Faculty of Computer Science and Management, Wroclaw University of Technology, Str. Lukasiewicza, 50-370, Wroclaw, Poland
Jerzy Świątek

About this paper

Cite this paper

Le Trung, H., Le Anh, V., Le Trung, K. (2010). An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds) Intelligent Information and Database Systems. ACIIDS 2010. Lecture Notes in Computer Science, vol 5991. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12101-2_21

DOI: https://doi.org/10.1007/978-3-642-12101-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12100-5
Online ISBN: 978-3-642-12101-2
eBook Packages: Computer Science, Computer Science (R0)
