Abstract
Written Thai is one of the languages that do not mark word boundaries explicitly. To discover the meaning of a document, its text must first be segmented into syllables, words, sentences, and paragraphs. This paper develops a novel method for segmenting Thai text by combining a non-dictionary-based technique with a dictionary-based technique. The method first applies Thai grammar rules to the text to identify syllables. A hidden Markov model is then used to merge candidate syllables into words. The identified words are verified against a lexical dictionary, and a decision tree is employed to discover words that the dictionary does not identify. Documents used in the litigation process of Thai court proceedings were used in the experiments. In these experiments, the word segmentation results obtained by the proposed method outperform those obtained by other existing methods.
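The syllable-merging and dictionary-verification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-state tagging scheme (B for a syllable that begins a word, I for one that continues it), every probability, the emission floor, and the three-entry lexicon are assumptions chosen only for illustration, and the rule-based syllable splitter and the decision tree are not sketched.

```python
import math

# Toy parameters, all hypothetical (not taken from the paper).
STATES = ("B", "I")          # B: syllable begins a word, I: continues it
START = {"B": 1.0, "I": 0.0}  # a text must start with a word-initial syllable
TRANS = {"B": {"B": 0.5, "I": 0.5}, "I": {"B": 0.7, "I": 0.3}}
EMIT = {
    "B": {"นัก": 0.4, "ไป": 0.3, "โรง": 0.2, "เรียน": 0.1},
    "I": {"เรียน": 0.8, "นัก": 0.1, "ไป": 0.05, "โรง": 0.05},
}
FLOOR = 1e-6                  # assumed emission probability for unseen syllables
LEXICON = {"นักเรียน", "ไป"}   # toy dictionary used for verification


def lg(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(syllables):
    """Return the most likely B/I tag sequence for a list of syllables."""
    if not syllables:
        return []
    first = syllables[0]
    scores = [{s: lg(START[s]) + lg(EMIT[s].get(first, FLOOR)) for s in STATES}]
    back = []
    for syl in syllables[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            # Best predecessor state for reaching state s at this position.
            best = max(STATES, key=lambda p: prev[p] + lg(TRANS[p][s]))
            cur[s] = prev[best] + lg(TRANS[best][s]) + lg(EMIT[s].get(syl, FLOOR))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    tags = [state]
    for ptr in reversed(back):
        state = ptr[state]
        tags.append(state)
    return tags[::-1]


def merge(syllables, tags):
    """Join syllables into words: a B tag opens a new word, I extends it."""
    words = []
    for syl, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append(syl)
        else:
            words[-1] += syl
    return words


# Syllables as a rule-based splitter might produce them (assumed input).
syllables = ["นัก", "เรียน", "ไป", "โรง", "เรียน"]
tags = viterbi(syllables)
words = merge(syllables, tags)
# Words missing from the lexicon would be passed to a later stage.
unknown = [w for w in words if w not in LEXICON]
print(tags, words, unknown)
```

With these toy numbers the five syllables are grouped into three words, and the last word is reported as unknown only because the toy lexicon omits it. In the method described above, such unverified words would be handed to the decision tree stage.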
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bheganan, P., Nayak, R., Xu, Y. (2009). Thai Word Segmentation with Hidden Markov Model and Decision Tree. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_10
DOI: https://doi.org/10.1007/978-3-642-01307-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science, Computer Science (R0)