Abstract
Syllabification is the process of dividing a word into its syllables. Syllabification errors are mainly caused by unknown and ambiguous words. This research aims to resolve these problems for the Thai language by exploiting relationships among the characters in a word. A character clustering scheme is proposed that generates units smaller than a syllable, called Thai Minimum Clusters (TMCs), from a word. The TMCs are then merged into syllables using a trigram statistical model. The effectiveness of the proposed technique is evaluated on a standard data set of 77,303 words, on which it achieves 97.61% accuracy.
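The merging step can be illustrated with a minimal sketch. The abstract does not specify the clustering rules, the smoothing method, or the search procedure, so everything below is an assumption made for illustration: the units are romanized placeholders rather than real TMCs, the trigram model is taken over syllables with add-k smoothing (k = 0.01), and candidates are enumerated exhaustively, which is only practical for short words. The names `train`, `log_prob`, `candidates`, and `syllabify` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 0.01  # add-k smoothing constant (an assumption, not from the paper)

def train(corpus):
    """Count syllable trigrams and their two-syllable contexts."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for syllables in corpus:
        padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, len(vocab)

def log_prob(syllables, model):
    """Smoothed trigram log-probability of one candidate syllable sequence."""
    tri, ctx, vocab_size = model
    padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        num = tri[tuple(padded[i - 2:i + 1])] + K
        den = ctx[tuple(padded[i - 2:i])] + K * vocab_size
        total += math.log(num / den)
    return total

def candidates(units):
    """Yield every way of merging adjacent units into syllables."""
    for cuts in product([False, True], repeat=len(units) - 1):
        out, current = [], units[0]
        for unit, cut in zip(units[1:], cuts):
            if cut:
                out.append(current)
                current = unit
            else:
                current += unit
        out.append(current)
        yield out

def syllabify(units, model):
    """Return the highest-scoring merge of the given units."""
    if not units:
        return []
    return max(candidates(units), key=lambda s: log_prob(s, model))

# Toy training data: words already split into (romanized) syllables.
corpus = [["sa", "wat", "dee"], ["sa", "bai"], ["wat", "dee"]]
model = train(corpus)
print(syllabify(["sa", "wa", "t", "dee"], model))
```

Exhaustive enumeration grows exponentially with the number of units; a dynamic-programming search such as Viterbi would be the usual replacement for longer inputs.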
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jucksriporn, C., Sornil, O. (2011). A Minimum Cluster-Based Trigram Statistical Model for Thai Syllabification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_41
DOI: https://doi.org/10.1007/978-3-642-19437-5_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science, Computer Science (R0)