ABSTRACT
Thai is an unsegmented language: words are written continuously, with no word delimiters. Indexing Thai text with an inverted index therefore usually requires a word segmentation algorithm to tokenize the text into a series of terms. Recent work on word segmentation has reported Conditional Random Fields (CRFs) as the best-performing machine learning algorithm, outperforming both the dictionary-based approach and other machine learning algorithms. Our main contribution is a new hybrid approach, LearnLexTo, which further improves the CRF model by integrating the dictionary-based approach. The key idea is to resolve the ambiguity problem in the CRF model by using the dictionary-based approach, which relies on a set of valid words. Experimental results show that the proposed hybrid approach yields the highest F1 value, 88.46%, compared with 82.07% for the dictionary-based approach and 85.71% for the CRF model.
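As a rough illustration of the dictionary-based baseline and the F1 measure mentioned above, the following sketch segments text by greedy longest matching against a word set and scores the result at the word level. The abstract does not say which dictionary algorithm or scoring procedure was used, so longest matching, span-based F1, the function names, and the tiny lexicon are all assumptions made for this example. It is not the LearnLexTo method.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest matching against a set of valid words.

    Assumption: this is one common dictionary-based strategy, used here
    only for illustration; the paper's own dictionary method may differ.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking until a lexicon word is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it alone so the loop always advances.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


def word_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def word_f1(predicted, gold):
    """Word-level F1: a predicted word counts only if both boundaries match."""
    p, g = word_spans(predicted), word_spans(gold)
    correct = len(p & g)
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy lexicon; the string can be read as "ตา|กลม" or "ตาก|ลม".
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
predicted = longest_match_segment("ตากลม", lexicon)
print(predicted)
print(word_f1(predicted, ["ตา", "กลม"]))
```

Greedy matching commits to the longest word at each position, so it picks one reading of an ambiguous string without regard to context. Both readings here consist of valid words, which is the kind of ambiguity a segmenter has to resolve.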
Index Terms
- LearnLexTo: a machine-learning based word segmentation for indexing Thai texts