DOI: 10.1145/1460027.1460042
short-paper

LearnLexTo: a machine-learning based word segmentation for indexing Thai texts

Published: 30 October 2008

ABSTRACT

Thai is an unsegmented language: words are written continuously, without word delimiters. To index Thai texts with an inverted index, a word segmentation algorithm is usually required to tokenize a text into a series of terms. Recent work on word segmentation has reported Conditional Random Fields (CRFs) as the best-performing machine learning algorithm, outperforming both the dictionary-based approach and other machine learning algorithms. Our main contribution is a new hybrid approach, LearnLexTo, which further improves the CRF model by integrating the dictionary-based approach. The key idea is to resolve ambiguity in the CRF model by using the dictionary-based approach, which relies on a valid word set. Experimental results show that the proposed hybrid approach yields the highest F1 value, 88.46%, compared with 82.07% for the dictionary-based approach and 85.71% for the CRF model.
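To make the indexing pipeline concrete, the following is a minimal sketch of a dictionary-based segmenter feeding an inverted index. It assumes greedy left-to-right longest matching, which is one common dictionary-based strategy; the abstract does not say which dictionary algorithm the paper uses, and this is not the LearnLexTo hybrid itself. The function names, the four-word lexicon, and the sample documents are all hypothetical.

```python
from collections import defaultdict


def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest matching against a valid word set.

    Assumption: this is a generic dictionary-based baseline, not the
    algorithm from the paper. Characters that start no lexicon word are
    emitted as single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon word starts here
        tokens.append(text[i:j])
        i = j
    return tokens


def build_inverted_index(docs, lexicon):
    """Map each segmented term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in longest_match_segment(text, lexicon):
            index[term].add(doc_id)
    return index


# Hypothetical toy lexicon and documents.
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
docs = {1: "ตากลม", 2: "ลมตา"}

tokens = longest_match_segment(docs[1], lexicon)
index = build_inverted_index(docs, lexicon)
```

The string "ตากลม" can be read as "ตา" + "กลม" or as "ตาก" + "ลม". Greedy longest matching always picks the second reading, so the term "กลม" never enters the index for document 1. The choice of segmentation therefore directly determines which queries can retrieve a document.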


Published in

iNEWS '08: Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching
October 2008
112 pages
ISBN: 9781605584164
DOI: 10.1145/1460027

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States
