short-paper

LearnLexTo: a machine-learning based word segmentation for indexing Thai texts

Authors:
Choochart Haruechaiyasak, Sarawoot Kongyoung, Chaianun Damrongrat

National Electronics and Computer Technology Center (NECTEC), Pathumthani, Thailand

iNEWS '08: Proceedings of the 2nd ACM workshop on Improving non english web searching, October 2008, Pages 85–88, https://doi.org/10.1145/1460027.1460042

Published: 30 October 2008

ABSTRACT

Thai is an unsegmented language: words are written continuously, without word delimiters. To index Thai texts with an inverted index, a word segmentation algorithm is usually required to tokenize a text into a series of terms. Recent work on word segmentation has reported Conditional Random Fields (CRFs) as the best-performing machine learning algorithm, outperforming both the dictionary-based approach and other machine learning algorithms. Our main contribution is a new hybrid approach, LearnLexTo, which further improves the CRF model by integrating the dictionary-based approach. The key idea is to resolve the ambiguity problem in the CRF model by using the dictionary-based approach, which relies on a set of valid words. Experimental results show that the proposed hybrid approach yields the highest F1 value, 88.46%, compared with 82.07% for the dictionary-based approach and 85.71% for the CRF model.
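
To make the dictionary-based baseline and the F1 measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes greedy longest matching as the dictionary strategy and scores F1 over exact word spans; both are assumptions, since the abstract does not specify either. The function names and the toy Latin-alphabet lexicon are hypothetical, chosen only to show how a valid word set can still produce an ambiguous split.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest matching against a set of valid words.

    Assumed baseline strategy; the abstract only says "dictionary-based".
    """
    max_len = max(map(len, lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon entry starts at position i (an unknown word).
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1
        tokens.append(text[i:j])
        i = j
    return tokens


def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def word_f1(predicted, gold):
    """F1 over exact word spans (assumed scoring scheme)."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: unsegmented text and a small valid word set.
lexicon = {"the", "cat", "cats", "sat", "at"}
predicted = longest_match_segment("thecatsat", lexicon)
gold = ["the", "cat", "sat"]
print(predicted, word_f1(predicted, gold))
```

In this toy case, longest matching prefers "cats" over "cat" and so misplaces the following boundary, even though every output token is a valid word. This is the kind of ambiguity that a purely dictionary-based segmenter cannot resolve alone, and it motivates combining it with a learned model.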

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection

Published in
iNEWS '08: Proceedings of the 2nd ACM workshop on Improving non english web searching
October 2008
112 pages
ISBN:9781605584164
DOI:10.1145/1460027
Program Chairs:
Fotis Lazarinis, Technological Educational Institute of Mesolonghi, Greece
Efthimis N. Efthimiadis, University of Washington, USA
Jesus Vilares, University of A Coruna, Spain
John I. Tait, Information Retrieval Facility, Austria
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
indexing
tokenization
word segmentation