Supervised and Unsupervised Learning Algorithms for Thai Web Pages Identification

Kijsirikul, Boonserm; Sasiphongpairoege, Puay; Soonthornphisaj, Nuanwan; Meknavin, Surapant

doi:10.1007/3-540-44533-1_69

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1886)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

The paper presents a learning method, called iterative cross-training (ICT), for identifying Thai Web pages. The method combines two classifiers, a word segmentation classifier and a naive Bayes classifier, that use unlabeled examples to train each other. We compare ICT against other supervised and unsupervised learning methods: a supervised word segmentation classifier (S-Word), a supervised naive Bayes classifier (S-Bayes), an unsupervised naive Bayes classifier using the EM algorithm (U-Bayes-EM), and a co-training-style classifier (Co-Training). The experimental results show that ICT gives the best performance, followed by S-Bayes, Co-Training, U-Bayes-EM, and S-Word.
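
The abstract does not spell out how the two classifiers exchange labels, so the following is only a minimal sketch of the general idea of two classifiers training each other on unlabeled pages. Everything in it is an assumption made for illustration, not the authors' implementation: the names `SegmenterClassifier`, `NaiveBayes`, and `iterative_cross_training`, the lexicon-coverage score standing in for a word segmentation classifier, the character-bigram features, the threshold update rule, the number of rounds, and the toy data.

```python
import math
from collections import Counter

def coverage(doc, lexicon):
    """Fraction of characters covered by greedy longest-match lexicon words."""
    i, covered = 0, 0
    max_len = max(map(len, lexicon))
    while i < len(doc):
        for n in range(min(max_len, len(doc) - i), 0, -1):
            if doc[i:i + n] in lexicon:
                covered += n
                i += n
                break
        else:
            i += 1  # no lexicon word starts here; skip one character
    return covered / max(len(doc), 1)

class SegmenterClassifier:
    """Hypothetical stand-in: a page is 'Thai' if enough of it segments into lexicon words."""
    def __init__(self, lexicon, threshold=0.3):
        self.lexicon, self.threshold = set(lexicon), threshold

    def predict(self, doc):
        return coverage(doc, self.lexicon) >= self.threshold

    def fit(self, docs, labels):
        # Assumed update rule: midpoint between the lowest positive
        # and the highest negative coverage under the given labels.
        pos = [coverage(d, self.lexicon) for d, y in zip(docs, labels) if y]
        neg = [coverage(d, self.lexicon) for d, y in zip(docs, labels) if not y]
        if pos and neg:
            self.threshold = (min(pos) + max(neg)) / 2

def bigrams(doc):
    return [doc[i:i + 2] for i in range(len(doc) - 1)]

class NaiveBayes:
    """Naive Bayes over character bigrams with add-one smoothing."""
    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = Counter(labels)
        for d, y in zip(docs, labels):
            self.counts[y].update(bigrams(d))
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def predict(self, doc):
        def log_post(y):
            total = sum(self.counts[y].values()) + len(self.vocab) + 1
            prior = (self.n_docs[y] + 1) / (sum(self.n_docs.values()) + 2)
            return math.log(prior) + sum(
                math.log((self.counts[y][g] + 1) / total) for g in bigrams(doc))
        return log_post(True) > log_post(False)

def iterative_cross_training(unlabeled, lexicon, rounds=3):
    """Each classifier is retrained on the labels the other one assigns."""
    seg, nb = SegmenterClassifier(lexicon), NaiveBayes()
    for _ in range(rounds):
        nb.fit(unlabeled, [seg.predict(d) for d in unlabeled])
        seg.fit(unlabeled, [nb.predict(d) for d in unlabeled])
    return seg, nb

# Toy unlabeled "pages" (illustrative strings only, not data from the paper).
thai_pages = ["ภาษาไทย", "ประเทศไทย", "สวัสดีครับ", "ขอบคุณครับ"]
other_pages = ["hello world", "machine learning", "web pages", "thank you"]
lexicon = {"ภาษา", "ไทย", "ประเทศ", "ครับ", "สวัสดี"}

seg, nb = iterative_cross_training(thai_pages + other_pages, lexicon)
for page in ["ขอบคุณ", "hello"]:
    print(page, "segmenter:", seg.predict(page), "naive Bayes:", nb.predict(page))
```

On this toy data the two classifiers see the same page differently: the segmenter depends on lexicon coverage, while the naive Bayes model depends on character statistics learned from the segmenter's labels, which is why a page the lexicon does not cover can still be labeled Thai by the second classifier.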

Author information

Authors and Affiliations

Department of Computer Engineering, Chulalongkorn University, Phathumwan, Bangkok, 10330, Thailand
Boonserm Kijsirikul, Puay Sasiphongpairoege & Nuanwan Soonthornphisaj
Siamguru Co., Ltd., 2922/103 Charn Issara Tower II, 126-7 New Petchburi Rd., Bangkapi, Huay Kwang, Bangkok, 10310, Thailand
Surapant Meknavin

Editor information

Editors and Affiliations

Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
Riichiro Mizoguchi
Computer Sciences Laboratory, Research School of Information Sciences and Engineering, Australian National University, Canberra, ACT, 0200, Australia
John Slaney

About this paper

Cite this paper

Kijsirikul, B., Sasiphongpairoege, P., Soonthornphisaj, N., Meknavin, S. (2000). Supervised and Unsupervised Learning Algorithms for Thai Web Pages Identification. In: Mizoguchi, R., Slaney, J. (eds) PRICAI 2000 Topics in Artificial Intelligence. PRICAI 2000. Lecture Notes in Computer Science, vol 1886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44533-1_69

DOI: https://doi.org/10.1007/3-540-44533-1_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67925-7
Online ISBN: 978-3-540-44533-3
eBook Packages: Springer Book Archive
