Supervised and Unsupervised Learning Algorithms for Thai Web Pages Identification

  • Conference paper
PRICAI 2000 Topics in Artificial Intelligence (PRICAI 2000)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1886)

Abstract

The paper presents a learning method, called iterative cross-training (ICT), for identifying Thai Web pages. Our method combines two classifiers, a word segmentation classifier and a naive Bayes classifier, that use unlabeled examples to train each other. We compare ICT against other supervised and unsupervised learning methods: a supervised word segmentation classifier (S-Word), a supervised naive Bayes classifier (S-Bayes), an unsupervised naive Bayes classifier using the EM algorithm (U-Bayes-EM), and a co-training-style classifier (Co-Training). The experimental results show that ICT gives the best performance, followed by S-Bayes, Co-Training, U-Bayes-EM, and S-Word.
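
The abstract does not spell out how the two classifiers train each other, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a dictionary-coverage score as a stand-in for the word segmentation classifier, a small token-level naive Bayes model, and romanized toy tokens in place of Thai text. All function names, the lexicon, the starting threshold, and the stopping rule are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def dict_score(doc, lexicon):
    """Fraction of tokens found in the lexicon (assumed stand-in for word segmentation)."""
    tokens = doc.split()
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

def fit_threshold(scores, labels):
    """Pick the score cut-off that best reproduces the given labels."""
    def agreement(th):
        return sum((s >= th) == y for s, y in zip(scores, labels))
    return max(sorted(set(scores)), key=agreement)

def train_nb(docs, labels):
    """Multinomial naive Bayes over tokens; returns raw counts for smoothing later."""
    counts = {True: Counter(), False: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    vocab = set(counts[True]) | set(counts[False])
    return counts, Counter(labels), vocab

def predict_nb(model, doc):
    """Return the class with the higher add-one-smoothed log posterior."""
    counts, n_docs, vocab = model
    total_docs = sum(n_docs.values())
    best, best_lp = None, -math.inf
    for y in (True, False):
        lp = math.log((n_docs[y] + 1) / (total_docs + 2))
        denom = sum(counts[y].values()) + len(vocab) + 1
        for tok in doc.split():
            lp += math.log((counts[y][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

def iterative_cross_training(docs, lexicon, threshold=0.5, rounds=5):
    """Alternate: score-based labels train naive Bayes, whose labels re-fit the threshold."""
    scores = [dict_score(d, lexicon) for d in docs]
    labels = [s >= threshold for s in scores]          # initial guess from classifier 1
    for _ in range(rounds):
        nb = train_nb(docs, labels)                    # classifier 2 learns from classifier 1
        nb_labels = [predict_nb(nb, d) for d in docs]
        threshold = fit_threshold(scores, nb_labels)   # classifier 1 learns from classifier 2
        new_labels = [s >= threshold for s in scores]
        if new_labels == labels:                       # stop once the labels are stable
            break
        labels = new_labels
    return labels, threshold

# Toy, unlabeled "pages"; romanized tokens are placeholders for Thai words.
lexicon = {"sawasdee", "khrap", "baan"}
docs = [
    "sawasdee khrap baan rao",
    "baan sawasdee khrap",
    "hello world home page",
    "welcome home world",
]
labels, threshold = iterative_cross_training(docs, lexicon)
print(labels, threshold)
```

No labeled examples are used anywhere in the loop: each classifier only ever sees the other's predictions, which is the property the abstract attributes to ICT. A real system would need a genuine Thai word segmenter and a held-out labeled set for evaluation.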

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kijsirikul, B., Sasiphongpairoege, P., Soonthornphisaj, N., Meknavin, S. (2000). Supervised and Unsupervised Learning Algorithms for Thai Web Pages Identification. In: Mizoguchi, R., Slaney, J. (eds) PRICAI 2000 Topics in Artificial Intelligence. PRICAI 2000. Lecture Notes in Computer Science, vol 1886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44533-1_69

  • DOI: https://doi.org/10.1007/3-540-44533-1_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67925-7

  • Online ISBN: 978-3-540-44533-3

  • eBook Packages: Springer Book Archive
