Text Classification Using Web Corpora and EM Algorithms

Hung, Chen-Ming; Chien, Lee-Feng

doi:10.1007/978-3-540-31871-2_2


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Abstract

Insufficient and irrelevant training corpora are a persistent obstacle in text classification. This paper proposes a Web-based approach that trains a text classifier without requiring labeled training data. Under the assumption that each class of concern is associated with several relevant concept classes, the approach first applies a greedy EM algorithm to find a proper number of concept clusters for each class, by clustering the documents retrieved when the class name itself is sent to Web search engines. It then retrieves more training data through keywords generated from the clusters and sets the initial parameters of the text classifier. Finally, it refines the initial classifier with an augmented EM algorithm. Experimental results show considerable potential for the proposed approach in creating text classifiers without labeled training data.
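
The last two steps of the abstract (setting initial classifier parameters from keywords, then refining them with EM) can be illustrated with a small sketch. This is not the paper's augmented EM algorithm, whose details are not given here; it is the standard EM loop for a multinomial naive Bayes classifier over unlabeled documents, offered only as an assumption about the general shape of such a procedure. The function names `train_em` and `classify`, the toy documents, the seed keywords, and the smoothing constant `alpha` are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def train_em(docs, seeds, n_iter=10, alpha=1.0):
    """Fit a naive Bayes classifier on unlabeled docs with EM.

    docs  : list of token lists (no labels)
    seeds : {class_name: [keywords]} used only to set initial parameters
    alpha : Laplace smoothing constant (an illustrative choice)
    """
    classes = sorted(seeds)
    vocab = sorted({w for d in docs for w in d}
                   | {w for kws in seeds.values() for w in kws})
    counts = [Counter(d) for d in docs]

    # Initial parameters: uniform priors, word distributions from seed keywords.
    priors = {c: 1.0 / len(classes) for c in classes}
    word_probs = {}
    for c in classes:
        seed_counts = Counter(seeds[c])
        total = sum(seed_counts.values()) + alpha * len(vocab)
        word_probs[c] = {w: (seed_counts[w] + alpha) / total for w in vocab}

    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior P(class | doc) under the current parameters.
        posteriors = []
        for cnt in counts:
            log_p = {c: math.log(priors[c])
                     + sum(n * math.log(word_probs[c][w]) for w, n in cnt.items())
                     for c in classes}
            top = max(log_p.values())  # subtract the max for numerical stability
            unnorm = {c: math.exp(lp - top) for c, lp in log_p.items()}
            z = sum(unnorm.values())
            posteriors.append({c: u / z for c, u in unnorm.items()})

        # M-step: re-estimate priors and word distributions from soft counts.
        for c in classes:
            resp = [p[c] for p in posteriors]
            priors[c] = (sum(resp) + alpha) / (len(docs) + alpha * len(classes))
            weighted = Counter()
            for r, cnt in zip(resp, counts):
                for w, n in cnt.items():
                    weighted[w] += r * n
            total = sum(weighted.values()) + alpha * len(vocab)
            word_probs[c] = {w: (weighted[w] + alpha) / total for w in vocab}

    return priors, word_probs, posteriors


def classify(tokens, priors, word_probs):
    """Return the most probable class; words outside the vocabulary are skipped."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in word_probs[c]:
                score += math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)


# Toy stand-ins for retrieved Web documents (hypothetical data).
docs = [
    ["goal", "match", "team", "league"],
    ["team", "coach", "goal", "season"],
    ["stock", "market", "bank", "shares"],
    ["bank", "interest", "market", "loan"],
    ["league", "season", "match"],
    ["shares", "loan", "interest"],
]
seeds = {"sports": ["goal", "team"], "finance": ["bank", "market"]}

priors, word_probs, posteriors = train_em(docs, seeds)
print(classify(["match", "league"], priors, word_probs))
print(classify(["loan", "shares"], priors, word_probs))
```

In this toy run, the last two documents contain none of the seed keywords, so the initial classifier cannot separate them. The EM iterations pick up words such as "match" and "loan" through their co-occurrence with seed words in other documents, which is the kind of refinement the abstract attributes to its EM stage.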



Author information

Authors and Affiliations

Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chen-Ming Hung & Lee-Feng Chien
Department of Information Management, National Taiwan University, Taipei, Taiwan
Lee-Feng Chien


Editor information

Editors and Affiliations

School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng
The Key Laboratory of Power System Protection and Dynamic Security Monitoring and Control under Ministry of Education, North China Electric Power University, Zhuxinzhuang Dewai, 102206, Beijing, China
Ming Zhou
Department of Systems Engineering and Engineering Management, Shatin, The Chinese University of Hong Kong, Hong Kong, N.T.
Kam-Fai Wong
5F, Beijing Sigma Center, Microsoft Research Asia, No. 49 Zhichun Road Haidian District, 100080, Beijing, China
Hong-Jiang Zhang


About this paper

Cite this paper

Hung, CM., Chien, LF. (2005). Text Classification Using Web Corpora and EM Algorithms. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_2


DOI: https://doi.org/10.1007/978-3-540-31871-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)
