Abstract
Insufficient or irrelevant training corpora are a persistent obstacle in text classification. This paper proposes a Web-based approach that trains a text classifier without requiring labeled training data. Under the assumption that each class of concern is associated with several relevant concept classes, the approach first applies a greedy EM algorithm to find a proper number of concept clusters for each class, by clustering the documents retrieved when the class name itself is sent to Web search engines. It then retrieves more training data using keywords generated from the clusters and sets the initial parameters of the text classifier. Finally, it refines the initial classifier with an augmented EM algorithm. Experimental results show the great potential of the proposed approach for creating text classifiers without labeled training data.
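The first two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the mixture model, the stopping criterion, or how a new cluster is seeded, so the multinomial mixture, the BIC stopping rule, the worst-explained-document seeding, and all function names below are assumptions made for illustration. The toy documents stand in for search-engine results retrieved for an ambiguous class name.

```python
import math


def fit_mixture(docs, vocab, init_labels, k, iters=50, alpha=0.1):
    """EM for a k-component multinomial mixture over bag-of-words documents."""
    n = len(docs)
    # Start from hard assignments (one-hot responsibilities).
    resp = [[1.0 if init_labels[i] == j else 0.0 for j in range(k)] for i in range(n)]
    for _ in range(iters):
        # M-step: smoothed cluster priors and per-cluster word distributions.
        prior = [(sum(r[j] for r in resp) + alpha) / (n + k * alpha) for j in range(k)]
        theta = []
        for j in range(k):
            counts = {w: alpha for w in vocab}
            for doc, r in zip(docs, resp):
                for w in doc:
                    counts[w] += r[j]
            total = sum(counts.values())
            theta.append({w: c / total for w, c in counts.items()})
        # E-step: responsibilities and per-document log-likelihoods (log-sum-exp).
        doc_ll = []
        for i, doc in enumerate(docs):
            logp = [math.log(prior[j]) + sum(math.log(theta[j][w]) for w in doc)
                    for j in range(k)]
            m = max(logp)
            z = sum(math.exp(v - m) for v in logp)
            resp[i] = [math.exp(v - m) / z for v in logp]
            doc_ll.append(m + math.log(z))
    return sum(doc_ll), doc_ll, resp, theta


def greedy_cluster_count(docs, max_k=5):
    """Add one cluster at a time; stop when BIC stops improving (assumed criterion)."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    labels, best = [0] * n, None
    for k in range(1, max_k + 1):
        ll, doc_ll, resp, theta = fit_mixture(docs, vocab, labels, k)
        # Free parameters: k*(V-1) word probabilities plus k-1 mixing weights.
        n_params = k * (len(vocab) - 1) + (k - 1)
        bic = -2.0 * ll + n_params * math.log(n)
        if best is not None and bic >= best["bic"]:
            break  # the extra cluster does not pay for its parameters
        hard = [max(range(k), key=lambda j: r[j]) for r in resp]
        best = {"k": k, "bic": bic, "labels": hard, "theta": theta}
        # Seed the next cluster on the document the current mixture explains worst.
        worst = min(range(n), key=lambda i: doc_ll[i])
        labels = list(hard)
        labels[worst] = k
    return best


def top_keywords(theta, m=2):
    """Highest-probability words per cluster, usable as follow-up search queries."""
    return [sorted(t, key=t.get, reverse=True)[:m] for t in theta]


# Toy stand-in for pages retrieved for one ambiguous class name.
car_docs = [["engine", "car", "speed"], ["car", "engine", "dealer"], ["speed", "car", "dealer"]]
cat_docs = [["cat", "jungle", "prey"], ["jungle", "cat", "habitat"], ["prey", "cat", "habitat"]]
docs = (car_docs + cat_docs) * 2

result = greedy_cluster_count(docs)
print(result["k"], top_keywords(result["theta"]))
```

In the approach described above, the keywords from each cluster would be sent back to the search engines to gather more training documents, and the fitted cluster parameters would initialize the classifier. The sketch stops before that step and does not cover the augmented EM refinement.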
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hung, CM., Chien, LF. (2005). Text Classification Using Web Corpora and EM Algorithms. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_2
DOI: https://doi.org/10.1007/978-3-540-31871-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)