Abstract
Text classification using a small labeled set and a large set of unlabeled documents is a promising way to reduce the labor-intensive and time-consuming effort of labeling training data for building accurate classifiers, since unlabeled data is easy to obtain from the Web. It has been demonstrated in [16] that an unlabeled set can significantly improve classification accuracy when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This assumption does not hold in many real-life applications, where a class may cover documents from several different topics; in such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to address this problem. The method first partitions the training documents hierarchically using hard clustering. After running the expectation-maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes, or partitions, are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method achieves a dramatic gain in classification performance.
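To make the baseline concrete, the following is a minimal sketch of the semi-supervised EM procedure of [16] that the paper builds on: train a naive Bayes model on the labeled documents, then alternate between assigning soft class responsibilities to the unlabeled documents (E-step) and re-estimating the model from all documents (M-step). The data, function names, and smoothing choice are illustrative, not taken from the paper, and this sketch omits the hierarchical partitioning and pruning the paper contributes.

```python
# Illustrative sketch of semi-supervised naive Bayes EM (the [16]-style
# baseline). All identifiers and data below are hypothetical examples.
import math

def train_nb(docs, resps, vocab, classes, smooth=1.0):
    # M-step: estimate class priors and word probabilities from
    # soft class responsibilities, with Laplace smoothing.
    prior = {c: smooth for c in classes}
    count = {c: {w: smooth for w in vocab} for c in classes}
    for doc, resp in zip(docs, resps):
        for c, r in resp.items():
            prior[c] += r
            for w in doc:
                count[c][w] += r
    z = sum(prior.values())
    logprior = {c: math.log(prior[c] / z) for c in classes}
    logword = {}
    for c in classes:
        tot = sum(count[c].values())
        logword[c] = {w: math.log(count[c][w] / tot) for w in vocab}
    return logprior, logword

def e_step(doc, logprior, logword, classes):
    # E-step: posterior responsibilities P(c | doc) under the model.
    scores = {c: logprior[c] + sum(logword[c][w] for w in doc)
              for c in classes}
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def semi_supervised_em(labeled, unlabeled, classes, iters=10):
    vocab = ({w for d, _ in labeled for w in d}
             | {w for d in unlabeled for w in d})
    # Labeled documents keep fixed hard responsibilities.
    hard = [{c: 1.0 if c == y else 0.0 for c in classes}
            for _, y in labeled]
    docs = [d for d, _ in labeled] + list(unlabeled)
    model = train_nb([d for d, _ in labeled], hard, vocab, classes)
    for _ in range(iters):
        resps = hard + [e_step(d, *model, classes) for d in unlabeled]
        model = train_nb(docs, resps, vocab, classes)
    return model

# Tiny synthetic example: two classes, two labeled documents.
labeled = [(["ball", "game"], "sports"), (["vote", "law"], "politics")]
unlabeled = [["ball", "score", "game"], ["law", "senate", "vote"],
             ["game", "score"]]
classes = ["sports", "politics"]
model = semi_supervised_em(labeled, unlabeled, classes)
post = e_step(["score", "game"], *model, classes)
print(max(post, key=post.get))
```

The one-to-one correspondence assumption is visible here: each class is modeled by a single multinomial component, so a class that actually spans several topics is forced into one distribution, which is the failure mode the partitioned approach targets.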
References
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of COLT 1998, pp. 92–100 (1998)
Boyapati, V.: Improving hierarchical text classification using unlabeled data. In: Proceedings of SIGIR (2002)
Bollmann, P., Cherniavsky, V.: Measurement-theoretical investigation of the mz-metric. Information Retrieval Research, 256–267 (1981)
Cohen, W.: Automatically extracting features for concept learning from the Web. In: Proceedings of the ICML (2000)
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of AAAI 1998, pp. 509–516 (1998)
Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
Ghani, R.: Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In: Proceedings of the ICML (2002)
Goldman, S., Zhou, Y.: Enhanced supervised learning with unlabeled data. In: Proceedings of the ICML (2000)
Jaakkola, T., Meila, M., Jebara, T.: Maximum entropy discrimination. Advances in Neural Information Processing Systems 12, 470–476 (2000)
Joachims, T.: Text categorization with Support Vector Machines: learning with many relevant features. In: Proceedings of ECML 1998, pp. 137–142 (1998)
Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of ICML 1999, pp. 200–209 (1999)
Lang, K.N.: Learning to filter netnews. In: Proceedings of ICML, pp. 331–339 (1995)
Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of SIGIR 1994, pp. 3–12 (1994)
McCallum, A., Nigam, K.: A comparison of event models for naïve Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park (1998)
Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Ninth International Conference on Information and Knowledge Management, pp. 86–93 (2000)
Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39(2/3), 103–134 (2000)
Raskutti, B., Ferra, H., Kowalczyk, A.: Combining Clustering and Co-training to Enhance Text Classification Using Unlabelled Data. In: Proceedings of the KDD (2002)
Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval 1, 67–88 (1999)
Zelikovitz, S., Hirsh, H.: Using LSI for text classification in the presence of background text. In: Proceedings of the CIKM (2001)
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Cong, G., Lee, W.S., Wu, H., Liu, B. (2004). Semi-supervised Text Classification Using Partitioned EM. In: Lee, Y., Li, J., Whang, KY., Lee, D. (eds) Database Systems for Advanced Applications. DASFAA 2004. Lecture Notes in Computer Science, vol 2973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24571-1_45
Print ISBN: 978-3-540-21047-4
Online ISBN: 978-3-540-24571-1