Semi-supervised Text Classification Using Partitioned EM

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2973)

Abstract

Text classification using a small labeled set and a large unlabeled set is a promising technique for reducing the labor-intensive and time-consuming effort of labeling training data to build accurate classifiers, since unlabeled data is easy to obtain from the Web. In [16], it was demonstrated that an unlabeled set significantly improves classification accuracy when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. In many real-life applications, a class may cover documents from many different topics, which violates the one-to-one correspondence assumption; in such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve this problem. The method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation-maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes, or partitions, are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method achieves a dramatic gain in classification performance.
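
The sketch below is a minimal illustration of the two building blocks described above, based only on the abstract and not on the authors' implementation. `em_nb` is a simplified version of the semi-supervised EM with multinomial naive Bayes used in [16] (using hard E-step assignments rather than probabilistic labels, for brevity), and `bisect` is one plausible hierarchical hard-clustering step (recursive 2-means) for forming the partitions; the paper's method then runs EM within each partition and prunes the resulting tree using the labeled data. The function names, the choice of 2-means, the dense term-count features, and the iteration count are all assumptions made for illustration.

```python
# Minimal sketch, assuming dense non-negative term-count features (e.g. bag-of-words).
# NOT the authors' code: em_nb mirrors the semi-supervised EM of [16] in simplified
# form; bisect is one plausible hierarchical hard-clustering step for the partitions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_lab, y_lab, X_unlab, n_iter=10):
    """Semi-supervised EM with multinomial naive Bayes, as in [16] (hard E-step)."""
    clf = MultinomialNB().fit(X_lab, y_lab)           # initialize from labeled docs only
    for _ in range(n_iter):
        y_unlab = clf.predict(X_unlab)                # E-step: label the unlabeled docs
        clf = MultinomialNB().fit(                    # M-step: re-estimate from all docs
            np.vstack([X_lab, X_unlab]),
            np.concatenate([y_lab, y_unlab]))
    return clf

def bisect(X, depth):
    """Hierarchical hard clustering: recursively split the documents with 2-means."""
    if depth == 0 or X.shape[0] < 4:
        return [np.arange(X.shape[0])]
    split = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    parts = []
    for k in (0, 1):
        idx = np.flatnonzero(split == k)
        # map child-level indices back to this level's row indices
        parts += [idx[sub] for sub in bisect(X[idx], depth - 1)]
    return parts
```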

References

  1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of COLT 1998, pp. 92–100 (1998)

  2. Boyapati, V.: Improving hierarchical text classification using unlabeled data. In: Proceedings of SIGIR (2002)

  3. Bollmann, P., Cherniavsky, V.: Measurement-theoretical investigation of the MZ-metric. Information Retrieval Research, 256–267 (1981)

  4. Cohen, W.: Automatically extracting features for concept learning from the Web. In: Proceedings of the ICML (2000)

  5. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of AAAI 1998, pp. 509–516 (1998)

  6. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)

  7. Ghani, R.: Combining Labeled and Unlabeled Data for MultiClass Text Categorization. In: Proceedings of the ICML (2002)

  8. Goldman, S., Zhou, Y.: Enhanced supervised learning with unlabeled data. In: Proceedings of the ICML (2000)

  9. Jaakkola, T., Meila, M., Jebara, T.: Maximum entropy discrimination. Advances in Neural Information Processing Systems 12, 470–476 (2000)

  10. Joachims, T.: Text categorization with Support Vector Machines: learning with many relevant features. In: Proceedings of ECML 1998, pp. 137–142 (1998)

  11. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of ICML 1999, pp. 200–209 (1999)

  12. Lang, K.N.: Learning to filter netnews. In: Proceedings of ICML, pp. 331–339 (1995)

  13. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of SIGIR 1994, pp. 3–12 (1994)

  14. McCallum, A., Nigam, K.: A comparison of event models for naïve Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park (1998)

  15. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Ninth International Conference on Information and Knowledge Management, pp. 86–93 (2000)

  16. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39(2/3), 103–134 (2000)

  17. Raskutti, B., Ferra, H., Kowalczyk, A.: Combining Clustering and Co-training to Enhance Text Classification Using Unlabelled Data. In: Proceedings of the KDD (2002)

  18. Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval 1, 67–88 (1999)

  19. Zelikovitz, S., Hirsh, H.: Using LSI for text classification in the presence of background text. In: Proceedings of the CIKM (2001)


Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cong, G., Lee, W.S., Wu, H., Liu, B. (2004). Semi-supervised Text Classification Using Partitioned EM. In: Lee, Y., Li, J., Whang, K.Y., Lee, D. (eds) Database Systems for Advanced Applications. DASFAA 2004. Lecture Notes in Computer Science, vol 2973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24571-1_45

  • DOI: https://doi.org/10.1007/978-3-540-24571-1_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21047-4

  • Online ISBN: 978-3-540-24571-1
