Skip to main content

Semi-supervised Text Categorization by Considering Sufficiency and Diversity

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2013)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 400))

  • 1896 Accesses

Abstract

In text categorization (TC), labeled data is often limited while unlabeled data is ample. This motivates semi-supervised learning for TC to improve the performance by exploring the knowledge in both labeled and unlabeled data. In this paper, we propose a novel bootstrapping approach to semi-supervised TC. First of all, we give two basic preferences, i.e., sufficiency and diversity for a possibly successful bootstrapping. After carefully considering the diversity preference, we modify the traditional bootstrapping algorithm by training the involved classifiers with random feature subspaces instead of the whole feature space. Moreover, we further improve the random feature subspace-based bootstrapping with some constraints on the subspace generation to better satisfy the diversity preference. Experimental evaluation shows the effectiveness of our modified bootstrapping approach in both topic and sentiment-based TC tasks.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Abney, S.: Bootstrapping. In: Proceedings of ACL 2002, pp. 360–367 (2002)

    Google Scholar 

  2. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In: Proceedings of ACL 2007, pp. 440–447 (2007)

    Google Scholar 

  3. Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of COLT 1998, pp. 92–100 (1998)

    Google Scholar 

  4. Braga, I., Monard, M., Matsubara, E.: Combining Unigrams and Bigrams in Semi-supervised Text Classification. In: Proceedings of EPIA 2009: The 14th Portuguese Conference on Artificial Intelligence, pp. 489–500 (2009)

    Google Scholar 

  5. Cong, G., Lee, W.S., Wu, H., Liu, B.: Semi-supervised text classification using partitioned EM. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 482–493. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  6. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to Extract Symbolic Knowledge from the World Wide Web. In: Proceedings of AAAI 1998, pp. 509–516 (1998)

    Google Scholar 

  7. Dasgupta, S., Ng, V.: Mine the Easy, Classify the Hard: A Semi-Supervised Approach to Automatic Sentiment Classification. In: Proceedings of ACL-IJCNLP 2009, pp. 701–709 (2009)

    Google Scholar 

  8. Dempster, A., Laird, N., Rubin, D.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38 (1977)

    MathSciNet  MATH  Google Scholar 

  9. Joachims, T.: Transductive Inference for Text Classification Using Support Vector Machines. In: Proceedings of ICML 1999, pp. 200–209 (1999)

    Google Scholar 

  10. Kullback, S., Leibler, R.: On Information and Sufficiency. Annals of Mathematical Statistics 22(1), 79–86 (1951)

    Article  MathSciNet  Google Scholar 

  11. Li, S., Huang, C., Zhou, G., Lee, S.: Employing Personal/Impersonal Views in Supervised and Semi-supervised Sentiment Classification. In: Proceedings of ACL 2010, pp. 414–423 (2010)

    Google Scholar 

  12. Mallapragada, P., Jin, R., Jain, A., Liu, Y.: SemiBoost: Boosting for Semi-Supervised Learning. IEEE Transaction on Pattern Analysis and Machine Intelligence 31(11), 2000–2014 (2009)

    Article  Google Scholar 

  13. McCallum, A., Nigam, K.: Employing EM and Pool-Based Active Learning for Text Classification. In: Proceedings of ICML 1998, pp. 350–358 (1998)

    Google Scholar 

  14. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39(2/3), 103–134 (2000)

    Article  Google Scholar 

  15. Nigam, K., Ghani, R.: Analyzing the Effectiveness and Applicability of Co-training. In: Proceedings of CIKM 2000, pp. 86–93 (2000)

    Google Scholar 

  16. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification using Machine Learning Techniques. In: Proceedings of EMNLP 2002, pp. 79–86 (2002)

    Google Scholar 

  17. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)

    Article  Google Scholar 

  18. Xia, R., Zong, C., Li, S.: Ensemble of Feature Sets and Classification Algorithms for Sentiment Classification. Information Sciences 181, 1138–1152 (2011)

    Article  Google Scholar 

  19. Yang, Y., Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of the 14th International Conference on Machine Learning, ICML 1997, pp. 412–420 (1997)

    Google Scholar 

  20. Yarowsky, D.: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In: Proceedings of ACL 2005, pp. 189–196 (1995)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, S., Lee, S.Y.M., Gao, W., Huang, CR. (2013). Semi-supervised Text Categorization by Considering Sufficiency and Diversity. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-41644-6_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41643-9

  • Online ISBN: 978-3-642-41644-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics