Semi-supervised Text Categorization by Considering Sufficiency and Diversity

Li, Shoushan; Lee, Sophia Yat Mei; Gao, Wei; Huang, Chu-Ren

doi:10.1007/978-3-642-41644-6_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 400)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

In text categorization (TC), labeled data is often limited while unlabeled data is ample. This motivates semi-supervised learning for TC, which aims to improve performance by exploiting the knowledge in both labeled and unlabeled data. In this paper, we propose a novel bootstrapping approach to semi-supervised TC. First, we state two basic preferences for successful bootstrapping: sufficiency and diversity. With the diversity preference in mind, we modify the traditional bootstrapping algorithm by training the involved classifiers on random feature subspaces instead of the whole feature space. We then further improve the random feature subspace-based bootstrapping by placing constraints on subspace generation to better satisfy the diversity preference. Experimental evaluation shows the effectiveness of our modified bootstrapping approach in both topic-based and sentiment-based TC tasks.
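
The abstract gives only the outline of the method, so the following is a minimal sketch of bootstrapping with random feature subspaces, not the authors' implementation. Several choices are assumptions made purely for illustration: the nearest-centroid classifier, the use of the distance gap between the two closest centroids as a confidence score, and the names `train_centroids`, `predict`, `subspace_bootstrap`, `subspace_size` and `per_iter`. The constraints on subspace generation mentioned in the abstract are not specified there and are not modeled.

```python
import random


def train_centroids(X, y, feats):
    """Mean vector per class, restricted to the feature indices in `feats`."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += x[f]
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(centroids, x, feats):
    """Return (label, margin): nearest centroid and its gap to the runner-up."""
    dists = sorted(
        (sum((x[f] - c[j]) ** 2 for j, f in enumerate(feats)), label)
        for label, c in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def subspace_bootstrap(X_lab, y_lab, X_unl, subspace_size, per_iter, iters, seed=0):
    """Self-training loop that draws a fresh random feature subspace each round."""
    rng = random.Random(seed)
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)  # work on copies
    n_feats = len(X_lab[0])
    for _ in range(iters):
        if not pool:
            break
        # Assumption: a uniformly sampled subspace, redrawn every iteration.
        feats = rng.sample(range(n_feats), subspace_size)
        model = train_centroids(X_lab, y_lab, feats)
        scored = [(predict(model, x, feats), i) for i, x in enumerate(pool)]
        scored.sort(key=lambda t: t[0][1], reverse=True)  # most confident first
        chosen = scored[:per_iter]
        for (label, _), i in chosen:
            X_lab.append(pool[i])   # move the sample into the labeled set
            y_lab.append(label)     # with its automatically assigned label
        taken = {i for _, i in chosen}
        pool = [x for i, x in enumerate(pool) if i not in taken]
    return X_lab, y_lab


if __name__ == "__main__":
    # Toy bag-of-words vectors: class "a" uses features 0-2, class "b" features 3-5.
    seeds_X = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
    seeds_y = ["a", "b"]
    unlabeled = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0]]
    X_out, y_out = subspace_bootstrap(seeds_X, seeds_y, unlabeled,
                                      subspace_size=3, per_iter=1, iters=5)
    print(list(zip(X_out, y_out)))
```

Because each round sees a different slice of the feature space, the samples that look most confident vary from round to round, which is one plausible reading of how random subspaces serve the diversity preference. A stronger classifier or a different confidence measure could be substituted without changing the loop.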

Author information

Authors and Affiliations

Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, China
Shoushan Li & Wei Gao
CBS, The Hong Kong Polytechnic University, Hong Kong
Shoushan Li, Sophia Yat Mei Lee & Chu-Ren Huang

Editor information

Editors and Affiliations

Soochow University, 1 Shizi Street, 215006, Suzhou, China
Guodong Zhou
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Juanzi Li
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Yansong Feng

About this paper

Cite this paper

Li, S., Lee, S.Y.M., Gao, W., Huang, CR. (2013). Semi-supervised Text Categorization by Considering Sufficiency and Diversity. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_11

DOI: https://doi.org/10.1007/978-3-642-41644-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer Science (R0)
