Related Text Discovery Through Consecutive Filtering and Supervised Learning

Wu, Daqing; Ma, Jinwen

doi:10.1007/978-3-030-01313-4_22

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 539)

Included in the following conference series:

International Conference on Intelligence Science

Abstract

In a related-text (topic-based) discovery task, there are usually only a few related, or positive, texts against a large number of unrelated, or negative, texts. The two classes can therefore be strongly imbalanced, which makes classification or detection difficult because the recall of the positive class is very low. To overcome this difficulty, we propose a consecutive filtering and supervised learning method, i.e., consecutive supervised bagging. In each consecutive learning stage, we first delete some negative texts that the classifier trained in the previous stage identifies as negative with high confidence, and then train a new classifier on the retained texts. We repeat this procedure until the ratio of negative to positive texts becomes reasonable, and finally obtain a tree-like filtering and recognition system. Experimental results on the 20NewsGroups (English) and THUCNews (Chinese) data sets demonstrate that the proposed method performs much better than AdaBoost and Rocchio.
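The stage-wise procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the base classifier, the confidence measure, or the stopping ratio, so the toy centroid-difference scorer and the names `train`, `consecutive_filtering`, `predict`, `target_ratio`, and `drop_frac` are all assumptions made for illustration. The bagging component is omitted. Texts are assumed to be pre-tokenized lists of words.

```python
from collections import Counter


def centroid(docs):
    """Average bag-of-words vector over a list of tokenized documents."""
    total = Counter()
    for tokens in docs:
        total.update(tokens)
    n = max(len(docs), 1)
    return {word: count / n for word, count in total.items()}


def train(pos, neg):
    """Toy stand-in classifier (an assumption, not the paper's learner).

    Returns a scoring function: higher scores look more related (positive),
    lower scores look more confidently unrelated (negative).
    """
    pos_c, neg_c = centroid(pos), centroid(neg)

    def score(tokens):
        counts = Counter(tokens)
        return sum(c * (pos_c.get(w, 0.0) - neg_c.get(w, 0.0))
                   for w, c in counts.items())

    return score


def consecutive_filtering(pos, neg, target_ratio=2.0, drop_frac=0.5):
    """Repeatedly drop the most confidently negative texts and retrain.

    target_ratio and drop_frac are hypothetical knobs: the loop stops once
    len(retained negatives) / len(pos) <= target_ratio.
    """
    if not pos:
        raise ValueError("need at least one positive text")
    filters = []  # one (classifier, rejection threshold) pair per stage
    retained = list(neg)
    clf = train(pos, retained)
    keep_min = int(target_ratio * len(pos))
    while len(retained) > target_ratio * len(pos):
        ranked = sorted(retained, key=clf)  # most confidently negative first
        n_drop = min(max(1, int(len(ranked) * drop_frac)),
                     len(ranked) - keep_min)
        # Remember the score of the last dropped text as this stage's cutoff.
        filters.append((clf, clf(ranked[n_drop - 1])))
        retained = ranked[n_drop:]
        clf = train(pos, retained)  # retrain on the retained texts
    return filters, clf, retained


def predict(tokens, filters, final_clf):
    """Pass a text through the filtering stages, then the final classifier."""
    for clf, threshold in filters:
        if clf(tokens) <= threshold:
            return False  # rejected early as confidently negative
    return final_clf(tokens) > 0
```

Calling `consecutive_filtering(pos, neg)` returns the per-stage filters, the final classifier, and the retained negatives; `predict` then applies the stages in order, which loosely mirrors the filtering-then-recognition structure the abstract mentions. A real system would replace the scorer with a trained learner and tune the cutoffs on held-out data.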

Acknowledgment

This work is supported by the Natural Science Foundation of China for Grant 61171138. We also acknowledge Zhengzhou Shuneng Science and Technology Limited Company for the contribution of the data set THUCNews.

Author information

Authors and Affiliations

Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, People’s Republic of China
Daqing Wu & Jinwen Ma

Corresponding author

Correspondence to Jinwen Ma.

Editor information

Editors and Affiliations

Chinese Academy of Sciences, Beijing, China
Zhongzhi Shi
University of Amsterdam, Amsterdam, The Netherlands
Cyriel Pennartz
Peking University, Beijing, China
Tiejun Huang

About this paper

Cite this paper

Wu, D., Ma, J. (2018). Related Text Discovery Through Consecutive Filtering and Supervised Learning. In: Shi, Z., Pennartz, C., Huang, T. (eds) Intelligence Science II. ICIS 2018. IFIP Advances in Information and Communication Technology, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-030-01313-4_22

DOI: https://doi.org/10.1007/978-3-030-01313-4_22
Published: 02 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01312-7
Online ISBN: 978-3-030-01313-4
eBook Packages: Computer Science, Computer Science (R0)
