
Related Text Discovery Through Consecutive Filtering and Supervised Learning

  • Conference paper
Intelligence Science II (ICIS 2018)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 539)

Abstract

In a related (topic-based) text discovery task, there are usually only a few related, or positive, texts against a large number of unrelated, or negative, texts. The two classes can therefore be strongly imbalanced, which makes classification or detection very difficult because the recall of the positive class is very low. To overcome this difficulty, we propose a consecutive filtering and supervised learning method, i.e., consecutive supervised bagging. In each consecutive learning stage, we first delete the negative texts that the classifier trained in the previous stage rejects with the highest confidence, and then train a new classifier on the retained texts. We repeat this procedure until the ratio of negative to positive texts becomes reasonable, finally obtaining a tree-like filtering and recognition system. Experimental results on the 20NewsGroups data (English) and the THUCNews data (Chinese) demonstrate that the proposed method performs much better than AdaBoost and Rocchio.
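The staged procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the centroid-based scorer is a simple stand-in for whatever classifier the paper trains, and the names `train_centroid_scorer`, `fit_cascade`, `predict`, `drop_fraction`, `target_ratio`, and `max_stages` are hypothetical, as is the synthetic two-dimensional data that replaces real text features.

```python
import random

def train_centroid_scorer(X, y):
    """Fit a toy linear scorer; higher scores mean 'more likely positive'."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Bias places the zero-score boundary midway between the two centroids.
    b = -sum(wj * (p + n) / 2 for wj, p, n in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_cascade(X, y, drop_fraction=0.5, target_ratio=2.0, max_stages=10):
    """Filter confident negatives stage by stage, retraining on what remains.

    drop_fraction, target_ratio and max_stages are assumed knobs,
    not values taken from the paper.
    """
    X, y = list(X), list(y)
    stages = []  # each entry: (scorer, rejection threshold)
    scorer = train_centroid_scorer(X, y)
    for _ in range(max_stages):
        n_pos = sum(y)
        n_neg = len(y) - n_pos
        if n_neg <= target_ratio * n_pos:
            break  # class ratio is considered reasonable
        neg_scores = sorted(scorer(x) for x, label in zip(X, y) if label == 0)
        n_drop = int(drop_fraction * n_neg)
        if n_drop == 0:
            break
        # Negatives scoring at or below this value are the most confident ones.
        threshold = neg_scores[n_drop - 1]
        kept = [(x, label) for x, label in zip(X, y)
                if label == 1 or scorer(x) > threshold]
        if all(label == 1 for _, label in kept):
            break  # nothing left to train against
        stages.append((scorer, threshold))
        X = [x for x, _ in kept]
        y = [label for _, label in kept]
        scorer = train_centroid_scorer(X, y)  # retrain on the retained texts
    return stages, scorer

def predict(stages, final_scorer, x):
    """Reject x at the first stage that filters it; otherwise classify it."""
    for scorer, threshold in stages:
        if scorer(x) <= threshold:
            return 0
    return 1 if final_scorer(x) > 0 else 0

def cloud(center, spread, n):
    """Synthetic 2-D points standing in for text feature vectors."""
    return [[random.gauss(center, spread), random.gauss(center, spread)]
            for _ in range(n)]

random.seed(0)
# 20 positives, 300 easy negatives, 100 harder negatives (1:20 imbalance).
X = cloud(2.0, 0.3, 20) + cloud(-4.0, 0.5, 300) + cloud(0.0, 0.5, 100)
y = [1] * 20 + [0] * 400
stages, final_scorer = fit_cascade(X, y)
print(len(stages), predict(stages, final_scorer, [2.0, 2.0]))
```

In this sketch only negatives are ever removed from the training set, so the positive class keeps all of its examples while the imbalance shrinks at each stage. At prediction time a text can still be rejected early by any stage, which gives the chain of filters followed by a final classifier its tree-like shape.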

Acknowledgment

This work is supported by the Natural Science Foundation of China under Grant 61171138. We also thank Zhengzhou Shuneng Science and Technology Limited Company for contributing the THUCNews data set.

Corresponding author

Correspondence to Jinwen Ma.

Copyright information

© 2018 IFIP International Federation for Information Processing

About this paper

Cite this paper

Wu, D., Ma, J. (2018). Related Text Discovery Through Consecutive Filtering and Supervised Learning. In: Shi, Z., Pennartz, C., Huang, T. (eds) Intelligence Science II. ICIS 2018. IFIP Advances in Information and Communication Technology, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-030-01313-4_22

  • DOI: https://doi.org/10.1007/978-3-030-01313-4_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01312-7

  • Online ISBN: 978-3-030-01313-4

  • eBook Packages: Computer Science, Computer Science (R0)
