Text length considered adaptive bagging ensemble learning algorithm for text classification

Published in: Multimedia Tools and Applications

Abstract

Ensemble learning constructs strong classifiers by training multiple weak classifiers and is widely used in the field of text classification. To improve text classification accuracy, a text-length-aware adaptive bootstrap aggregating (Bagging) ensemble learning algorithm, called TC_Bagging, is proposed. First, the performance of several typical deep learning methods on long and short texts is compared, and optimal base classifier groups are constructed for each text length. Second, a random sampling method based on an adaptive threshold group is proposed to train long-text and short-text sample subsets while preserving the proportions of samples across categories. Finally, to avoid the accuracy loss that sampling may introduce, a smooth inverse frequency (SIF) based text vector generation algorithm is combined with a traditional weighted voting ensemble method to obtain the final classification result. Comparisons with several baseline methods on three datasets suggest that TC_Bagging outperforms RF, WAVE, RF_WMVE and RF_WAVE by approximately 0.120, 0.300 and 0.060 in average F1, average sensitivity and average specificity, respectively, giving it a clear advantage over typical ensemble learning algorithms.
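
Two of the mechanisms named in the abstract can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: the function names, the smoothing constant a = 1e-3 (the default suggested for SIF by Arora et al., 2017), and the single fixed length threshold used in place of the paper's adaptive threshold group are all assumptions made for the example.

    import numpy as np

    def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
        """Smooth inverse frequency (SIF) sentence embeddings.

        sentences    -- list of token lists
        word_vectors -- dict: token -> d-dimensional np.ndarray
        word_prob    -- dict: token -> estimated unigram probability p(w)
        a            -- smoothing constant (1e-3 per Arora et al., 2017)
        """
        dim = len(next(iter(word_vectors.values())))
        emb = np.zeros((len(sentences), dim))
        for i, sent in enumerate(sentences):
            tokens = [t for t in sent if t in word_vectors]
            if not tokens:
                continue
            # Each word is weighted by a / (a + p(w)): frequent words count less.
            weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in tokens])
            vectors = np.stack([word_vectors[t] for t in tokens])
            emb[i] = (weights[:, None] * vectors).mean(axis=0)
        # Remove the first principal component (the shared "discourse" direction).
        _, _, vt = np.linalg.svd(emb, full_matrices=False)
        pc = vt[0]
        return emb - np.outer(emb @ pc, pc)

    def length_stratified_bootstrap(texts, labels, threshold, rng=None):
        """Draw one bootstrap sample separately from long and short texts,
        sampling within each class so that category proportions are kept.
        A single fixed `threshold` is a simplification of the paper's
        adaptive threshold group."""
        rng = np.random.default_rng(rng)
        labels = np.asarray(labels)
        idx = np.arange(len(texts))
        is_long = np.array([len(t) > threshold for t in texts])
        chosen = []
        for part in (idx[is_long], idx[~is_long]):    # long / short subsets
            for c in np.unique(labels[part]):         # preserve class ratios
                pool = part[labels[part] == c]
                chosen.extend(rng.choice(pool, size=len(pool), replace=True))
        return np.array(chosen)

In a full pipeline along these lines, each bootstrap sample would train one base classifier on the SIF vectors, and the final label would come from weighted voting across base classifiers.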




Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 61906220), the Humanities and Social Science Project of the Ministry of Education (No. 19YJCZH178), the National Social Science Foundation of China (No. 18CTJ008), the Natural Science Foundation of Tianjin (No. 18JCQNJC69600), the National Key R&D Program of China (No. 2017YFB1400700), and the Emerging Interdisciplinary Project of CUFE.

Author information


Corresponding author

Correspondence to Lizhou Feng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work, and no commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, Y., Liu, J. & Feng, L. Text length considered adaptive bagging ensemble learning algorithm for text classification. Multimed Tools Appl 82, 27681–27706 (2023). https://doi.org/10.1007/s11042-023-14578-9

