Abstract
Classification of imbalanced data is an essential research problem, as data from most real-world applications follow non-uniform class proportions. Solutions for handling class imbalance depend on how important one data point is relative to another. Directed data sampling and data-level cost-sensitive methods use this importance information to sample from the dataset so that the essential data points are retained and possibly oversampled. In this paper, we propose a novel topic-modeling-based weighting framework that assigns importance to the data points of an imbalanced dataset based on topic posterior probabilities estimated with latent Dirichlet allocation and probabilistic latent semantic analysis models. We also propose TOMBoost, a topic-modeled boosting scheme built on this weighting framework and tuned particularly for learning with class imbalance. In an empirical study spanning 40 datasets, TOMBoost wins or ties on 37 datasets on average against other boosting and sampling methods. We also show empirically that TOMBoost minimizes model bias faster than other popular boosting methods for class imbalance learning.
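To make the weighting idea concrete, the following is a minimal illustrative sketch, not the paper's exact algorithm, of how per-instance topic posteriors from an LDA model could be turned into weights and passed to a boosting learner. The entropy-based weighting rule, the function name, and the use of scikit-learn's LatentDirichletAllocation and AdaBoostClassifier are all assumptions made for illustration.

# Illustrative sketch only: the entropy-based rule and the scikit-learn
# estimators used here are assumptions, not the exact TOMBoost algorithm.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import AdaBoostClassifier

def topic_posterior_weights(X_counts, n_topics=10, seed=0):
    """Derive instance weights from LDA topic posteriors.

    Instances whose topic posterior is ambiguous (high entropy) are
    treated as harder and receive larger weight; the direction of this
    rule is an assumption made for illustration.
    """
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(X_counts)  # rows: per-instance topic posteriors
    entropy = -np.sum(theta * np.log(theta + 1e-12), axis=1)
    return entropy / entropy.sum()  # normalize into a weight distribution

# Usage: X_counts must be non-negative count-like features (e.g., bag-of-words).
# weights = topic_posterior_weights(X_counts)
# AdaBoostClassifier(n_estimators=50).fit(X_counts, y, sample_weight=weights)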
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Santhiappan, S., Chelladurai, J. & Ravindran, B. TOMBoost: a topic modeling based boosting approach for learning with class imbalance. Int J Data Sci Anal 17, 389–409 (2024). https://doi.org/10.1007/s41060-022-00363-8