Abstract
In multi-label classification, each instance can be assigned several labels at the same time. Two issues are critical in this setting: the relationships between labels and class imbalance. Despite the large number of existing multi-label classification methods, the widespread class imbalance among labels has not been adequately addressed. Two main problems must be solved to obtain an effective classifier for imbalanced multi-label data. On the one hand, imbalance can occur between labels and/or within a label. “Between-labels imbalance” arises when the imbalance lies between labels, whereas “within-label imbalance” arises when the imbalance lies inside a label itself, and it can affect several labels at once. On the other hand, the order in which labels are processed heavily influences the quality of a multi-label classifier. To address these challenges, we propose a bi-level evolutionary approach for the optimized induction of multivariate decision trees, in which the upper level designs the classifiers while the lower level approximates the optimal label ordering for each classifier. The proposed method, named BIMLC-GA (Bi-level Imbalanced Multi-Label Classification Genetic Algorithm), is compared with several state-of-the-art methods on a variety of imbalanced multi-label data sets from several application fields and is then applied to a case study on miRNA-related diseases. The statistical analysis of the obtained results shows the merits of our proposal.
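To make the two kinds of imbalance concrete, the following minimal Python sketch computes one ratio for each on a toy label matrix. It is an illustration under stated assumptions, not the measure used inside BIMLC-GA: the abstract gives no formulas, so the between-labels ratio here follows the common per-label imbalance ratio (count of the most frequent label divided by the count of each label), and the within-label ratio is assumed to be the majority-to-minority ratio between instances that carry a label and those that do not. The function names and the toy matrix `Y` are hypothetical.

```python
from typing import List

Matrix = List[List[int]]  # rows = instances, columns = labels, entries in {0, 1}


def label_counts(Y: Matrix) -> List[int]:
    """Number of instances that carry each label (column sums)."""
    return [sum(column) for column in zip(*Y)]


def between_label_ratios(Y: Matrix) -> List[float]:
    """Assumed between-labels measure: count of the most frequent label
    divided by the count of each label (1.0 for the most frequent label)."""
    counts = label_counts(Y)
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]


def within_label_ratios(Y: Matrix) -> List[float]:
    """Assumed within-label measure: majority/minority ratio between the
    instances that carry a label and those that do not."""
    n = len(Y)
    ratios = []
    for positives in label_counts(Y):
        negatives = n - positives
        minority = min(positives, negatives)
        majority = max(positives, negatives)
        ratios.append(majority / minority if minority else float("inf"))
    return ratios


# Hypothetical toy data: 6 instances, 3 labels.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]

if __name__ == "__main__":
    print("between-labels:", between_label_ratios(Y))
    print("within-label:  ", within_label_ratios(Y))
```

In this toy matrix the first label looks unproblematic from the between-labels view (it is the most frequent one) yet is strongly imbalanced within itself, which is why the two views are worth measuring separately.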
Data Availability
The data sets analysed during the current study are available at http://www.uco.es/kdis/mllresources/. The real human miRNA-disease associations were retrieved from the HMDD v3.0 database [52].
References
Sun J, Lang J, Fujita H, Li H (2018) Imbalanced enterprise credit evaluation with dte-sbd: decision tree ensemble based on smote and bagging with differentiated sampling rates. Inf Sci 425:76–91
Bi J, Zhang C (2018) An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowledge-Based Syst 158:81–93
Zhang C, Bi J, Xu S, Ramentol E, Fan G, Qiao B, Fujita H (2019) Multi-imbalance: an open-source software for multi-class imbalance learning. Knowledge-Based Syst 174:137–143
Zhang M-L, Zhou Z-H (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333
Dembczynski K, Cheng W, Hüllermeier E (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In: ICML, pp. 279–286
Read J, Martino L, Luengo D (2013) Efficient monte carlo optimization for multi-label classifier chains. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3457–3461. IEEE
Hernandez-Leal P, Orihuela-Espina F, Sucar E, Morales EF (2012) Hybrid binary-chain multi-label classifiers. In: Proceedings of the 6th European Workshop on Probabilistic Graphical Models, pp. 139–146. Citeseer
Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S (2012) An extensive experimental comparison of methods for multi-label learning. Pattern Recognit 45(9):3084–3104
Tsoumakas G, Partalas I, Vlahavas I (2008) A taxonomy and short review of ensemble selection. In: Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications, pp. 1–6
Gibaja E, Ventura S (2015) A tutorial on multilabel learning. ACM Comput Surv (CSUR) 47(3):1–38
Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Annal Op Res 153(1):235–256
Cerrada M, Sánchez R-V, Pacheco F, Cabrera D, Zurita G, Li C (2016) Hierarchical feature selection based on relative dependency for gear fault diagnosis. Appl Intell 44(3):687–703
Bennett KP, Kunapuli G, Hu J, Pang J-S (2008) Bilevel optimization and machine learning. In: IEEE World Congress on Computational Intelligence, pp. 25–47. Springer
Weng W, Li Y-W, Liu J-H, Wu S-X, Chen C-L (2021) Multi-label classification review and opportunities. J Netw Intell 6(2):255–275
Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit 45(10):3738–3750
Charte F, Rivera AJ, del Jesus MJ, Herrera F (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163:3–16
Li L, Wang H (2016) Towards label imbalance in multi-label classification with many labels. http://arxiv.org/abs/1604.01304
Moyano JM, Gibaja EL, Cios KJ, Ventura S (2020) Combining multi-label classifiers based on projections of the output space using evolutionary algorithms. Knowledge-Based Syst 196:105770
Rastin N, Jahromi MZ, Taheri M (2020) A generalized weighted distance k-nearest neighbor for multi-label problems. Pattern Recognit 45:107526
Cheng K, Gao S, Dong W, Yang X, Wang Q, Yu H (2020) Boosting label weighted extreme learning machine for classifying multi-label imbalanced data. Neurocomputing 403:360–370
Zhang M-L, Li Y-K, Yang H, Liu X-Y (2020) Towards class-imbalance aware multi-label learning. IEEE Trans Cybernet 52:4459
Charte F, Rivera AJ, del Jesus MJ, Herrera F (2019) Remedial-hwr: Tackling multilabel imbalance through label decoupling and data resampling hybridization. Neurocomputing 326:110–122
Ding M, Yang Y, Lan Z (2018) Multi-label imbalanced classification based on assessments of cost and value. Appl Intell 48(10):3577–3590
Tao Y, Jiang B, Xue L, Xie C, Zhang Y (2021) Evolutionary synthetic oversampling technique and cocktail ensemble model for warfarin dose prediction with imbalanced data. Neural Comput Appl 33(17):11203–11221
Slowik A, Kwasnicka H (2020) Evolutionary algorithms and their applications to engineering problems. Neural Comput Appl 32(16):12363–12379
Moyano JM, Gibaja EL, Cios KJ, Ventura S (2019) An evolutionary approach to build ensembles of multi-label classifiers. Inf Fusion 50:168–180
Moyano JM, Gibaja EL, Cios KJ, Ventura S (2020) Generating ensembles of multi-label classifiers using cooperative coevolutionary algorithms. In: ECAI 2020, pp. 1379–1386. IOS Press
Cerri R, Basgalupp MP, Barros RC, de Carvalho AC (2019) Inducing hierarchical multi-label classification rules with genetic algorithms. Appl Soft Comput 77:584–604
Omozaki Y, Masuyama N, Nojima Y, Ishibuchi H (2020) Multiobjective fuzzy genetics-based machine learning for multi-label classification. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. IEEE
Zitzler E, Künzli S (2004) Indicator-based selection in multiobjective search. In: International Conference on Parallel Problem Solving from Nature, pp. 832–842. Springer
Basseur M, Burke EK (2007) Indicator-based multi-objective local search. In: 2007 IEEE Congress on Evolutionary Computation, pp. 3100–3107. IEEE
Chawla NV (2009) Data mining for imbalanced datasets: an overview. In: Data mining and knowledge discovery handbook, pp. 875–886
Said R, Bechikh S, Louati A, Aldaej A, Said LB (2020) Solving combinatorial multi-objective bi-level optimization problems using multiple populations and migration schemes. IEEE Access 8:141674–141695
Chaabani A, Bechikh S, Said LB (2018) A new co-evolutionary decomposition-based algorithm for bi-level combinatorial optimization. Appl Intell 48(9):2847–2872
Gad AF (2021) Pygad: an intuitive genetic algorithm python library. http://arxiv.org/abs/2106.06158
Olson RS, Moore JH (2016) Tpot: a tree-based pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning, pp. 66–74. PMLR
Read J (2010) Scalable multi-label classification. PhD thesis, University of Waikato
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Garcia S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694
García S, Fernández A, Luengo J, Herrera F (2009) A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput 13(10):959
Sheskin DJ (2003) Handbook of parametric and nonparametric statistical procedures. Chapman and Hall/CRC, UK
Triguero I, González S, Moyano JM, García S, Alcalá-Fdez J, Luengo J, Fernández A, del Jesús MJ, Sánchez L, Herrera F (2017) Keel 3.0: an open source software for multi-stage analysis in data mining. Int J Comput Intell Syst 10(1):1238–1249
Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian J Stat 45:65–70
Shaffer JP (1986) Modified sequentially rejective multiple test procedures. J Am Stat Assoc 81(395):826–831
Ambros V (2004) The functions of animal micrornas. Nature 431(7006):350–355
Bartel DP (2004) Micrornas: genomics, biogenesis, mechanism, and function. Cell 116(2):281–297
Kozomara A, Griffiths-Jones S (2014) mirbase: annotating high confidence micrornas using deep sequencing data. Nucl Acids Res 42(D1):68–73
Friedman RC, Farh KK-H, Burge CB, Bartel DP (2009) Most mammalian mrnas are conserved targets of micrornas. Genome Res 19(1):92–105
Esteller M (2011) Non-coding rnas in human disease. Nat Rev Genetics 12(12):861–874
Stricker M, Asim MN, Dengel A, Ahmed S (2021) Circnet: an encoder-decoder-based convolution neural network (cnn) for circular rna identification. Neural Comput Appl 10:1–12
Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q (2019) Hmdd v3.0: a database for experimentally supported human microrna-disease associations. Nucl Acids Res 47(D1):1013–1017
Kibbe WA, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall CJ, Binder JX, Malone J, Vasant D et al (2015) Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucl Acids Res 43(D1):1071–1078
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Chabbouh, M., Bechikh, S., Mezura-Montes, E. et al. Imbalanced multi-label data classification as a bi-level optimization problem: application to miRNA-related diseases diagnosis. Neural Comput & Applic 35, 16285–16303 (2023). https://doi.org/10.1007/s00521-023-08458-4
DOI: https://doi.org/10.1007/s00521-023-08458-4