Imbalanced multi-label data classification as a bi-level optimization problem: application to miRNA-related diseases diagnosis

  • Original Article
  • Neural Computing and Applications

Abstract

In multi-label classification, each instance can be assigned multiple labels at the same time. In this setting, the relationships between labels and the class imbalance are two serious issues that must be addressed. Despite the large number of existing multi-label classification methods, the widespread class imbalance among labels has not been adequately addressed. Two main issues must be solved to obtain an effective classifier for imbalanced multi-label data. On the one hand, the imbalance can occur between labels and/or within a label. “Between-labels imbalance” refers to imbalance between different labels, whereas “within-label imbalance” refers to imbalance inside a single label, and it may affect several labels at once. On the other hand, the order in which labels are processed heavily influences the quality of a multi-label classifier. To deal with these challenges, we propose in this paper a bi-level evolutionary approach for the optimized induction of multivariate decision trees, where the upper level designs the classifiers while the lower level approximates the optimal label ordering for each classifier. Our proposed method, named BIMLC-GA (Bi-level Imbalanced Multi-Label Classification Genetic Algorithm), is compared to several state-of-the-art methods on a variety of imbalanced multi-label data sets from several application fields and then applied to the miRNA-related diseases case study. The statistical analysis of the obtained results shows the merits of our proposal.
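
The two imbalance notions named in the abstract can be made concrete on a toy label matrix. The following is a minimal sketch, not the authors' method: it assumes that within-label imbalance is measured as the majority-to-minority ratio of a label's positive and negative instances, and that between-labels imbalance is measured with a per-label imbalance ratio (count of the most frequent label divided by the count of the given label), a measure commonly used in the multi-label imbalance literature. The function names and the toy matrix `TOY_Y` are hypothetical.

```python
from typing import List

# Hypothetical toy label matrix: 8 instances (rows) x 3 labels (columns).
# A 1 means the instance carries that label.
TOY_Y: List[List[int]] = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
]


def label_counts(Y: List[List[int]]) -> List[int]:
    """Number of positive instances per label (column sums)."""
    return [sum(column) for column in zip(*Y)]


def within_label_ratios(Y: List[List[int]]) -> List[float]:
    """Assumed within-label measure: majority / minority class size per label."""
    n = len(Y)
    ratios = []
    for pos in label_counts(Y):
        neg = n - pos
        minority, majority = min(pos, neg), max(pos, neg)
        # A label with an empty class is treated as infinitely imbalanced.
        ratios.append(majority / minority if minority else float("inf"))
    return ratios


def between_label_ratios(Y: List[List[int]]) -> List[float]:
    """Assumed between-labels measure: most frequent label count / this label's count."""
    counts = label_counts(Y)
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]


def mean_imbalance_ratio(Y: List[List[int]]) -> float:
    """Average of the per-label between-labels ratios."""
    ratios = between_label_ratios(Y)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print("positives per label:", label_counts(TOY_Y))
    print("within-label ratios:", within_label_ratios(TOY_Y))
    print("between-labels ratios:", between_label_ratios(TOY_Y))
    print("mean imbalance ratio:", mean_imbalance_ratio(TOY_Y))
```

On this toy matrix, the third label is rare both relative to the first label and relative to its own negative instances, while the first label is the most frequent one yet still skewed internally, so the two notions need not agree.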

Data Availability

The data sets analysed during the current study are available at http://www.uco.es/kdis/mllresources/. The real human miRNA-disease associations were retrieved from the HMDD v3.0 database [52].

Notes

  1. https://colab.research.google.com.

  2. https://pygad.readthedocs.io/en/latest/.

  3. https://pypi.org/project/scikit-obliquetree/.

  4. http://epistasislab.github.io/tpot/.

References

  1. Sun J, Lang J, Fujita H, Li H (2018) Imbalanced enterprise credit evaluation with dte-sbd: decision tree ensemble based on smote and bagging with differentiated sampling rates. Inf Sci 425:76–91

  2. Bi J, Zhang C (2018) An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowledge-Based Syst 158:81–93

  3. Zhang C, Bi J, Xu S, Ramentol E, Fan G, Qiao B, Fujita H (2019) Multi-imbalance: an open-source software for multi-class imbalance learning. Knowledge-Based Syst 174:137–143

  4. Zhang M-L, Zhou Z-H (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048

  5. Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333

  6. Dembczynski K, Cheng W, Hüllermeier E (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In: ICML, pp. 279–286

  7. Read J, Martino L, Luengo D (2013) Efficient monte carlo optimization for multi-label classifier chains. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3457–3461. IEEE

  8. Hernandez-Leal P, Orihuela-Espina F, Sucar E, Morales EF (2012) Hybrid binary-chain multi-label classifiers. In: Proceedings of the 6th European Workshop on Probabilistic Graphical Models, pp. 139–146. Citeseer

  9. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S (2012) An extensive experimental comparison of methods for multi-label learning. Pattern Recognit 45(9):3084–3104

  10. Tsoumakas G, Partalas I, Vlahavas I (2008) A taxonomy and short review of ensemble selection. In: Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications, pp. 1–6

  11. Gibaja E, Ventura S (2015) A tutorial on multilabel learning. ACM Comput Surv (CSUR) 47(3):1–38

  12. Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Annal Op Res 153(1):235–256

  13. Cerrada M, Sánchez R-V, Pacheco F, Cabrera D, Zurita G, Li C (2016) Hierarchical feature selection based on relative dependency for gear fault diagnosis. Appl Intell 44(3):687–703

  14. Bennett KP, Kunapuli G, Hu J, Pang J-S (2008) Bilevel optimization and machine learning. In: IEEE World Congress on Computational Intelligence, pp. 25–47. Springer

  15. Weng W, Li Y-W, Liu J-H, Wu S-X, Chen C-L (2021) Multi-label classification review and opportunities. J Netw Intell 6(2):255–275

  16. Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit 45(10):3738–3750

  17. Charte F, Rivera AJ, del Jesus MJ, Herrera F (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163:3–16

  18. Li L, Wang H (2016) Towards label imbalance in multi-label classification with many labels. http://arxiv.org/abs/1604.01304

  19. Moyano JM, Gibaja EL, Cios KJ, Ventura S (2020) Combining multi-label classifiers based on projections of the output space using evolutionary algorithms. Knowledge-Based Syst 196:105770

  20. Rastin N, Jahromi MZ, Taheri M (2020) A generalized weighted distance k-nearest neighbor for multi-label problems. Pattern Recognit 45:107526

  21. Cheng K, Gao S, Dong W, Yang X, Wang Q, Yu H (2020) Boosting label weighted extreme learning machine for classifying multi-label imbalanced data. Neurocomputing 403:360–370

  22. Zhang M-L, Li Y-K, Yang H, Liu X-Y (2020) Towards class-imbalance aware multi-label learning. IEEE Trans Cybernet 52:4459

  23. Charte F, Rivera AJ, del Jesus MJ, Herrera F (2019) Remedial-hwr: Tackling multilabel imbalance through label decoupling and data resampling hybridization. Neurocomputing 326:110–122

  24. Ding M, Yang Y, Lan Z (2018) Multi-label imbalanced classification based on assessments of cost and value. Appl Intell 48(10):3577–3590

  25. Tao Y, Jiang B, Xue L, Xie C, Zhang Y (2021) Evolutionary synthetic oversampling technique and cocktail ensemble model for warfarin dose prediction with imbalanced data. Neural Comput Appl 33(17):11203–11221

  26. Slowik A, Kwasnicka H (2020) Evolutionary algorithms and their applications to engineering problems. Neural Comput Appl 32(16):12363–12379

  27. Moyano JM, Gibaja EL, Cios KJ, Ventura S (2019) An evolutionary approach to build ensembles of multi-label classifiers. Inf Fusion 50:168–180

  28. Moyano JM, Gibaja EL, Cios KJ, Ventura S (2020) Generating ensembles of multi-label classifiers using cooperative coevolutionary algorithms. In: ECAI 2020, pp. 1379–1386. IOS Press

  29. Cerri R, Basgalupp MP, Barros RC, de Carvalho AC (2019) Inducing hierarchical multi-label classification rules with genetic algorithms. Appl Soft Comput 77:584–604

  30. Omozaki Y, Masuyama N, Nojima Y, Ishibuchi H (2020) Multiobjective fuzzy genetics-based machine learning for multi-label classification. In: 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. IEEE

  31. Zitzler E, Künzli S (2004) Indicator-based selection in multiobjective search. In: International Conference on Parallel Problem Solving from Nature, pp. 832–842. Springer

  32. Basseur M, Burke EK (2007) Indicator-based multi-objective local search. In: 2007 IEEE Congress on Evolutionary Computation, pp. 3100–3107. IEEE

  33. Chawla NV (2009) Data mining for imbalanced datasets: an overview. In: Data mining and knowledge discovery handbook, pp. 875–886

  34. Said R, Bechikh S, Louati A, Aldaej A, Said LB (2020) Solving combinatorial multi-objective bi-level optimization problems using multiple populations and migration schemes. IEEE Access 8:141674–141695

  35. Chaabani A, Bechikh S, Said LB (2018) A new co-evolutionary decomposition-based algorithm for bi-level combinatorial optimization. Appl Intell 48(9):2847–2872

  36. Gad AF (2021) Pygad: an intuitive genetic algorithm python library. http://arxiv.org/abs/2106.06158

  37. Olson RS, Moore JH (2016) Tpot: a tree-based pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning, pp. 66–74. PMLR

  38. Read J (2010) Scalable multi-label classification. PhD thesis, University of Waikato

  39. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  40. Garcia S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694

  41. García S, Fernández A, Luengo J, Herrera F (2009) A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Comput 13(10):959

  42. Sheskin DJ (2003) Handbook of parametric and nonparametric statistical procedures. Chapman and Hall/CRC, UK

  43. Triguero I, González S, Moyano JM, García S, Alcalá-Fdez J, Luengo J, Fernández A, del Jesús MJ, Sánchez L, Herrera F (2017) Keel 3.0: an open source software for multi-stage analysis in data mining. Int J Comput Intell Syst 10(1):1238–1249

  44. Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian J Stat 6(2):65–70

  45. Shaffer JP (1986) Modified sequentially rejective multiple test procedures. J Am Stat Assoc 81(395):826–831

  46. Ambros V (2004) The functions of animal micrornas. Nature 431(7006):350–355

  47. Bartel DP (2004) Micrornas: genomics, biogenesis, mechanism, and function. Cell 116(2):281–297

  48. Kozomara A, Griffiths-Jones S (2014) mirbase: annotating high confidence micrornas using deep sequencing data. Nucl Acids Res 42(D1):68–73

  49. Friedman RC, Farh KK-H, Burge CB, Bartel DP (2009) Most mammalian mrnas are conserved targets of micrornas. Genome Res 19(1):92–105

  50. Esteller M (2011) Non-coding rnas in human disease. Nat Rev Genetics 12(12):861–874

  51. Stricker M, Asim MN, Dengel A, Ahmed S (2021) Circnet: an encoder-decoder-based convolution neural network (cnn) for circular rna identification. Neural Comput Appl 10:1–12

  52. Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, Zhou Y, Cui Q (2019) Hmdd v3.0: a database for experimentally supported human microrna-disease associations. Nucl Acids Res 47(D1):1013–1017

  53. Kibbe WA, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall CJ, Binder JX, Malone J, Vasant D et al (2015) Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucl Acids Res 43(D1):1071–1078

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Corresponding author

Correspondence to Marwa Chabbouh.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chabbouh, M., Bechikh, S., Mezura-Montes, E. et al. Imbalanced multi-label data classification as a bi-level optimization problem: application to miRNA-related diseases diagnosis. Neural Comput & Applic 35, 16285–16303 (2023). https://doi.org/10.1007/s00521-023-08458-4

  • DOI: https://doi.org/10.1007/s00521-023-08458-4
