Handling imbalance in hierarchical classification problems using local classifiers approaches

Pereira, Rodolfo M.; Costa, Yandre M. G.; Silla, Carlos N.

doi:10.1007/s10618-021-00762-8

Published: 13 May 2021

Volume 35, pages 1564–1621, (2021)

Data Mining and Knowledge Discovery

Rodolfo M. Pereira¹,² (ORCID: orcid.org/0000-0003-1272-5378),
Yandre M. G. Costa³ &
Carlos N. Silla Jr.¹

Abstract

The task of learning from imbalanced datasets has been widely investigated in the binary, multi-class and multi-label classification scenarios. Although this problem also affects hierarchical datasets, few works in the literature address it. Meanwhile, local classifier approaches are the most widely used techniques in the literature for Hierarchical Classification problems. In this paper, we present new ways to handle data imbalance in hierarchical classification problems when using local classifier approaches. We propose three different resampling schemas, one for each local classification approach: (1) Local Classifiers per Node; (2) Local Classifiers per Parent Node; and (3) Local Classifiers per Level. To quantify how imbalanced a given hierarchical dataset is, we also propose three novel metrics that measure imbalance in hierarchical datasets under the different local classification approaches. The experimental evaluation on eight well-known datasets showed that the imbalance metrics can indeed measure dataset imbalance and that the proposed resampling schemas improve the classification results when compared to baselines, state-of-the-art and related work approaches.
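To make the idea of resampling inside a local classifier concrete, the following is a minimal sketch for the Local Classifier per Parent Node case. It is not the resampling schema or the imbalance metric proposed in the paper: the toy hierarchy, the function names, the max/min imbalance ratio and the use of plain random oversampling are all assumptions chosen for illustration. The same pattern would apply per node or per level by changing how the local training sets are built.

```python
import random
from collections import Counter

# Hypothetical toy hierarchy: child -> parent, with "R" as the root.
PARENT = {"1": "R", "2": "R", "1.1": "1", "1.2": "1", "2.1": "2", "2.2": "2"}


def path_to_root(label):
    """Return the labels from `label` up to, but excluding, the root."""
    path = []
    while label != "R":
        path.append(label)
        label = PARENT[label]
    return path


def lcpn_training_sets(samples):
    """Build one multi-class training set per parent node.

    `samples` is a list of (features, leaf_label) pairs. For each parent
    node, the local target is the child of that parent on the sample's path.
    """
    sets = {}
    for x, leaf in samples:
        for node in path_to_root(leaf):
            sets.setdefault(PARENT[node], []).append((x, node))
    return sets


def imbalance_ratio(local_set):
    """Generic max/min class-count ratio (an assumption, not the paper's metric)."""
    counts = Counter(y for _, y in local_set)
    return max(counts.values()) / min(counts.values())


def random_oversample(local_set, rng):
    """Duplicate minority samples until every local class matches the majority."""
    by_class = {}
    for x, y in local_set:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Usage: 8 samples of class 1.1, 2 of class 1.2, 3 each of classes 2.1 and 2.2.
samples = (
    [((i,), "1.1") for i in range(8)]
    + [((i,), "1.2") for i in range(2)]
    + [((i,), "2.1") for i in range(3)]
    + [((i,), "2.2") for i in range(3)]
)
rng = random.Random(0)
local_sets = lcpn_training_sets(samples)
ratios = {parent: imbalance_ratio(s) for parent, s in local_sets.items()}
balanced = {parent: random_oversample(s, rng) for parent, s in local_sets.items()}
counts = {parent: Counter(y for _, y in s) for parent, s in balanced.items()}
```

Resampling each parent node's training set separately means that a class can be a minority in one local problem and a majority in another, which a single global resampling pass over the leaf labels would not capture.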

Notes

Available at http://sites.labic.icmc.usp.br/jeanmetz/datasets.html.
Available at https://github.com/mdeff/fma.
Available at https://www.imageclef.org/2009/medanno.
Available at http://lshtc.iit.demokritos.gr/.
Available at https://dtai.cs.kuleuven.be/clus/.
Available at https://cs.gmu.edu/~mlbio/HierCost/.
Available at http://scikit-learn.org/.
Available at https://github.com/tsoumakas/mulan/.
Available at https://github.com/scikit-learn-contrib/imbalanced-learn.
Available at https://github.com/rodolfomp123/imb-mulan.

Acknowledgements

We thank the Brazilian Research Support Agencies: Coordination for the Improvement of Higher Education Personnel (CAPES), National Council for Scientific and Technological Development (CNPq) and Araucaria Foundation (FA) for their financial support. We also thank the anonymous reviewers and the Action Editor Grigorios Tsoumakas for their valuable feedback on the earlier versions of this manuscript.

Author information

Authors and Affiliations

Pontifícia Universidade Católica do Paraná, Curitiba, PR, Brazil
Rodolfo M. Pereira & Carlos N. Silla Jr.
Instituto Federal do Paraná, Pinhais, PR, Brazil
Rodolfo M. Pereira
Universidade Estadual de Maringá, Maringá, PR, Brazil
Yandre M. G. Costa

Corresponding author

Correspondence to Rodolfo M. Pereira.

Additional information

Responsible editor: Grigorios Tsoumakas.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

This appendix presents all the tables of classification and metric results generated in the experiments of this work, which were summarized as charts in the main part of the paper. In Tables 24–27, the rows in italics give the average ranking of the approaches. Besides the raw results, we also present the tables of the statistical tests that were applied to the results to support the answers given in the Analysis and Discussion section (Tables 7–32).

Table 7 F-Score results for the proposed approaches in the Cell-cycle dataset
