Abstract
Multi-label imbalanced data are data in which the number of samples is disproportionate across classes. Traditional classifiers are better suited to balanced data, because their classification performance declines sharply when class sizes in multi-label data are imbalanced. In this study, we propose an algorithm that assesses the cost of the majority class and the value of the minority classes to address the multi-label imbalanced classification problem. The main idea of the algorithm is to quantify the cost of the majority class and the value of the minority class based on an imbalance ratio. In the data preprocessing step, we employ a penalty function to determine how many majority class instances to eliminate, and the contribution of each instance determines whether it is eliminated. In the classification step, we propose a metric to control the cost of the majority class and the value of the minority class. Experiments showed that this algorithm improves the performance of multi-label imbalanced data classification.
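The preprocessing step described above — using an imbalance ratio and a penalty function to decide how many majority class instances to eliminate — can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the per-label imbalance ratio follows the common IRLbl-style definition (most frequent label count divided by each label's count), and the penalty function `1 - 1/mean_ir` and the scaling factor `alpha` are hypothetical choices for illustration only.

```python
import numpy as np

def imbalance_ratios(Y):
    """Per-label imbalance ratio: count of the most frequent label
    divided by this label's count (IRLbl-style definition)."""
    counts = Y.sum(axis=0)
    counts = np.where(counts == 0, 1, counts)  # avoid division by zero
    return counts.max() / counts

def majority_elimination_count(Y, alpha=1.0):
    """Hypothetical penalty function: the number of majority-only
    instances to eliminate grows with the mean imbalance ratio.

    Y     : (n_samples, n_labels) binary label matrix
    alpha : hypothetical scaling factor in [0, 1]
    Returns (number to eliminate, boolean mask of candidate instances).
    """
    ir = imbalance_ratios(Y)
    mean_ir = ir.mean()
    majority = ir <= mean_ir  # labels at least as frequent as average
    # Candidates: instances whose active labels are all majority labels,
    # so removing them cannot shrink any minority label.
    majority_only = np.array([
        Y[i].astype(bool).any() and not (Y[i].astype(bool) & ~majority).any()
        for i in range(len(Y))
    ])
    penalty = 1.0 - 1.0 / mean_ir  # in [0, 1); 0 when perfectly balanced
    n_eliminate = int(alpha * penalty * majority_only.sum())
    return n_eliminate, majority_only
```

On a perfectly balanced label matrix the penalty is zero and nothing is removed; as the mean imbalance ratio grows, a larger fraction of the majority-only candidates is marked for elimination. The paper's actual criterion additionally ranks candidates by each instance's contribution before removal.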
Acknowledgments
The authors thank the editor and the anonymous reviewers for their helpful comments and suggestions. This study was supported by the National Natural Science Foundation of China (Grant No. 61573266).
Cite this article
Ding, M., Yang, Y. & Lan, Z. Multi-label imbalanced classification based on assessments of cost and value. Appl Intell 48, 3577–3590 (2018). https://doi.org/10.1007/s10489-018-1156-8