
A synthetic neighborhood generation based ensemble learning for the imbalanced data classification


Abstract

Constructing effective classifiers from imbalanced datasets has emerged as one of the main challenges in the data mining community, owing to the prevalence of such data in many real-world domains. Ensemble solutions are often applied in this field because they provide better classification performance than a single classifier. However, most existing methods use data sampling only to train the base classifiers on balanced datasets, not to directly enhance their diversity, so the performance of the final ensemble can be limited. This paper proposes a new ensemble learning method that addresses the class imbalance problem and promotes diversity simultaneously. Inspired by the localized generalization error model, it generates synthetic samples located within a local neighborhood of each training sample and trains the base classifiers on the union of the original training samples and these synthetic neighborhood samples. By controlling the number of generated samples, each base classifier can be trained on a balanced dataset. Meanwhile, because the generated samples extend into different parts of the original input space and can differ considerably from the original training samples, the resulting base classifiers are both accurate and diverse. A thorough experimental study on 36 benchmark datasets demonstrated that the proposed method delivers significantly better performance than state-of-the-art ensemble solutions for imbalanced problems.
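To make the procedure concrete, below is a minimal Python sketch of the idea outlined in the abstract: each base classifier is trained on the original data plus synthetic samples drawn uniformly from small hypercube neighborhoods of the training samples (the hypercube width q plays the role of the neighborhood size in the localized generalization error model), with minority classes receiving enough synthetic neighbors to balance the class counts. The decision-tree base learner, the uniform perturbation, the balancing rule, and the majority vote are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def synthetic_neighbors(X, n_samples, q, rng):
    """Draw n_samples points uniformly from the +/- q hypercubes
    centered on randomly chosen rows of X."""
    idx = rng.integers(0, len(X), size=n_samples)
    noise = rng.uniform(-q, q, size=(n_samples, X.shape[1]))
    return X[idx] + noise


def fit_ensemble(X, y, n_estimators=10, q=0.1, seed=0):
    """Train each base classifier on the original samples plus enough
    synthetic neighbors of each minority class to balance the classes."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    ensemble = []
    for _ in range(n_estimators):
        X_parts, y_parts = [X], [y]
        for c, n_c in zip(classes, counts):
            n_new = n_max - n_c  # synthetic samples needed to balance class c
            if n_new > 0:
                X_parts.append(synthetic_neighbors(X[y == c], n_new, q, rng))
                y_parts.append(np.full(n_new, c))
        clf = DecisionTreeClassifier(random_state=int(rng.integers(10**6)))
        clf.fit(np.vstack(X_parts), np.concatenate(y_parts))
        ensemble.append(clf)
    return ensemble


def predict(ensemble, X):
    """Combine the base classifiers by unweighted majority vote
    (assumes non-negative integer class labels)."""
    votes = np.stack([clf.predict(X) for clf in ensemble]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

Because each base classifier sees a different random draw of synthetic neighborhoods, the ensemble members differ even though every one of them is trained on balanced data; this is the source of diversity the abstract refers to.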



Acknowledgements

This research was partly supported by the Science and Technology Supporting Program of Sichuan Province, China (grants 2013GZX0138 and 2014GZ0154).

Author information


Corresponding author

Correspondence to Tao Lin.


About this article


Cite this article

Chen, Z., Lin, T., Xia, X. et al. A synthetic neighborhood generation based ensemble learning for the imbalanced data classification. Appl Intell 48, 2441–2457 (2018). https://doi.org/10.1007/s10489-017-1088-8
