Transfer synthetic over-sampling for class-imbalance learning with limited minority class data

Liu, Xu-Ying; Wang, Sheng-Tao; Zhang, Min-Ling

doi:10.1007/s11704-018-7182-1

Research Article
Published: 17 June 2019

Volume 13, pages 996–1009 (2019)

Frontiers of Computer Science

Xu-Ying Liu^1,2,3,
Sheng-Tao Wang^1,2,3 &
Min-Ling Zhang^1,2,3

Abstract

The problem of limited minority class data is encountered in many class-imbalanced applications but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce much noise when the minority class has limited data, since the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth data distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source and exploits transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance class sizes. Then, in each boosting iteration, the weights of synthetic samples decrease and the weights of original data increase when they are misclassified, and remain unchanged otherwise. The misclassified synthetic samples are potential noise and thus have smaller influence in the following iterations. In addition, the weights of minority class instances change more than those of majority class instances, so that they are more influential, and only original data are used to estimate the error rate, which keeps the estimate immune to noise. Finally, since the synthetic samples are highly related to the minority class, all of the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
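The per-iteration weight update summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the update formulas, so the shrink factor `beta_syn`, the exponent `minority_power`, the AdaBoost-style growth factor, and the function name `traso_style_update` are all assumptions chosen for illustration. The sketch only mirrors the stated behavior: misclassified synthetic samples lose weight, misclassified original samples gain weight (minority more than majority), correctly classified samples are unchanged, and the error rate is estimated on original data only.

```python
def traso_style_update(weights, is_synthetic, is_minority, misclassified,
                       beta_syn=0.5, minority_power=2.0):
    """One illustrative round of a Traso-style weight update.

    beta_syn and minority_power are hypothetical parameters, not values
    from the paper. Assumes at least one original (non-synthetic) sample.
    """
    # Estimate the weighted error rate on original data only,
    # so that noisy synthetic samples do not affect the estimate.
    orig_total = sum(w for w, s in zip(weights, is_synthetic) if not s)
    orig_wrong = sum(w for w, s, m in zip(weights, is_synthetic, misclassified)
                     if not s and m)
    eps = orig_wrong / orig_total

    # Clip so the growth factor is finite and greater than 1 (assumption).
    eps_clipped = min(max(eps, 1e-10), 0.5 - 1e-10)
    growth = (1.0 - eps_clipped) / eps_clipped

    new_weights = []
    for w, syn, mino, mis in zip(weights, is_synthetic, is_minority, misclassified):
        if not mis:
            new_weights.append(w)                      # correct: unchanged
        elif syn:
            new_weights.append(w * beta_syn)           # synthetic: shrink
        elif mino:
            new_weights.append(w * growth ** minority_power)  # larger change
        else:
            new_weights.append(w * growth)             # original majority
    return new_weights, eps


# Toy data: 2 synthetic minority, 2 original minority, 4 original majority.
weights = [1.0] * 8
is_synthetic = [True, True] + [False] * 6
is_minority = [True] * 4 + [False] * 4
misclassified = [True, False, True, False, True, False, False, False]

new_weights, eps = traso_style_update(
    weights, is_synthetic, is_minority, misclassified)
print(eps, new_weights)
```

In this toy run the error rate on original data is one third, the misclassified synthetic sample drops to 0.5, the misclassified original minority sample grows to roughly 4, and the misclassified original majority sample grows to roughly 2. Weights are deliberately left unnormalized so that correctly classified samples stay literally unchanged; a full boosting loop would also need a weak learner and an aggregation rule, which are omitted here.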

Acknowledgements

The authors wish to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key R&D Program of China (2017YFB1002801), the National Natural Science Foundation of China (Grant Nos. 61473087, 61573104), the Natural Science Foundation of Jiangsu Province (BK20141340), and partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China
Xu-Ying Liu, Sheng-Tao Wang & Min-Ling Zhang
Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing, 210096, China
Xu-Ying Liu, Sheng-Tao Wang & Min-Ling Zhang
Collaborative Innovation Center for Wireless Communications Technology, Nanjing, 210096, China
Xu-Ying Liu, Sheng-Tao Wang & Min-Ling Zhang

Corresponding author

Correspondence to Xu-Ying Liu.

Additional information

Xu-Ying Liu received the BS degree in computer science from Nanjing University of Aeronautics and Astronautics, China, and the MS and PhD degrees in computer science from Nanjing University, China in 2006 and 2010, respectively. She is now an assistant professor at the School of Computer Science and Engineering, Southeast University, China. Her research interests mainly include machine learning and data mining.

Sheng-Tao Wang received the MS degree in computer science and engineering from Southeast University, China in 2017. He is currently a big data development engineer at Helium Data. His research interests include machine learning and data mining.

Min-Ling Zhang received the BS, MS, and PhD degrees in computer science from Nanjing University, China in 2001, 2004 and 2007, respectively. Currently, he is a professor at the School of Computer Science and Engineering, Southeast University, China. In recent years, Dr. Zhang has served as Program Co-Chair of ACML’17, CCFAI’17, and PRICAI’16, and as Senior PC member or Area Chair of AAAI’18/’17, IJCAI’17/’15, ICDM’17/’16, PAKDD’16/’15, etc. He is also on the editorial boards of Frontiers of Computer Science, ACM Transactions on Intelligent Systems and Technology, and Neural Networks. Dr. Zhang is the secretary-general of the CAAI (Chinese Association of Artificial Intelligence) Machine Learning Society and a standing committee member of the CCF (China Computer Federation) Artificial Intelligence & Pattern Recognition Society.

Electronic supplementary material