SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling

Applied Intelligence

Abstract

Many practical applications involve imbalanced data classification, in which the minority class suffers from a degraded recognition rate. The primary causes are the scarcity of minority class samples and the intrinsically complex distribution characteristics of imbalanced datasets. The imbalanced classification problem is more serious on small sample datasets. To address both the small sample and class imbalance problems, a hybrid resampling method is proposed. The proposed method combines an oversampling approach (synthetic minority oversampling technique, SMOTE) with a novel data cleaning approach (weighted edited nearest neighbor rule, WENN). First, SMOTE generates synthetic minority class examples using linear interpolation. Then, WENN detects and deletes unsafe majority and minority class examples using a weighted distance function and the k-nearest neighbor (kNN) rule. The weighted distance function scales up a commonly used distance by considering local imbalance and spatial sparsity. Extensive experiments on synthetic and real datasets validate the superiority of the proposed SMOTE-WENN over three state-of-the-art resampling methods.
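
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `smote` and `edited_nn`, their parameters, and the per-example `scale` vector are hypothetical. The abstract does not give the WENN weighting formula, so `scale` is only a placeholder for a weighting based on local imbalance and spatial sparsity; with the default of all ones the second function reduces to a plain edited nearest neighbor rule. The sketch assumes at least two minority examples and Euclidean distance.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # assumes n >= 2
    # Pairwise Euclidean distances among minority examples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))       # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


def edited_nn(X, y, k=3, scale=None):
    """Remove examples whose k nearest neighbours mostly carry another label.

    scale[j] multiplies every distance measured to example j (hypothetical
    placeholder for a WENN-style weight). None means plain ENN.
    """
    n = len(X)
    scale = np.ones(n) if scale is None else np.asarray(scale, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) * scale[None, :]
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    agree = (y[nn] == y[:, None]).sum(axis=1)
    keep = agree * 2 > k  # keep only if a strict majority of neighbours agree
    return X[keep], y[keep]
```

A hybrid pipeline in this spirit would append the output of `smote` to the training set with the minority label and then pass the enlarged set through `edited_nn`, so that both original and synthetic examples lying in unsafe regions can be removed.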

References

  1. Yu H, Ni J (2014) An improved ensemble learning method for classifying high-dimensional and imbalanced biomedicine data. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 11(4):657–666

  2. Yan Q, Cao Y (2020) Optimizing shapelets quality measure for imbalanced time series classification. Appl Intell 50(2):519–536

  3. Weiss G M, Provost F (2003) Learning when training data are costly: The effect of class distribution on tree induction. J Artif Intell Res 19:315–354

  4. Wu G, Chang E Y (2005) KBA: Kernel boundary alignment considering imbalanced data distribution. IEEE Trans Knowl Data Eng (6):786–795

  5. Napierala K, Stefanowski J (2016) Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 46(3):563–597

  6. Holte R C, Acker L, Porter B W et al (1989) Concept learning and the problem of small disjuncts. In: Proceedings of the 11th International Joint Conference on Artificial Intelligence, vol 89. Morgan Kaufmann Publishers, San Francisco, pp 813–818

  7. Prati R C, Batista G E, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Mexican international conference on artificial intelligence. Springer, Berlin, pp 312–321

  8. Napierała K, Stefanowski J, Wilk S (2010) Learning from imbalanced data in presence of noisy and borderline examples. In: International conference on rough sets and current trends in computing. Springer, Berlin, pp 158–167

  9. Stefanowski J (2013) Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In: Emerging paradigms in machine learning. Springer, Berlin, pp 277–306

  10. Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449

  11. He H, Garcia E A (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  12. Su C, Cao J (2019) Improving lazy decision tree for imbalanced classification by using skew-insensitive criteria. Appl Intell 49(3):1127–1145

  13. Xu Y, Wang Q, Pang X, Tian Y (2018) Maximum margin of twin spheres machine with pinball loss for imbalanced data classification. Appl Intell 48(1):23–34

  14. Lin W C, Tsai C F, Hu Y H, Jhang J S (2017) Clustering-based undersampling in class-imbalanced data. Inf Sci 409:17–26

  15. Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70

  16. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  17. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Pacific-asia conference on knowledge discovery and data mining. Springer, Berlin, pp 475–482

  18. Maciejewski T, Stefanowski J (2011) Local neighbourhood extension of SMOTE for mining imbalanced data. In: 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, Washington, pp 104–111

  19. Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203

  20. Batista G E, Prati R C, Monard M C (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newslett 6(1):20–29

  21. Guan H, Zhang Y, Xian M, Cheng H D, Tang X (2016) WENN for individualized cleaning in imbalanced data. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, pp 456–461

  22. Khoshgoftaar T M, Rebours P (2007) Improving software quality prediction by noise filtering techniques. J Comput Sci Technol 22(3):387–396

  23. Wilson D R, Martinez T R (1997) Improved heterogeneous distance functions. J Artif Intell Res 6:1–34

  24. López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

  25. Luque A, Carrasco A, Martin A, Heras A D L (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn 91:216–231

  26. Garcia S, Fernandez A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf Sci 180(10):2044–2064

  27. Das S, Datta S, Chaudhuri B B (2018) Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recogn 81:674–693

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos: 61872203 and 61802212), the Shandong Provincial Natural Science Foundation (No: ZR2019BF017), Major Scientific and Technological Innovation Projects of Shandong Province (Nos: 2019JZZY010127, 2019JZZY010132 and 2019JZZY010201), Jinan City “20 universities” Funding Projects Introducing Innovation Team Program (No: 2019GXRC031), and the Project of Shandong Province Higher Educational Science and Technology Program (No: J18KA331).

Author information

Corresponding author

Correspondence to Hongjiao Guan.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Guan, H., Zhang, Y., Xian, M. et al. SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell 51, 1394–1409 (2021). https://doi.org/10.1007/s10489-020-01852-8

  • DOI: https://doi.org/10.1007/s10489-020-01852-8
