SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning

  • Conference paper
  • Pattern Recognition (ICPR 2024)

Abstract

Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and generalize poorly to the minority classes. To mitigate these issues, we introduce an augmentation technique for imbalanced datasets called Similarity-Enhanced Data Augmentation (SEDA). SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA ensures that the most influential features are prioritized, leading to more meaningful synthetic samples. We evaluate the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA’s effectiveness is compared against random oversampling and SMOTE. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a severe class imbalance, comprising 15 known samples in contrast to \(2.92\times 10^5\) unknown samples. The experiments show that adding high-quality synthetic samples helps the models generalize better to unseen data, addressing the overfitting commonly seen with imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit, EIS Toolkit (https://github.com/GispoCoding/eis_toolkit), for mineral prospectivity mapping.
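The abstract describes SEDA only at a high level: SMOTE-style minority oversampling guided by feature and distance similarities, with influential features prioritized via feature importance. The sketch below is a hypothetical illustration of that idea, not the authors' implementation; the function name `seda_oversample`, the importance weights, and the neighbourhood size `k` are all assumptions. It interpolates between a minority sample and one of its nearest minority neighbours, where "nearest" is measured under an importance-weighted distance.

```python
import numpy as np

def seda_oversample(X_min, n_new, importances, k=3, rng=None):
    """Hypothetical sketch of importance-weighted minority oversampling.

    For each synthetic sample: pick a random minority point, find its k
    nearest minority neighbours under a feature-importance-weighted
    distance, and interpolate toward a randomly chosen neighbour
    (as in SMOTE).
    """
    rng = np.random.default_rng(rng)
    w = importances / importances.sum()      # normalise feature weights
    n, n_feat = X_min.shape
    out = np.empty((n_new, n_feat))
    for i in range(n_new):
        a = rng.integers(n)
        # importance-weighted squared distances to every other minority point
        d = ((X_min - X_min[a]) ** 2 * w).sum(axis=1)
        d[a] = np.inf                        # exclude the point itself
        nbrs = np.argsort(d)[:k]
        b = rng.choice(nbrs)
        gap = rng.random()                   # interpolation factor in [0, 1)
        out[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return out

# toy usage: 15 minority samples (as in the paper's setting), 4 features,
# with made-up importance weights
X_min = np.random.default_rng(0).normal(size=(15, 4))
X_new = seda_oversample(X_min, n_new=30,
                        importances=np.array([0.4, 0.3, 0.2, 0.1]), rng=0)
print(X_new.shape)  # (30, 4)
```

Because each synthetic point is a convex combination of two minority samples, the generated data stays inside the minority class's feature-wise range, which is one way such augmentation can avoid producing implausible samples.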


Notes

  1. https://tupa.gtk.fi/paikkatieto/meta/aeroelectromagnetic_raster_data_of_finland.html

  2. https://tupa.gtk.fi/paikkatieto/meta/aeromagnetic_raster_data_of_finland.html

  3. https://tupa.gtk.fi/paikkatieto/meta/aeroradiometric_raster_data_of_finland.html


Acknowledgements

This work is supported by funds from the Horizon Europe research and innovation programme under Grant Agreement No. 101057357, EIS - Exploration Information System. For further information, see the EIS website.

Author information

Corresponding author

Correspondence to Javad Sheikh.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sheikh, J., Farahnakian, F., Farahnakian, F., Zelioli, L., Heikkonen, J. (2025). SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15326. Springer, Cham. https://doi.org/10.1007/978-3-031-78395-1_3

  • DOI: https://doi.org/10.1007/978-3-031-78395-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78394-4

  • Online ISBN: 978-3-031-78395-1

  • eBook Packages: Computer Science (R0)
