Abstract
Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and struggle to generalize to minority classes. To mitigate these issues, we introduce an augmentation technique for handling imbalanced datasets called Similarity-Enhanced Data Augmentation (SEDA). SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA ensures that the most influential features are prioritized, leading to more meaningful synthetic samples. We evaluate the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA’s effectiveness is compared against random oversampling and SMOTE. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a severe class imbalance, comprising 15 known samples against \(2.92\times 10^5\) unknown samples. Experiments show that adding high-quality synthetic samples helps the models generalize better to unseen data, addressing the overfitting commonly observed on imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit, the EIS Toolkit (https://github.com/GispoCoding/eis_toolkit), for mineral prospectivity mapping.
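To make the idea concrete, the sketch below illustrates one way an importance-weighted minority oversampler of the kind described in the abstract could look. It is not the authors' SEDA implementation; the function name, parameters, and the use of Random Forest importances and SMOTE-style interpolation are illustrative assumptions only.

```python
# Hypothetical sketch, NOT the authors' SEDA code: it only illustrates the idea of
# combining feature-importance weighting with distance similarity when synthesizing
# minority samples. All names and parameters here are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors


def importance_weighted_oversample(X, y, minority_label, n_new, k=5, random_state=0):
    """Generate n_new synthetic minority samples by interpolating between
    minority points that are close in an importance-weighted feature space."""
    rng = np.random.default_rng(random_state)

    # 1) Feature importances from a forest trained on the full (imbalanced) data.
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(X, y)
    w = rf.feature_importances_              # one weight per feature, sums to 1

    # 2) Distance similarity in the weighted space: scale each feature by sqrt(w)
    #    so that Euclidean distance emphasizes the most influential features.
    X_min = X[y == minority_label]
    X_min_w = X_min * np.sqrt(w)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min_w)
    _, idx = nn.kneighbors(X_min_w)          # idx[:, 0] is the point itself

    # 3) SMOTE-style interpolation between each seed and one of its k neighbors.
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

For the baselines named in the abstract, imbalanced-learn's `RandomOverSampler` and `SMOTE` can be applied to the same training split before fitting the MLP, RF, DT, and LR classifiers, so that all methods are compared on identical held-out data.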
Acknowledgements
This work is supported by funding from the Horizon Europe research and innovation programme under Grant Agreement No. 101057357, EIS - Exploration Information System. For further information, see the EIS website.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sheikh, J., Farahnakian, F., Farahnakian, F., Zelioli, L., Heikkonen, J. (2025). SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15326. Springer, Cham. https://doi.org/10.1007/978-3-031-78395-1_3
DOI: https://doi.org/10.1007/978-3-031-78395-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78394-4
Online ISBN: 978-3-031-78395-1
eBook Packages: Computer Science (R0)