Abstract
Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and struggle to generalize to minority classes. To mitigate these issues, we introduce an augmentation technique for handling imbalanced datasets called Similarity-Enhanced Data Augmentation (SEDA). SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA ensures that the most influential features are prioritized, leading to more meaningful synthetic samples. We evaluate the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA’s effectiveness is compared against random oversampling and SMOTE. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a severe class imbalance, comprising 15 known samples against \(2.92\times 10^5\) unknown samples. Experiments show that adding high-quality synthetic samples helps the models generalize better to unseen data, addressing the overfitting commonly observed on imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit, the EIS Toolkit (https://github.com/GispoCoding/eis_toolkit), for mineral prospectivity mapping.
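To make the idea concrete, the sketch below illustrates one way an importance-weighted minority oversampler of the kind described in the abstract could look. It is not the authors' SEDA implementation; the function name, parameters, and the use of Random Forest importances and SMOTE-style interpolation are illustrative assumptions only.

```python
# Hypothetical sketch, NOT the authors' SEDA code: it only illustrates the idea of
# combining feature-importance weighting with distance similarity when synthesizing
# minority samples. All names and parameters here are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors


def importance_weighted_oversample(X, y, minority_label, n_new, k=5, random_state=0):
    """Generate n_new synthetic minority samples by interpolating between
    minority points that are close in an importance-weighted feature space."""
    rng = np.random.default_rng(random_state)

    # 1) Feature importances from a forest trained on the full (imbalanced) data.
    rf = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf.fit(X, y)
    w = rf.feature_importances_              # one weight per feature, sums to 1

    # 2) Distance similarity in the weighted space: scale each feature by sqrt(w)
    #    so that Euclidean distance emphasizes the most influential features.
    X_min = X[y == minority_label]
    X_min_w = X_min * np.sqrt(w)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min_w)
    _, idx = nn.kneighbors(X_min_w)          # idx[:, 0] is the point itself

    # 3) SMOTE-style interpolation between each seed and one of its k neighbors.
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

For the baselines named in the abstract, imbalanced-learn's `RandomOverSampler` and `SMOTE` can be applied to the same training split before fitting the MLP, RF, DT, and LR classifiers, so that all methods are compared on identical held-out data.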
Acknowledgements
This work is supported by funding from the Horizon Europe research and innovation programme under Grant Agreement No. 101057357, EIS - Exploration Information System. For further information, see the EIS website.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sheikh, J., Farahnakian, F., Farahnakian, F., Zelioli, L., Heikkonen, J. (2025). SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15326. Springer, Cham. https://doi.org/10.1007/978-3-031-78395-1_3
DOI: https://doi.org/10.1007/978-3-031-78395-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78394-4
Online ISBN: 978-3-031-78395-1
eBook Packages: Computer Science (R0)