DOI: 10.1145/3603287.3651191
Short paper

Prediction Performance Analysis for ML Models Based on Impacts of Data Imbalance and Bias

Published: 27 April 2024

Abstract

An imbalanced dataset is characterized by a substantial disparity in the distribution of examples among its classes, with one class containing significantly more instances than the others. Most credit fraud datasets are imbalanced. Addressing the challenges posed by imbalanced datasets in classification problems is a complex task, as many classification algorithms struggle to perform satisfactorily under such conditions. In this article, we conducted a comparative analysis of various classifiers to assess their performance on imbalanced credit card fraud data. We then employed the Synthetic Minority Oversampling Technique (SMOTE) to convert the imbalanced data into a relatively balanced dataset and re-evaluated the classification results with the same classifiers. Ultimately, our findings revealed that the Naive Bayes classifier was the least sensitive to dataset imbalance, with an AUC score increase rate of 40.19% after balancing, while the KNN classifier was the most sensitive, with an AUC score increase rate of 61.27%. Overall, AdaBoost and Random Forest achieved the highest AUC scores, both exceeding 95% after SMOTE.
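As a concrete illustration of the workflow the abstract describes, the sketch below rebuilds the basic experiment in Python with scikit-learn and imbalanced-learn: fit several classifiers on imbalanced data, oversample the training split with SMOTE, and compare AUC before and after balancing. This is not the authors' exact pipeline; the synthetic stand-in dataset, the train/test split, and all hyperparameters are assumptions made for the example.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE  # imbalanced-learn package

# Stand-in imbalanced binary data; the paper uses a credit card dataset instead.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# SMOTE is applied to the training split only, so the held-out test set keeps
# the original (imbalanced) class distribution.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    auc_raw = roc_auc_score(
        y_test, clf.fit(X_train, y_train).predict_proba(X_test)[:, 1])
    auc_smote = roc_auc_score(
        y_test, clf.fit(X_bal, y_bal).predict_proba(X_test)[:, 1])
    print(f"{name}: AUC {auc_raw:.3f} (imbalanced) -> {auc_smote:.3f} (after SMOTE)")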


Cited By

  • (2025) Predicting NSSI behavior of Chinese secondary vocational school students with different machine learning methods: Dealing with categorical data imbalance with a resampling technique. Current Psychology. https://doi.org/10.1007/s12144-025-07436-4. Online publication date: 3 February 2025.

    Published In

    ACMSE '24: Proceedings of the 2024 ACM Southeast Conference
    April 2024
    337 pages
    ISBN: 9798400702372
    DOI: 10.1145/3603287

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Classifier
    2. Imbalanced dataset
    3. ROC
    4. Synthesize

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    ACM SE '24: 2024 ACM Southeast Conference
    April 18 - 20, 2024
    Marietta, GA, USA

    Acceptance Rates

    ACMSE '24 Paper Acceptance Rate 44 of 137 submissions, 32%;
    Overall Acceptance Rate 502 of 1,023 submissions, 49%
