DOI: 10.1145/3490354.3494378

Synthesizing credit card transactions

Published: 04 May 2022

Abstract

As noted by Turing Laureates Geoffrey Hinton and Yann LeCun [16], two elements have been essential to AI's recent boom: (1) deep neural nets and the theory and practice behind them; and (2) cloud computing with its abundant labeled data and large computing resources.
Abundant labeled data is available for key domains such as images, speech, natural language processing, and recommendation engines. However, in many other domains such data is not available, or access is highly restricted for privacy reasons, as with health and financial data. Even when abundant data is available, it is often not labeled. Doing such labeling is labor-intensive and non-scalable.
To get around these data problems there have been many proposals to generate synthetic data [20, 24, 29, 30, 35, 39]. However, to the best of our knowledge, key domains still lack labeled data or have at most toy data, or the generator must have access to real data that it can mimic. Responding to some of the challenges outlined in [3] at ICAIF'2020, this paper outlines work to generate realistic synthetic data without those restrictions and for an important domain: credit card transactions, including both normal and fraudulent transactions.
At first glance it may appear simple to generate such transactions: just formalize a few items of the form, "Sally sold slacks to Sue on Sunday." However, real purchases contain many patterns and correlations. There are millions of merchants and innumerable locations, and those merchants offer a wide variety of goods. Determining who shops where and when becomes daunting, as does the question of how much people pay. Inserting fraudulent transactions into the mix, and doing all of this with no real seed data, poses the final challenges.
Generating good data to overcome these obstacles benefits from a mixture of technical approaches and domain knowledge. Those domains of knowledge include mechanics of credit card processing as well as a broad set of consumer domains, from electronics to clothing to hair styling to home improvement and many more. We also find that creation of a virtual world depicting people's commercial lives facilitates generation of high-quality, realistic data. This paper outlines some of our key techniques and provides evidence that the data generated is indeed realistic via comparisons to Federal Reserve data, recent data from a major card issuer, and more. At the end of the paper we also provide a link to a public sample of our data [2].
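As a rough illustration of the virtual-world idea, the sketch below simulates a handful of consumers who shop at merchants according to category preferences, then labels a small share of injected purchases as fraud. It is a minimal toy under stated assumptions, not the generator described in the paper: the merchant names, preference weights, log-normal amount model, and fraud rule are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Merchant:
    name: str
    category: str
    typical_amount: float  # assumed typical ticket size in dollars


@dataclass(frozen=True)
class Transaction:
    day: int
    consumer: str
    merchant: str
    amount: float
    is_fraud: bool


# Hypothetical world: none of these names or numbers come from the paper.
MERCHANTS = [
    Merchant("Corner Grocery", "grocery", 45.0),
    Merchant("Slacks & Co", "clothing", 80.0),
    Merchant("Volt Electronics", "electronics", 300.0),
]
CONSUMERS = {
    "Sally": {"grocery": 0.7, "clothing": 0.2, "electronics": 0.1},
    "Sue": {"grocery": 0.5, "clothing": 0.4, "electronics": 0.1},
}


def simulate(days: int, fraud_rate: float = 0.02, seed: int = 0) -> list[Transaction]:
    """Generate one legitimate purchase per consumer per day, plus rare fraud."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    by_category = {m.category: m for m in MERCHANTS}
    transactions = []
    for day in range(days):
        for name, prefs in CONSUMERS.items():
            # Pick a category according to this consumer's preferences.
            category = rng.choices(list(prefs), weights=list(prefs.values()))[0]
            merchant = by_category[category]
            # Assumed amount model: log-normal spread around the typical ticket.
            amount = round(merchant.typical_amount * rng.lognormvariate(0.0, 0.4), 2)
            transactions.append(Transaction(day, name, merchant.name, amount, False))
            # Assumed fraud rule: an extra, unusually large purchase at any merchant.
            if rng.random() < fraud_rate:
                target = rng.choice(MERCHANTS)
                fraud_amount = round(target.typical_amount * rng.uniform(2.0, 5.0), 2)
                transactions.append(Transaction(day, name, target.name, fraud_amount, True))
    return transactions
```

Calling `simulate(days=30, fraud_rate=0.1, seed=1)` returns a labeled list of transactions that can be written to a table. A realistic generator would need far richer structure (locations, time of day, merchant availability, income-dependent spending) than this toy captures.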
Although beyond the scope of this paper, our synthetic credit-card data also facilitates development and training of models to predict fraud. Those models coupled with the synthetic dataset also provide foundations for designing acceleration hardware, just as GPUs, TPUs [10, 19] and other devices have been used for domains such as image classification, object detection, natural language processing, etc.

References

[1]
Erik Altman. 2021. Anti-Money Laundering Data: InPlusLab Multi-Agent Virtual World Simulation. https://github.com/IBM/AML-Data.
[2]
Erik Altman. 2021. Credit Card Transactions: Fraud Detection and Other Analyses. https://www.kaggle.com/ealtman2019/credit-card-transactions.
[3]
Samuel Assefa, Danial Dervovic, Tucker Balch, Mahmoud Mahfouz, Robert Tillman, Prashant Reddy, and Manuela Veloso. 2020. Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls. 1st ACM International Conference on AI in Finance (October 2020).
[4]
Janio Martinez Bachmann. 2019. Credit Fraud: Dealing with Imbalanced Datasets. (2019). https://www.kaggle.com.
[5]
Evgeniy Bart and Shimon Ullman. 2005. Cross-Generalization: Learning Novel Classes from a Single Example by Feature Replacement. CVPR: Conference on Computer Vision and Pattern Recognition (2005).
[6]
Gianluca Bontempi and Tom Lenaerts. 2018. Credit Card Fraud Detection: Anonymized credit card transactions labeled as fraudulent or genuine. (2018). https://www.kaggle.com/mlg-ulb/creditcardfraud.
[7]
United States Census Bureau. 2020. Current Population Survey (CPS). (2020). https://www.census.gov/cps/data/cpstablecreator.html.
[8]
Carcillo et al. 2018. Scarff: a scalable framework for streaming credit card fraud detection with spark. Information Fusion (May 2018). https://arxiv.org/pdf/1709.08920.pdf.
[9]
Wikipedia / United States Census. 2020. Per capita personal income in the United States. (2020). https://en.wikipedia.org/wiki/Per_capita_personal_income_in_the_United_States.
[10]
Jack Choquette. 2017. Volta: Programmability and Performance. Hot Chips 29 (August 2017). https://www.hotchips.org/archives/2010s/hc29.
[11]
Daniel S. Coven. 2019. Free Zipcode Database with Latitude and Longitude. (2019). http://federalgovernmentzipcodes.us.
[12]
FederalReserve. 2017. Payments Study Annual Supplement. (2017).
[13]
Ian Goodfellow et al. 2014. Generative Adversarial Networks. NIPS: Advances in Neural Information Processing Systems (2014).
[14]
Habana-Inference 2019. Habana Goya Inference Processor. (2019). habana.ai/inference.
[15]
Habana-Training 2019. Gaudi AI Training: A New class of performance and scalability. (2019). habana.ai/training.
[16]
Geoffrey Hinton and Yann LeCun. 2019. The Deep Learning Revolution. (2019). https://fcrc.acm.org/turing-lecture-at-fcrc-2019.
[17]
W. Huber. 2017. Generate a random variable with a defined correlation to an existing variable(s). (2017). https://stats.stackexchange.com.
[18]
IEEE Computational Intelligence Society (IEEE-CIS) and Vesta Corporation. 2019. IEEE-CIS Fraud Detection. https://www.kaggle.com/c/ieee-fraud-detection.
[19]
Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. Intl Symposium on Computer Architecture (June 2017).
[20]
Xuan Li, Kunfeng Wang, Yonglin Tian, Lan Yan, and Fei-Yue Wang. 2017. The ParallelEye Dataset: Constructing Large-Scale Artificial Scenes for Traffic Vision Research. https://arxiv.org/abs/1712.08394 (December 2017).
[21]
Edgar Lopez. 2016. Synthetic Data from a Financial Payment System. (2016). https://www.kaggle.com/ealaxi/banksim1, http://edgarlopez.net.
[22]
Fortune Magazine. 2019. Fortune 500 Largest Corporations. (2019). http://fortune.com/fortune500/list.
[23]
Kevin Murphy et al. 2003. Using the Forest to See the Trees: A Graphical Model Relating Features, Objects & Scenes. NIPS: Advances in Neural Information Proc Sys (2003).
[24]
Neuromation. 2019. https://neuromation.io/marketplace. (2019).
[25]
Apoorva Nitsure. 2021. Anti-Money Laundering Data: InPlusLab Multi-Agent Virtual World Simulation. https://www.kaggle.com/apoorvanitsureibm/lightgbm-on-credit-card-transactions.
[26]
Board of Governors of the Federal Reserve System. 2018. Changes in U.S. Payments Fraud from 2012 to 2016: Evidence from the Federal Reserve Payments Study. (2018). https://www.federalreserve.gov/publications/files/changes-in-us-payments-fraud-from-2012-to-2016-20181016.pdf.
[27]
United States Bureau of Labor Statistics. 2017. What Is the Average Credit Score in the U.S.? (2017). https://www.bls.gov/oes/current/oes_stru.htm.
[28]
Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman. 2021. Tabular Transformers for Modeling Multivariate Time Series. ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (June 2021).
[29]
Patki et al. 2016. The Synthetic Data Vault. IEEE International Conference on Data Science and Advanced Analytics (October 2016).
[30]
Peng et al. 2015. Learning Deep Object Detectors from 3D Models. https://arxiv.org/abs/1412.7122 (October 2015).
[31]
Pozzolo et al. 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications (February 2014).
[32]
Pozzolo et al. 2015. Credit card fraud detection and concept-drift adaptation with delayed supervised information. IJCNN: International Joint Conference on Neural Networks (July 2015).
[33]
Lorien Y. Pratt. 1993. Discriminability-Based Transfer between Neural Networks. NIPS: Advances in Neural Information Processing Sys (1993).
[34]
Roxanna "Evan" Ramzipoor. 2018. 10 Fraud Myths: The Hidden Insights that Drive Business Decisions. (2018). blog.sift.com/2018/10-fraud-myths.
[35]
Donald Rubin. 1993. Discussion: Statistical Disclosure Limitation. Journal of Official Statistics 9, 2 (January 1993).
[36]
Stefan Lembo Stolba. 2020. What Is the Average Credit Score in the U.S.? Experian (2020). https://www.experian.com/blogs/ask-experian/what-is-the-average-credit-score-in-the-u-s.
[37]
Toyotaro Suzumura and Hiroki Kanezashi. 2021. Anti-Money Laundering Datasets: InPlusLab Anti-Money Laundering DataDatasets. http://github.com/IBM/AMLSim/.
[38]
Alexey Tsymbal. 2004. The problem of concept drift: definitions and related work. (2004). https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf.
[39]
Austin Walters. 2018. Why You Don't Necessarily Need Data for Data Science. (2018). https://medium.com/capitalonetech/whyyoudontnecessarilyneeddatafor-datascience48d7bf503074.

Published In

ICAIF '21: Proceedings of the Second ACM International Conference on AI in Finance
November 2021
450 pages
ISBN:9781450391481
DOI:10.1145/3490354
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. agent-based
  2. credit cards
  3. simulation
  4. synthetic data
  5. virtual world

Qualifiers

  • Research-article

Conference

ICAIF'21

Cited By

  • (2024) Evaluating Fairness in Transaction Fraud Models: Fairness Metrics, Bias Audits, and Challenges. Proceedings of the 5th ACM International Conference on AI in Finance, 555-563. DOI: 10.1145/3677052.3698666. Online publication date: 14-Nov-2024.
  • (2024) SHINE: A Scalable Heterogeneous Inductive Graph Neural Network for Large Imbalanced Datasets. IEEE Transactions on Knowledge and Data Engineering 36:9, 4904-4915. DOI: 10.1109/TKDE.2024.3381240. Online publication date: Sep-2024.
  • (2024) Auditing and Generating Synthetic Data with Controllable Trust Trade-offs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 1-1. DOI: 10.1109/JETCAS.2024.3477976. Online publication date: 2024.
  • (2024) Machine Learning Methods for Credit Card Fraud Detection: A Survey. IEEE Access 12, 158939-158965. DOI: 10.1109/ACCESS.2024.3487298. Online publication date: 2024.
  • (2024) FedFusion: Adaptive Model Fusion for Addressing Feature Discrepancies in Federated Credit Card Fraud Detection. IEEE Access 12, 136962-136978. DOI: 10.1109/ACCESS.2024.3464333. Online publication date: 2024.
  • (2024) MoMTSim: A Multi-Agent-Based Simulation Platform Calibrated for Mobile Money Transactions. IEEE Access 12, 120226-120238. DOI: 10.1109/ACCESS.2024.3439012. Online publication date: 2024.
  • (2024) Deep Learning for Credit Card Fraud Detection: A Review of Algorithms, Challenges, and Solutions. IEEE Access 12, 96893-96910. DOI: 10.1109/ACCESS.2024.3426955. Online publication date: 2024.
  • (2024) Real-World Efficacy of Explainable Artificial Intelligence using the SAGE Framework and Scenario-Based Design. Applied Artificial Intelligence 38:1. DOI: 10.1080/08839514.2024.2430867. Online publication date: 26-Nov-2024.
  • (2024) Automatic Card Fraud Detection Based on Decision Tree Algorithm. Applied Artificial Intelligence 38:1. DOI: 10.1080/08839514.2024.2385249. Online publication date: 29-Jul-2024.
  • (2024) Event-Aware Multi-component (EMl) Loss for Fraud Detection. Pattern Recognition, 105-119. DOI: 10.1007/978-3-031-78398-2_7. Online publication date: 2-Dec-2024.
