Synthesizing credit card transactions

Author:

Erik Altman

ICAIF '21: Proceedings of the Second ACM International Conference on AI in Finance

Article No.: 13, Pages 1 - 9

https://doi.org/10.1145/3490354.3494378

Published: 04 May 2022

Abstract

As noted by Turing Laureates Geoffrey Hinton and Yann LeCun [16], two elements have been essential to AI's recent boom: (1) deep neural nets and the theory and practice behind them; and (2) cloud computing with its abundant labeled data and large computing resources.

Abundant labeled data is available for key domains such as images, speech, natural language processing, and recommendation engines. However, in many other domains such data is not available, or access is highly restricted for privacy reasons, as with health and financial data. Even when abundant data is available, it is often not labeled, and labeling it is labor-intensive and does not scale.

To get around these data problems, there have been many proposals to generate synthetic data [20, 24, 29, 30, 35, 39]. However, to the best of our knowledge, key domains still lack labeled data or have at most toy data, or the synthetic data generator must have access to real data that it can mimic. Looking to some of the challenges outlined in [3] at ICAIF'2020, this paper outlines work to generate realistic synthetic data without those restrictions and for an important domain: credit card transactions, including both normal and fraudulent transactions.

At first glance it may appear simple to generate such transactions: just formalize a few statements of the form "Sally sold slacks to Sue on Sunday." However, real purchases contain many patterns and correlations. There are millions of merchants and innumerable locations, and those merchants offer a wide variety of goods. Determining who shops where and when becomes daunting. How much people pay is also challenging to determine. Inserting fraudulent transactions into the mix, and doing all of these things with no real seed data, pose the final challenges.

Generating good data to overcome these obstacles benefits from a mixture of technical approaches and domain knowledge. Those domains of knowledge include mechanics of credit card processing as well as a broad set of consumer domains, from electronics to clothing to hair styling to home improvement and many more. We also find that creation of a virtual world depicting people's commercial lives facilitates generation of high-quality, realistic data. This paper outlines some of our key techniques and provides evidence that the data generated is indeed realistic via comparisons to Federal Reserve data, recent data from a major card issuer, and more. At the end of the paper we also provide a link to a public sample of our data [2].
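To make the virtual-world idea concrete, the following is a minimal toy sketch, not the generator described in the paper. It assumes a tiny world of people and merchants in which people normally shop in their home city, and it uses a hypothetical fraud pattern in which a compromised card may be used at any merchant. All names, categories, price ranges, and rates are illustrative assumptions.

```python
import random
from dataclasses import dataclass

# Illustrative assumption: category -> (min, max) purchase amount.
# These ranges are invented for the sketch, not taken from the paper.
CATEGORIES = {
    "grocery": (20.0, 120.0),
    "clothing": (15.0, 200.0),
    "electronics": (30.0, 900.0),
}


@dataclass
class Merchant:
    name: str
    category: str
    city: str


@dataclass
class Person:
    name: str
    home_city: str
    card: str


def make_world(rng, n_people=5, n_merchants=12):
    """Build a tiny hypothetical world of people and merchants."""
    cities = ["Springfield", "Riverton"]  # placeholder city names
    merchants = [
        Merchant(f"merchant_{i}", rng.choice(list(CATEGORIES)), rng.choice(cities))
        for i in range(n_merchants)
    ]
    people = [
        Person(f"person_{i}", rng.choice(cities), f"card_{i:04d}")
        for i in range(n_people)
    ]
    return people, merchants


def simulate(rng, people, merchants, days=7, fraud_rate=0.02):
    """Generate labeled transactions with no real seed data."""
    rows = []
    for day in range(days):
        for p in people:
            # Each person makes between 0 and 3 purchases per day.
            for _ in range(rng.randint(0, 3)):
                is_fraud = rng.random() < fraud_rate
                local = [m for m in merchants if m.city == p.home_city]
                # Assumed fraud pattern: a stolen card is used anywhere;
                # legitimate purchases stay in the home city when possible.
                pool = merchants if (is_fraud or not local) else local
                m = rng.choice(pool)
                lo, hi = CATEGORIES[m.category]
                rows.append({
                    "day": day,
                    "card": p.card,
                    "merchant": m.name,
                    "category": m.category,
                    "city": m.city,
                    "amount": round(rng.uniform(lo, hi), 2),
                    "is_fraud": is_fraud,
                })
    return rows


rng = random.Random(0)  # fixed seed so runs are repeatable
people, merchants = make_world(rng)
transactions = simulate(rng, people, merchants)
```

Even this toy version shows why the labels come for free: the simulator decides whether each transaction is fraudulent before emitting it. A realistic generator would need far richer behavior, such as the purchase patterns, merchant variety, and pricing discussed above.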

Although beyond the scope of this paper, our synthetic credit-card data also facilitates development and training of models to predict fraud. Those models coupled with the synthetic dataset also provide foundations for designing acceleration hardware, just as GPUs, TPUs [10, 19] and other devices have been used for domains such as image classification, object detection, and natural language processing.

References

[1] Erik Altman. 2021. Anti-Money Laundering Data: InPlusLab Multi-Agent Virtual World Simulation. https://github.com/IBM/AML-Data.
