Abstract
As noted by Turing Laureates Geoffrey Hinton and Yann LeCun [16], two elements have been essential to AI's recent boom: (1) deep neural nets and the theory and practice behind them; and (2) cloud computing with its abundant labeled data and large computing resources.
Abundant labeled data is available for key domains such as images, speech, natural language processing, and recommendation engines. However, in many other domains such data is not available, or access is highly restricted for privacy reasons, as with health and financial data. Even when abundant data is available, it is often not labeled, and labeling it is labor-intensive and does not scale.
To get around these data problems, there have been many proposals to generate synthetic data [20, 24, 29, 30, 35, 39]. However, to the best of our knowledge, key domains still lack labeled data or have at most toy data, or the synthetic data generator requires access to real data that it can mimic. Responding to some of the challenges outlined in [3] at ICAIF'2020, this paper outlines work to generate realistic synthetic data without those restrictions and for an important domain: credit card transactions, both normal and fraudulent.
At first glance it may appear simple to generate such transactions: just formalize a few statements such as "Sally sold slacks to Sue on Sunday." However, real purchases exhibit many patterns and correlations. There are millions of merchants in innumerable locations, and those merchants offer a wide variety of goods. Determining who shops where and when becomes daunting, as does determining how much people pay. Inserting fraudulent transactions into the mix, and doing all of this with no real seed data, pose the final challenges.
Generating good data in the face of these obstacles benefits from a mixture of technical approaches and domain knowledge. That domain knowledge spans the mechanics of credit card processing as well as a broad set of consumer domains, from electronics to clothing to hair styling to home improvement and many more. We also find that creating a virtual world depicting people's commercial lives facilitates the generation of high-quality, realistic data. This paper outlines some of our key techniques and provides evidence that the generated data is indeed realistic, through comparisons to Federal Reserve data, recent data from a major card issuer, and other sources. At the end of the paper we also provide a link to a public sample of our data [2].
Although it is beyond the scope of this paper, our synthetic credit-card data also facilitates the development and training of models to predict fraud. Those models, coupled with the synthetic dataset, also provide foundations for designing acceleration hardware, just as GPUs, TPUs [10, 19], and other devices have been used for domains such as image classification, object detection, and natural language processing.