Research article · DOI: 10.1145/3677052.3698660 · ICAIF Conference Proceedings

Tab-Distillation: Impacts of Dataset Distillation on Tabular Data For Outlier Detection

Published: 14 November 2024

Abstract

Dataset distillation aims to replace large training sets with significantly smaller synthetic sets while preserving essential information. This reduces the training cost of advanced deep learning models and is widely used in the image domain. Among the various distillation methods, "Dataset Condensation with Distribution Matching (DM)" stands out for its low synthesis cost and minimal hyperparameter tuning. Because it is computationally economical, DM is applicable to realistic scenarios, such as industries with large tabular datasets. However, its use on tabular data has not been extensively explored. In this study, we apply DM to tabular datasets for outlier detection. Our findings show that distillation effectively addresses class imbalance, a common issue in these datasets. The synthetic datasets offer better sample representation and clearer class separation between inliers and outliers. They also maintain high feature correlation, making them resilient to feature pruning. Classification models trained on these distilled datasets train faster and perform better, which will enhance outlier detection in industries that rely on tabular data.
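The distribution-matching idea behind DM can be illustrated with a minimal sketch. The published method matches mean embeddings of real and synthetic batches under many randomly initialized networks; the sketch below simplifies this to random linear embeddings optimized by plain gradient descent in NumPy. All names here (`distill_dm`, `n_syn`, `n_embed`) are illustrative, not taken from the paper.

```python
import numpy as np

def distill_dm(X_real, n_syn=10, n_embed=32, steps=200, lr=0.1, seed=0):
    """Distribution-matching sketch: learn a small synthetic set whose
    mean embedding under random linear maps matches that of the real data."""
    rng = np.random.default_rng(seed)
    d = X_real.shape[1]
    # Initialize synthetic samples from randomly chosen real rows.
    S = X_real[rng.choice(len(X_real), n_syn, replace=False)].copy()
    mu_real = X_real.mean(axis=0)
    for _ in range(steps):
        # Fresh random embedding each step (stand-in for a random network).
        W = rng.standard_normal((n_embed, d)) / np.sqrt(d)
        # Gap between real and synthetic means in embedding space.
        diff = W @ (mu_real - S.mean(axis=0))
        # Gradient-descent step on the synthetic samples for the
        # loss ||W mu_real - W mu_syn||^2 (same update for every row).
        S += lr * (2.0 / n_syn) * (W.T @ diff)
    return S
```

With a nonlinear embedding the per-row updates would differ, which is what lets DM shape the synthetic class distributions rather than just their means; this linear version only captures the core matching objective.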

References

[1] 2000. Census-Income (KDD). UCI Machine Learning Repository.
[2] Mohiuddin Ahmed, Abdun Naser Mahmood, and Md. Rafiqul Islam. 2016. A Survey of Anomaly Detection Techniques in Financial Domain. Future Gener. Comput. Syst. 55, C (Feb 2016), 278–288. https://doi.org/10.1016/j.future.2015.01.001
[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. 2019. Gradient based sample selection for online continual learning. Curran Associates Inc., Red Hook, NY, USA.
[4] Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository.
[5] Vadim Borisov, Tobias Leemann, Kathrin Sessler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2022. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems (2022), 1–21. https://doi.org/10.1109/tnnls.2022.3229161
[6] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. 2018. End-to-End Incremental Learning. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XII (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 241–257. https://doi.org/10.1007/978-3-030-01258-8_15
[7] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. 2022. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4750–4759.
[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection: A Survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages. https://doi.org/10.1145/1541880.1541882
[9] Nitesh Chawla, Kevin Bowyer, Lawrence Hall, and W. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR) 16 (June 2002), 321–357. https://doi.org/10.1613/jair.953
[10] Yutian Chen, Max Welling, and Alex Smola. 2010. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (Catalina Island, CA) (UAI’10). AUAI Press, Arlington, Virginia, USA, 109–116.
[11] Andrea Dal Pozzolo, Olivier Caelen, Yann-Aël Le Borgne, Serge Waterschoot, and Gianluca Bontempi. 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications 41 (Aug 2014), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
[12] Dayananda Herurkar, Mario Meier, and Jörn Hees. 2023. RECol: Reconstruction Error Columns for Outlier Detection. In KI 2023: Advances in Artificial Intelligence: 46th German Conference on AI, Berlin, Germany, September 26–29, 2023, Proceedings (Berlin, Germany). Springer-Verlag, Berlin, Heidelberg, 60–74. https://doi.org/10.1007/978-3-031-42608-7_6
[13] Dayananda Herurkar, Sebastian Palacio, Ahmed Anwar, Joern Hees, and Andreas Dengel. 2024. Fin-Fed-OD: Federated Outlier Detection on Financial Tabular Data. arXiv:2404.14933 [cs.LG] https://arxiv.org/abs/2404.14933
[14] Dayananda Herurkar, Timur Sattarov, Jörn Hees, Sebastian Palacio, Federico Raue, and Andreas Dengel. 2023. Cross-Domain Transformation for Outlier Detection on Tabular Datasets. In International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, June 18-23, 2023. IEEE, 1–8. https://doi.org/10.1109/IJCNN54540.2023.10191326
[15] Waleed Hilal, S. Andrew Gadsden, and John Yawney. 2022. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Systems with Applications 193 (2022), 116429. https://doi.org/10.1016/j.eswa.2021.116429
[16] Addison Howard, Bernadette Bouchon Meunier, IEEE CIS inversion, John Lei, Lynn Vesta, Marcus2010, and Prof. Hussein Abbass. 2019. IEEE-CIS Fraud Detection. https://kaggle.com/competitions/ieee-fraud-detection
[17] Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. 2021. Graph condensation for graph neural networks. arXiv preprint arXiv:2110.07580 (2021).
[18] Zahra Kazemi and Houman Zarrabi. 2017. Using deep networks for fraud detection in the credit card transactions. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI). 0630–0633. https://doi.org/10.1109/KBEI.2017.8324876
[19] Yongqi Li and Wenjie Li. 2021. Data distillation for text classification. arXiv preprint arXiv:2104.08448 (2021).
[20] Dmitry Medvedev and Alexander D’yakonov. 2021. New properties of the data distillation method when working with tabular data. In Analysis of Images, Social Networks and Texts: 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Selected Papers 9. Springer, 379–390.
[21] S. Moro, P. Rita, and P. Cortez. 2012. Bank Marketing. UCI Machine Learning Repository.
[22] Jack Nicholls, Aditya Kuppa, and Nhien-An Le-Khac. 2021. Financial Cybercrime: A Comprehensive Survey of Deep Learning Approaches to Tackle the Evolving Financial Crime Landscape. IEEE Access 9 (2021), 163965–163986. https://doi.org/10.1109/ACCESS.2021.3134076
[23] Ebberth L. Paula, Marcelo Ladeira, Rommel N. Carvalho, and Thiago Marzagão. 2016. Deep Learning Anomaly Detection as Support Fraud Investigation in Brazilian Exports and Anti-Money Laundering. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA). 954–960. https://doi.org/10.1109/ICMLA.2016.0172
[24] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, G. Sperl, and Christoph H. Lampert. 2016. iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 5533–5542. https://api.semanticscholar.org/CorpusID:206596260
[25] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4393–4402. https://proceedings.mlr.press/v80/ruff18a.html
[26] Timur Sattarov, Dayananda Herurkar, and Jörn Hees. 2022. Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. CoRR abs/2209.10658 (2022). https://doi.org/10.48550/ARXIV.2209.10658 arXiv:2209.10658
[27] Marco Schreyer, Timur Sattarov, Damian Borth, Andreas Dengel, and Bernd Reimer. 2017. Detection of Anomalies in Large Scale Accounting Data using Deep Autoencoder Networks. https://doi.org/10.48550/ARXIV.1709.05254
[28] Ivan Tomek. 1976. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6, 11 (1976), 769–772. https://doi.org/10.1109/TSMC.1976.4309452
[29] Mariya Toneva, Alessandro Sordoni, Rémi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2018. An Empirical Study of Example Forgetting during Deep Neural Network Learning. ArXiv abs/1812.05159 (2018). https://api.semanticscholar.org/CorpusID:55481903
[30] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. 2018. Dataset distillation. arXiv preprint arXiv:1811.10959 (2018).
[31] Roy Wedge, James Max Kanter, Santiago Moral Rubio, Sergio Iglesias Perez, and Kalyan Veeramachaneni. 2017. Solving the "false positives" problem in fraud prediction. arXiv:1710.07709 [cs.AI]
[32] Xindi Wu, Byron Zhang, Zhiwei Deng, and Olga Russakovsky. 2023. Vision-language dataset distillation. (2023).
[33] I-Cheng Yeh. 2016. Default of Credit Card Clients. UCI Machine Learning Repository.
[34] Bo Zhao and Hakan Bilen. 2023. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6514–6523.
[35] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. 2020. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020).
[36] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Dae ki Cho, and Haifeng Chen. 2018. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:51805340


Published In

ICAIF '24: Proceedings of the 5th ACM International Conference on AI in Finance, November 2024, 878 pages. ISBN: 9798400710810. DOI: 10.1145/3677052.

Publisher: Association for Computing Machinery, New York, NY, United States



Author Tags

  1. dataset distillation
  2. feature correlation
  3. imbalanced dataset
  4. neural networks
  5. outlier detection
  6. tabular data

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Article Metrics

  • Total Citations: 0
  • Total Downloads: 88 (88 in the last 12 months; 19 in the last 6 weeks)

Reflects downloads up to 28 Feb 2025
