Research article · DOI: 10.1145/3677052.3698660 · ICAIF Conference Proceedings

Tab-Distillation: Impacts of Dataset Distillation on Tabular Data For Outlier Detection

Published: 14 November 2024

Abstract

Dataset distillation aims to replace large training sets with significantly smaller synthetic sets while preserving essential information. This reduces the training cost of advanced deep learning models and is widely used in the image domain. Among the various distillation methods, "Dataset Condensation with Distribution Matching (DM)" stands out for its low synthesis cost and minimal hyperparameter tuning. Because it is computationally economical, DM is applicable to realistic scenarios, such as industries with large tabular datasets. However, its use on tabular data has not been extensively explored. In this study, we apply DM to tabular datasets for outlier detection. Our findings show that distillation effectively addresses class imbalance, a common issue in these datasets. The synthetic datasets offer better sample representation and clearer class separation between inliers and outliers. They also maintain high feature correlation, making them resilient to feature pruning. Classification models trained on these distilled datasets train faster and perform better, which will enhance outlier detection in industries that rely on tabular data.
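The distribution-matching idea behind DM can be illustrated with a minimal sketch. The published method matches mean embeddings of real and synthetic batches under many randomly initialized networks; the sketch below simplifies this to random linear embeddings optimized by plain gradient descent in NumPy. All names here (`distill_dm`, `n_syn`, `n_embed`) are illustrative, not taken from the paper.

```python
import numpy as np

def distill_dm(X_real, n_syn=10, n_embed=32, steps=200, lr=0.1, seed=0):
    """Distribution-matching sketch: learn a small synthetic set whose
    mean embedding under random linear maps matches that of the real data."""
    rng = np.random.default_rng(seed)
    d = X_real.shape[1]
    # Initialize synthetic samples from randomly chosen real rows.
    S = X_real[rng.choice(len(X_real), n_syn, replace=False)].copy()
    mu_real = X_real.mean(axis=0)
    for _ in range(steps):
        # Fresh random embedding each step (stand-in for a random network).
        W = rng.standard_normal((n_embed, d)) / np.sqrt(d)
        # Gap between real and synthetic means in embedding space.
        diff = W @ (mu_real - S.mean(axis=0))
        # Gradient-descent step on the synthetic samples for the
        # loss ||W mu_real - W mu_syn||^2 (same update for every row).
        S += lr * (2.0 / n_syn) * (W.T @ diff)
    return S
```

With a nonlinear embedding the per-row updates would differ, which is what lets DM shape the synthetic class distributions rather than just their means; this linear version only captures the core matching objective.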

References

[1] 2000. Census-Income (KDD). UCI Machine Learning Repository.
[2] Mohiuddin Ahmed, Abdun Naser Mahmood, and Md. Rafiqul Islam. 2016. A Survey of Anomaly Detection Techniques in Financial Domain. Future Gener. Comput. Syst. 55, C (Feb 2016), 278–288. https://doi.org/10.1016/j.future.2015.01.001
[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. 2019. Gradient based sample selection for online continual learning. Curran Associates Inc., Red Hook, NY, USA.
[4] Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository.
[5] Vadim Borisov, Tobias Leemann, Kathrin Sessler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2022. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems (2022), 1–21. https://doi.org/10.1109/tnnls.2022.3229161
[6] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. 2018. End-to-End Incremental Learning. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XII (Munich, Germany). Springer-Verlag, Berlin, Heidelberg, 241–257. https://doi.org/10.1007/978-3-030-01258-8_15
[7] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. 2022. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4750–4759.
[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection: A Survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages. https://doi.org/10.1145/1541880.1541882
[9] Nitesh Chawla, Kevin Bowyer, Lawrence Hall, and W. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR) 16 (June 2002), 321–357. https://doi.org/10.1613/jair.953
[10] Yutian Chen, Max Welling, and Alex Smola. 2010. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (Catalina Island, CA) (UAI’10). AUAI Press, Arlington, Virginia, USA, 109–116.
[11] Andrea Dal Pozzolo, Olivier Caelen, Yann-Aël Le Borgne, Serge Waterschoot, and Gianluca Bontempi. 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications 41 (Aug 2014), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
[12] Dayananda Herurkar, Mario Meier, and Jörn Hees. 2023. RECol: Reconstruction Error Columns for Outlier Detection. In KI 2023: Advances in Artificial Intelligence: 46th German Conference on AI, Berlin, Germany, September 26–29, 2023, Proceedings (Berlin, Germany). Springer-Verlag, Berlin, Heidelberg, 60–74. https://doi.org/10.1007/978-3-031-42608-7_6
[13] Dayananda Herurkar, Sebastian Palacio, Ahmed Anwar, Joern Hees, and Andreas Dengel. 2024. Fin-Fed-OD: Federated Outlier Detection on Financial Tabular Data. arXiv:2404.14933 [cs.LG] https://arxiv.org/abs/2404.14933
[14] Dayananda Herurkar, Timur Sattarov, Jörn Hees, Sebastian Palacio, Federico Raue, and Andreas Dengel. 2023. Cross-Domain Transformation for Outlier Detection on Tabular Datasets. In International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, June 18-23, 2023. IEEE, 1–8. https://doi.org/10.1109/IJCNN54540.2023.10191326
[15] Waleed Hilal, S. Andrew Gadsden, and John Yawney. 2022. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances. Expert Systems with Applications 193 (2022), 116429. https://doi.org/10.1016/j.eswa.2021.116429
[16] Addison Howard, Bernadette Bouchon Meunier, IEEE CIS inversion, John Lei, Lynn Vesta, Marcus2010, and Prof. Hussein Abbass. 2019. IEEE-CIS Fraud Detection. https://kaggle.com/competitions/ieee-fraud-detection
[17] Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. 2021. Graph condensation for graph neural networks. arXiv preprint arXiv:2110.07580 (2021).
[18] Zahra Kazemi and Houman Zarrabi. 2017. Using deep networks for fraud detection in the credit card transactions. In 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI). 0630–0633. https://doi.org/10.1109/KBEI.2017.8324876
[19] Yongqi Li and Wenjie Li. 2021. Data distillation for text classification. arXiv preprint arXiv:2104.08448 (2021).
[20] Dmitry Medvedev and Alexander D’yakonov. 2021. New properties of the data distillation method when working with tabular data. In Analysis of Images, Social Networks and Texts: 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Selected Papers 9. Springer, 379–390.
[21] S. Moro, P. Rita, and P. Cortez. 2012. Bank Marketing. UCI Machine Learning Repository.
[22] Jack Nicholls, Aditya Kuppa, and Nhien-An Le-Khac. 2021. Financial Cybercrime: A Comprehensive Survey of Deep Learning Approaches to Tackle the Evolving Financial Crime Landscape. IEEE Access 9 (2021), 163965–163986. https://doi.org/10.1109/ACCESS.2021.3134076
[23] Ebberth L. Paula, Marcelo Ladeira, Rommel N. Carvalho, and Thiago Marzagão. 2016. Deep Learning Anomaly Detection as Support Fraud Investigation in Brazilian Exports and Anti-Money Laundering. In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA). 954–960. https://doi.org/10.1109/ICMLA.2016.0172
[24] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, G. Sperl, and Christoph H. Lampert. 2016. iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 5533–5542. https://api.semanticscholar.org/CorpusID:206596260
[25] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4393–4402. https://proceedings.mlr.press/v80/ruff18a.html
[26] Timur Sattarov, Dayananda Herurkar, and Jörn Hees. 2022. Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. CoRR abs/2209.10658 (2022). https://doi.org/10.48550/ARXIV.2209.10658 arXiv:2209.10658
[27] Marco Schreyer, Timur Sattarov, Damian Borth, Andreas Dengel, and Bernd Reimer. 2017. Detection of Anomalies in Large Scale Accounting Data using Deep Autoencoder Networks. https://doi.org/10.48550/ARXIV.1709.05254
[28] Ivan Tomek. 1976. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6, 11 (1976), 769–772. https://doi.org/10.1109/TSMC.1976.4309452
[29] Mariya Toneva, Alessandro Sordoni, Rémi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2018. An Empirical Study of Example Forgetting during Deep Neural Network Learning. ArXiv abs/1812.05159 (2018). https://api.semanticscholar.org/CorpusID:55481903
[30] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. 2018. Dataset distillation. arXiv preprint arXiv:1811.10959 (2018).
[31] Roy Wedge, James Max Kanter, Santiago Moral Rubio, Sergio Iglesias Perez, and Kalyan Veeramachaneni. 2017. Solving the "false positives" problem in fraud prediction. arXiv:1710.07709 [cs.AI]
[32] Xindi Wu, Byron Zhang, Zhiwei Deng, and Olga Russakovsky. 2023. Vision-language dataset distillation. (2023).
[33] I-Cheng Yeh. 2016. Default of Credit Card Clients. UCI Machine Learning Repository.
[34] Bo Zhao and Hakan Bilen. 2023. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6514–6523.
[35] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. 2020. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020).
[36] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Dae ki Cho, and Haifeng Chen. 2018. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:51805340


Published In

ICAIF '24: Proceedings of the 5th ACM International Conference on AI in Finance, November 2024, 878 pages. ISBN: 9798400710810. DOI: 10.1145/3677052.

Publisher: Association for Computing Machinery, New York, NY, United States



Author Tags

  1. dataset distillation
  2. feature correlation
  3. imbalanced dataset
  4. neural networks
  5. outlier detection
  6. tabular data

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Article Metrics

  • Total Citations: 0
  • Total Downloads: 88 (88 in the last 12 months; 19 in the last 6 weeks)

Reflects downloads up to 28 Feb 2025
