DOI: 10.1145/3534678.3542604

Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems

Published: 14 August 2022

Abstract

It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important, as the quality of the data directly influences the quality of the model. In this tutorial, we will discuss the role and importance of exploratory data analysis (EDA) and data visualisation techniques in finding data quality issues and in preparing data for ML pipelines. We will also discuss the latest advances in these fields and highlight areas that need innovation. To make the tutorial actionable for practitioners, we will review the most popular open-source packages for getting started, along with their strengths and weaknesses. Finally, we will discuss the challenges posed by industry workloads and the gaps that must be addressed to make data-centric AI real in industry settings.

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-centric ai
  2. exploratory data analysis
  3. large scale analysis
  4. machine learning
  5. visualization techniques

Qualifiers

  • Abstract

Conference

KDD '22

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 246
  • Downloads (last 6 weeks): 26
Reflects downloads up to 17 Jan 2025.

Cited By

  • (2024) Domain-wise data acquisition to improve performance under distribution shift. Proceedings of the 41st International Conference on Machine Learning (17934-17945). DOI: 10.5555/3692070.3692788. Online publication date: 21-Jul-2024
  • (2024) A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 13:11 (2156). DOI: 10.3390/electronics13112156. Online publication date: 1-Jun-2024
  • (2024) Opportunities and Challenges in Data-Centric AI. IEEE Access 12 (33173-33189). DOI: 10.1109/ACCESS.2024.3369417. Online publication date: 2024
  • (2024) ChatGPT and Its Role in Academic Libraries: A Discussion. New Review of Academic Librarianship 30:4 (422-436). DOI: 10.1080/13614533.2024.2381510. Online publication date: 29-Jul-2024
  • (2024) A Data-Centric Approach to improve performance of deep learning models. Scientific Reports 14:1. DOI: 10.1038/s41598-024-73643-x. Online publication date: 27-Sep-2024
  • (2024) From 2015 to 2023: How Machine Learning Aids Natural Product Analysis. Chemistry Africa. DOI: 10.1007/s42250-024-01154-3. Online publication date: 31-Dec-2024
  • (2024) Discerning Challenges of Security Information and Event Management (SIEM) Systems in Large Organizations. Human Aspects of Information Security and Assurance (339-354). DOI: 10.1007/978-3-031-72559-3_23. Online publication date: 28-Nov-2024
  • (2023) Few-shot Named Entity Recognition: Definition, Taxonomy and Research Directions. ACM Transactions on Intelligent Systems and Technology 14:5 (1-46). DOI: 10.1145/3609483. Online publication date: 9-Oct-2023
  • (2023) Data-centric AI to Improve Early Detection of Mental Illness. 2023 IEEE Statistical Signal Processing Workshop (SSP) (369-373). DOI: 10.1109/SSP53291.2023.10207938. Online publication date: 2-Jul-2023
  • (2022) Introduction to the Special Section on AI in Manufacturing. ACM SIGKDD Explorations Newsletter 24:2 (81-85). DOI: 10.1145/3575637.3575650. Online publication date: 8-Dec-2022
