Data Quality for Machine Learning Tasks

Authors:
Nitin Gupta (IBM Research India, Delhi, India)
Shashank Mujumdar (IBM Research India, Delhi, India)
Hima Patel (IBM Research India, Bengaluru, India)
Satoshi Masuda (IBM Research Japan, Tokyo, Japan)
Naveen Panwar (IBM Research India, Bengaluru, India)
Sambaran Bandyopadhyay (IBM Research India, Bengaluru, India)
Sameep Mehta (IBM Research India, Bengaluru, India)
Shanmukha Guttula (IBM Research India, Bengaluru, India)
Shazia Afzal (IBM Research India, Bengaluru, India)
Ruhi Sharma Mittal (IBM Research India, Bengaluru, India)
Vitobha Munigala (IBM Research India, Bengaluru, India)

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, August 2021, Pages 4040–4041. https://doi.org/10.1145/3447548.3470817

Published: 14 August 2021

ABSTRACT

The quality of training data has a major impact on the efficiency, accuracy, and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation, or annotation stages. This necessitates profiling and assessing data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there have been limited efforts towards improving data quality.

Assessing the quality of data across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, helps reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for ML applications. Finding data quality issues helps different personas, such as data stewards, data scientists, subject matter experts, and machine learning scientists, to get relevant data insights and take remedial actions to rectify any issue. This tutorial surveys the important data quality approaches for structured, unstructured, and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the work IBM Research is doing in this space.
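As a rough illustration of what assessing data across quality metrics can look like, the sketch below computes three simple scores for a small labeled table: the missing-value rate, the duplicate-row rate, and a class balance ratio. The function name `quality_report`, the choice of metrics, and the toy rows are assumptions made for illustration; they are not taken from the tutorial.

```python
from collections import Counter

def quality_report(rows, label_key):
    """Compute a few simple, illustrative data quality scores.

    rows: list of dicts, one per record; None marks a missing value.
    label_key: name of the column holding the class label.
    """
    if not rows:
        raise ValueError("rows must not be empty")

    # Missing-value rate: fraction of all cells that are None.
    total_cells = sum(len(r) for r in rows)
    missing_cells = sum(1 for r in rows for v in r.values() if v is None)

    # Duplicate rate: fraction of rows that exactly repeat an earlier row.
    seen = Counter(tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Class balance: size of the rarest class divided by the largest one
    # (1.0 means perfectly balanced; values near 0 mean heavy imbalance).
    labels = Counter(r[label_key] for r in rows if r.get(label_key) is not None)
    balance = min(labels.values()) / max(labels.values()) if labels else 0.0

    return {
        "missing_rate": missing_cells / total_cells if total_cells else 0.0,
        "duplicate_rate": duplicates / len(rows),
        "class_balance": balance,
    }

# Hypothetical toy data: two missing cells, one exact duplicate row.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": 51, "income": None, "label": "yes"},
]
report = quality_report(rows, "label")
```

Scores of this kind could be tracked per dataset version, with a remedial transformation such as imputation, deduplication, or resampling applied when a score crosses a chosen threshold.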

Index Terms

1. Computing methodologies
  1. Machine learning

Published in
KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021
4259 pages
ISBN: 9781450383325
DOI: 10.1145/3447548
General Chairs: Feida Zhu (Singapore Management University), Beng Chin Ooi (National University of Singapore), Chunyan Miao (Nanyang Technological University)
Program Chairs: Haixun Wang, Iryna Skrypnyk, Wynne Hsu, Sanjay Chawla
Copyright © 2021 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data quality
machine learning
quality metrics
Qualifiers
- abstract
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%