ABSTRACT
The quality of training data has a large impact on the efficiency, accuracy, and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation, or annotation stages. This necessitates profiling and assessing data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there have been limited efforts towards improving data quality.
Assessing the quality of data across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, helps reduce the effort a data scientist spends on iteratively debugging the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for ML applications. Finding data quality issues helps different personas, such as data stewards, data scientists, subject matter experts, and machine learning scientists, to get relevant data insights and take remedial actions to rectify any issues. This tutorial surveys the important data quality approaches for structured, unstructured, and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the work IBM Research is doing in this space.
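As a rough illustration of what assessing a dataset across a few quality metrics could look like, the following is a minimal sketch in plain Python. The function name `quality_report`, the choice of three metrics (missing-value fraction, duplicate-row fraction, and majority-to-minority class ratio), and the sample records are all assumptions made for illustration; they are not the metrics or tooling presented in the tutorial.

```python
from collections import Counter


def quality_report(rows, label_key):
    """Compute three simple, illustrative quality metrics for a list of dict records.

    Hypothetical helper: the metric definitions below are assumptions,
    not taken from the tutorial.
    """
    if not rows:
        raise ValueError("rows must not be empty")

    # Use the union of all keys so that an absent key counts as a missing cell.
    columns = sorted({key for row in rows for key in row})

    # Fraction of cells that are missing (None or absent).
    total_cells = len(rows) * len(columns)
    missing_cells = sum(
        1 for row in rows for col in columns if row.get(col) is None
    )

    # Fraction of rows that exactly repeat an earlier row.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        fingerprint = tuple((col, row.get(col)) for col in columns)
        if fingerprint in seen:
            duplicate_rows += 1
        else:
            seen.add(fingerprint)

    # Class imbalance: majority class count divided by minority class count,
    # computed over rows that have a label.
    label_counts = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    if label_counts:
        imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
    else:
        imbalance_ratio = None

    return {
        "missing_fraction": missing_cells / total_cells,
        "duplicate_fraction": duplicate_rows / len(rows),
        "imbalance_ratio": imbalance_ratio,
    }


# Made-up sample records, used only to exercise the sketch.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": 51, "income": None, "label": "yes"},
]

report = quality_report(rows, "label")
for metric, value in report.items():
    print(metric, value)
```

A report like this only flags potential gaps. Deciding on a remedial transformation, such as imputing missing values, dropping duplicates, or resampling classes, would still depend on the task and on input from the relevant persona.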
Index Terms
- Data Quality for Machine Learning Tasks