DOI: 10.1145/3447548.3470817

Data Quality for Machine Learning Tasks

Published: 14 August 2021

ABSTRACT

The quality of training data has a huge impact on the efficiency, accuracy, and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation, or annotation stages. This necessitates profiling and assessing data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there have been limited efforts towards improving data quality.

Assessing the quality of data across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, helps reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for ML applications. Finding data quality issues helps different personas, such as data stewards, data scientists, subject matter experts, and machine learning scientists, get relevant data insights and take remedial actions to rectify any issue. This tutorial surveys all the important data quality approaches for the structured, unstructured, and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the interesting work IBM Research is doing in this space.
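
As a rough illustration of pairing quality metrics with transformation operations, the sketch below computes three simple metrics on a toy tabular dataset and applies one remedial step. It is not taken from the tutorial: the function names (`missing_ratio`, `imbalance_ratio`, `duplicate_ratio`, `deduplicate`), the choice of metrics, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def missing_ratio(rows, column):
    """Fraction of rows whose value in `column` is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def _row_key(row):
    # Hashable, order-independent representation of a row (hypothetical helper).
    return tuple(sorted(row.items()))


def duplicate_ratio(rows):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    unique = {_row_key(row) for row in rows}
    return (len(rows) - len(unique)) / len(rows)


def deduplicate(rows):
    """Transformation: keep only the first occurrence of each distinct row."""
    seen = set()
    kept = []
    for row in rows:
        key = _row_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Toy data, invented for this example.
rows = [
    {"age": 34, "label": "yes"},
    {"age": None, "label": "no"},
    {"age": 34, "label": "yes"},
    {"age": 51, "label": "yes"},
]

report = {
    "missing_age": missing_ratio(rows, "age"),
    "imbalance": imbalance_ratio([row["label"] for row in rows]),
    "duplicates": duplicate_ratio(rows),
}
cleaned = deduplicate(rows)
```

Here the metrics flag one missing `age` value, a 3:1 label imbalance, and one duplicated row; `deduplicate` addresses only the last of these. Real pipelines would likely use richer metrics and domain-specific remediation.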


Published in

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021
4259 pages
ISBN: 9781450383325
DOI: 10.1145/3447548

Copyright © 2021 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)
