DOI: 10.1145/3533028.3533311
research-article

GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines

Published: 12 June 2022

ABSTRACT

Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
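The idea of injecting errors at a chosen rate while retaining ground truth can be illustrated with a minimal sketch. This is not GouDa's implementation or API: the function names (`inject_errors`, `insert_typo`, `blank_out`), the two error types, and the cell-level interpretation of the error rate are all assumptions made for illustration only.

```python
import copy
import random

# Hypothetical error types for illustration; not taken from GouDa.
def insert_typo(value, rng):
    """Insert a stray '#' at a random position, so the value always changes."""
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + "#" + value[pos:]

def blank_out(value, rng):
    """Simulate a missing value by replacing the cell with an empty string."""
    return ""

ERROR_TYPES = {"typo": insert_typo, "missing": blank_out}

def inject_errors(clean_rows, error_rate, error_types, seed=0):
    """Return (dirty_rows, error_log) for a list of dicts with string values.

    error_rate is interpreted here as the fraction of cells to corrupt
    (an assumption). clean_rows is left untouched and serves as ground truth;
    error_log records which cell received which error type.
    """
    rng = random.Random(seed)
    dirty_rows = copy.deepcopy(clean_rows)
    cells = [(i, col) for i, row in enumerate(clean_rows) for col in row]
    n_errors = round(error_rate * len(cells))
    error_log = []
    # Sample cells without replacement so each cell is corrupted at most once.
    for i, col in rng.sample(cells, n_errors):
        name = rng.choice(error_types)
        dirty_rows[i][col] = ERROR_TYPES[name](clean_rows[i][col], rng)
        error_log.append((i, col, name))
    return dirty_rows, error_log

clean = [
    {"name": "Alice", "city": "Berlin"},
    {"name": "Bob", "city": "Hanover"},
    {"name": "Carol", "city": "Munich"},
    {"name": "Dave", "city": "Hamburg"},
]
dirty, log = inject_errors(clean, error_rate=0.25, error_types=["typo", "missing"])
for row_index, column, error_name in log:
    print(row_index, column, error_name, repr(dirty[row_index][column]))
```

Because the clean rows and the error log are kept alongside the dirty rows, the output of an error detection or repair step could be scored cell by cell against them.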


    • Published in

      DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
      June 2022
      63 pages
      ISBN: 9781450393751
      DOI: 10.1145/3533028

      Copyright © 2022 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      DEEM '22 paper acceptance rate: 9 of 13 submissions, 69%. Overall acceptance rate: 23 of 37 submissions, 62%.