GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines

Authors:
Valerie Restat, University of Hagen, Germany
Gerrit Boerner, University of Hagen, Germany
André Conrad, University of Hagen, Germany
Uta Störl, University of Hagen, Germany

DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
June 2022, Article No. 2, Pages 1–6
https://doi.org/10.1145/3533028.3533311

Published: 12 June 2022

ABSTRACT

Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
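
The idea of injecting errors at a chosen rate while recording ground truth can be illustrated with a minimal Python sketch. This is not GouDa's implementation or interface: the function `inject_errors`, the two error types (`missing`, `typo`), the per-cell definition of the error rate, and the sample table are all assumptions made for illustration only.

```python
import copy
import random
import string


def make_missing(value, rng):
    """Hypothetical error type: replace the value with a missing marker."""
    return None


def make_typo(value, rng):
    """Hypothetical error type: insert one random letter into the string form."""
    text = str(value)
    pos = rng.randrange(len(text) + 1)
    return text[:pos] + rng.choice(string.ascii_lowercase) + text[pos:]


# Assumed catalogue of error types; a real tool would offer many more.
ERROR_TYPES = {"missing": make_missing, "typo": make_typo}


def inject_errors(clean_rows, error_rate, error_types=ERROR_TYPES, seed=0):
    """Corrupt a fraction `error_rate` of all cells and log what was changed.

    Returns the dirty copy plus an error log; the untouched `clean_rows`
    together with the log serve as ground truth.
    """
    rng = random.Random(seed)  # seeded so that a run is reproducible
    dirty_rows = copy.deepcopy(clean_rows)
    cells = [(r, c) for r, row in enumerate(clean_rows) for c in range(len(row))]
    n_errors = round(error_rate * len(cells))
    error_log = []
    for r, c in rng.sample(cells, n_errors):  # each cell is corrupted at most once
        name = rng.choice(sorted(error_types))
        dirty_rows[r][c] = error_types[name](clean_rows[r][c], rng)
        error_log.append({"row": r, "col": c, "type": name, "clean": clean_rows[r][c]})
    return dirty_rows, error_log


# Made-up sample table: 4 rows x 3 columns = 12 cells.
clean = [
    ["Alice", 34, "Hagen"],
    ["Bob", 27, "Berlin"],
    ["Carol", 45, "Bonn"],
    ["Dave", 52, "Kiel"],
]
dirty, log = inject_errors(clean, error_rate=0.25)
for entry in log:
    print(entry, "->", repr(dirty[entry["row"]][entry["col"]]))
```

Because the log records the position, error type, and original value of every corrupted cell, the output of an error detection or repair step can be compared cell by cell against it, which is the kind of evaluation the abstract describes.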


Index Terms

Information systems → Data management systems → Information integration → Data cleaning

Published in
DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
June 2022
63 pages
ISBN: 9781450393751
DOI: 10.1145/3533028
Conference Chairs: Matthias Boehm, Paroma Varma, Doris Xin
Copyright © 2022 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
data preparation pipelines
data sets
error generation
evaluation
Qualifiers
- research-article
Acceptance Rates
DEEM '22 Paper Acceptance Rate: 9 of 13 submissions, 69%
Overall Acceptance Rate: 23 of 37 submissions, 62%