ABSTRACT
Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of tools exists to simplify this process. However, suitable data sets for evaluating and comparing existing tools and new research approaches are often lacking. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. Alongside each data set with automatically generated errors, the corresponding ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
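The idea of injecting errors at a chosen rate while retaining the ground truth can be illustrated with a minimal sketch. This is not GouDa's implementation or API: the function names (`inject_errors`, `missing_value`, `typo`), the cell-level definition of the error rate, and the two error types are all assumptions made for illustration only.

```python
import random


def missing_value(value, rng):
    """Hypothetical error type: replace the cell with a missing value."""
    return None


def typo(value, rng):
    """Hypothetical error type: insert a stray character at a random position."""
    text = str(value)
    pos = rng.randrange(len(text) + 1)
    return text[:pos] + "x" + text[pos:]


def inject_errors(rows, error_rate, corrupt, seed=0):
    """Corrupt a fraction of cells and record the original values.

    Assumption: the error rate is the share of corrupted cells among all
    cells of the table. Returns the dirty rows and a ground-truth mapping
    from (row index, column index) to the original clean value.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    n_errors = round(error_rate * len(cells))

    dirty = [list(row) for row in rows]  # copy, so the clean input is untouched
    ground_truth = {}
    for r, c in rng.sample(cells, n_errors):
        ground_truth[(r, c)] = dirty[r][c]
        dirty[r][c] = corrupt(dirty[r][c], rng)
    return dirty, ground_truth


if __name__ == "__main__":
    clean = [
        ["Alice", "Berlin", "1990"],
        ["Bob", "Passau", "1985"],
        ["Carol", "Munich", "1978"],
        ["Dave", "Hamburg", "2001"],
    ]
    dirty, truth = inject_errors(clean, error_rate=0.25, corrupt=typo)
    for (r, c), original in sorted(truth.items()):
        print(f"cell ({r}, {c}): {original!r} -> {dirty[r][c]!r}")
```

Because the ground truth records exactly which cells were changed and what they contained, the output of an error detection or repair tool run on the dirty table can be scored against it directly.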
Index Terms
- GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines