Abstract
We present an approach to validating the quality of ingested data using conditional metrics, a novel kind of data quality metric that is computed over specific parts of the ingestion data. We propose a method that automatically derives conditional metrics from historical ingestion sequences and uses stability as the criterion for selecting which of them to implement as data unit tests. If an ingestion batch fails any unit test, we show how conditional metrics can be used to identify potential errors. We demonstrate the effectiveness of our approach through an evaluation on a real-world data set under various error scenarios.
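To make the idea concrete, the following is a minimal Python sketch and not the implementation from the paper. It assumes that a conditional metric is the completeness of one column restricted to rows where another column has a given value, that "stable" means a low standard deviation of the metric across historical batches, and that a unit test is a tolerance interval around the historical mean. All function names, the thresholds `max_std` and `k`, and the column names are hypothetical.

```python
from statistics import mean, pstdev

def conditional_completeness(batch, column, cond_col, cond_val):
    """Fraction of non-missing `column` values among rows where cond_col == cond_val.

    Returns None when no row in the batch satisfies the condition.
    """
    rows = [r for r in batch if r.get(cond_col) == cond_val]
    if not rows:
        return None
    return sum(r.get(column) is not None for r in rows) / len(rows)

def derive_tests(history, column, cond_col, max_std=0.05, k=3.0):
    """Derive one unit test per condition value whose metric is stable over history.

    Assumption: stability is a population standard deviation of at most max_std.
    Each test is an interval (low, high) of accepted metric values.
    """
    cond_vals = {r[cond_col] for b in history for r in b if r.get(cond_col) is not None}
    tests = {}
    for val in cond_vals:
        values = [conditional_completeness(b, column, cond_col, val) for b in history]
        if None in values or len(values) < 2:
            continue  # condition not observed in every batch: skip it
        sigma = pstdev(values)
        if sigma <= max_std:
            mu = mean(values)
            eps = 1e-9  # small slack for floating-point comparisons
            tests[val] = (mu - k * sigma - eps, mu + k * sigma + eps)
    return tests

def failing_conditions(batch, tests, column, cond_col):
    """Return the condition values whose unit test fails on the new batch."""
    failed = []
    for val, (low, high) in tests.items():
        value = conditional_completeness(batch, column, cond_col, val)
        if value is None or not (low <= value <= high):
            failed.append(val)
    return sorted(failed, key=str)
```

In this sketch, the conditions returned by `failing_conditions` point at the part of the batch where the metric deviates from its history, which is one simple way to narrow down where a potential error lies.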
Notes
- 1. The techniques in this paper can easily be generalized to multiple relations. We focus on one relation, however, to keep the presentation simple.
Acknowledgements.
We thank Kris Luyten for many helpful discussions on this paper. S. Vansummeren was supported by the Bijzonder Onderzoeksfonds (BOF) of Hasselt University under Grant No. BOF20ZAP02. This work is partially funded by the Research Foundation - Flanders (FWO-grant G055219N). The computing resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bylois, N., Neven, F., Vansummeren, S. (2023). Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering. In: Abelló, A., Vassiliadis, P., Romero, O., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2023. Lecture Notes in Computer Science, vol 13985. Springer, Cham. https://doi.org/10.1007/978-3-031-42914-9_15
DOI: https://doi.org/10.1007/978-3-031-42914-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42913-2
Online ISBN: 978-3-031-42914-9
eBook Packages: Computer Science (R0)