Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering

Bylois, Niels; Neven, Frank; Vansummeren, Stijn

doi:10.1007/978-3-031-42914-9_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13985)

Included in the following conference series: European Conference on Advances in Databases and Information Systems

Abstract

We present an approach to validating the quality of ingested data using conditional metrics, a novel form of metric that computes data quality metrics over specific parts of the ingestion data. We propose a method that automatically derives conditional metrics from historical ingestion sequences, using stability as the criterion for selecting which metrics to implement as data unit tests. If an ingestion batch fails any unit test, we show how conditional metrics can be used to identify potential errors. We demonstrate the effectiveness of our approach through an evaluation on a real-world data set under various error scenarios.
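To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes completeness (the fraction of non-null values) as the underlying quality metric, a condition of the form "column equals value", and a stability rule based on the interquartile range of the metric over historical batches. The column names `train_type` and `delay`, the thresholds `max_iqr` and `k`, and all function names are hypothetical and chosen only for illustration.

```python
from statistics import quantiles


def completeness(rows, column):
    """Fraction of rows whose `column` is not None (1.0 if there are no rows)."""
    if not rows:
        return 1.0
    return sum(row[column] is not None for row in rows) / len(rows)


def conditional_completeness(batch, cond_col, cond_val, target_col):
    """Completeness of `target_col` over the rows where `cond_col == cond_val`."""
    part = [row for row in batch if row[cond_col] == cond_val]
    return completeness(part, target_col)


def derive_unit_test(history, max_iqr=0.2, k=1.5):
    """Assumed stability rule: keep a metric only if its historical IQR is small.

    Returns the accepted range (low, high), or None if the metric is unstable.
    Requires at least two historical values.
    """
    q1, _, q3 = quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr > max_iqr:
        return None
    return q1 - k * iqr, q3 + k * iqr


def make_batch(missing):
    """Toy batch: 10 'IC' rows (the first `missing` lack a delay) and 10 'L' rows."""
    ic = [{"train_type": "IC", "delay": None if i < missing else 3} for i in range(10)]
    local = [{"train_type": "L", "delay": 1} for _ in range(10)]
    return ic + local


# Metric values observed on five hypothetical historical batches.
history = [
    conditional_completeness(make_batch(m), "train_type", "IC", "delay")
    for m in (0, 1, 0, 1, 0)
]
bounds = derive_unit_test(history)


def passes(batch):
    """Data unit test: the conditional metric must fall inside the derived range."""
    value = conditional_completeness(batch, "train_type", "IC", "delay")
    return bounds[0] <= value <= bounds[1]


good_batch_ok = passes(make_batch(1))  # one missing delay among the IC rows
bad_batch_ok = passes(make_batch(6))   # six missing delays among the IC rows
```

In this toy setting the metric is stable over the history, so it is kept as a unit test; the batch with six missing delays among the `IC` rows falls outside the derived range and fails, and the condition `train_type == "IC"` indicates where to look for the error.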


Notes

1. The techniques in this paper can easily be generalized to multiple relations. We focus on one relation, however, to keep the presentation simple.
2. https://en.wikipedia.org/wiki/Interquartile_range
3. http://www.nmbs.be


Acknowledgements

We thank Kris Luyten for many helpful discussions on this paper. S. Vansummeren was supported by the Bijzonder Onderzoeksfonds (BOF) of Hasselt University under Grant No. BOF20ZAP02. This work is partially funded by the Research Foundation - Flanders (FWO-grant G055219N). The computing resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.

Author information

Authors and Affiliations

Hasselt University and transnational University of Limburg, Data Science Institute, Diepenbeek, Belgium
Niels Bylois, Frank Neven & Stijn Vansummeren


Corresponding author

Correspondence to Niels Bylois.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Alberto Abelló
University of Ioannina, Ioannina, Greece
Panos Vassiliadis
Universitat Politècnica de Catalunya, Barcelona, Spain
Oscar Romero
Poznan University of Technology, Poznan, Poland
Robert Wrembel


Cite this paper

Bylois, N., Neven, F., Vansummeren, S. (2023). Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering. In: Abelló, A., Vassiliadis, P., Romero, O., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2023. Lecture Notes in Computer Science, vol 13985. Springer, Cham. https://doi.org/10.1007/978-3-031-42914-9_15


DOI: https://doi.org/10.1007/978-3-031-42914-9_15
Published: 28 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42913-2
Online ISBN: 978-3-031-42914-9
eBook Packages: Computer Science, Computer Science (R0)
