Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2023)

Abstract

We present an approach to validating the quality of ingested data using conditional metrics, a novel kind of metric that computes a data quality metric over a specific part of the ingestion data. We propose a method that automatically derives conditional metrics from historical ingestion sequences and uses stability as the criterion for selecting which of them to implement as data unit tests. When an ingestion batch fails any unit test, we show how conditional metrics can be used to identify potential errors. We demonstrate the effectiveness of our approach through an evaluation on a real-world data set under various error scenarios.
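To make the idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a conditional metric is the completeness of one column restricted to rows that satisfy a condition, and that "stable" means the interquartile range of that metric over historical batches stays below a threshold. The column names (`status`, `delay`), the helper names, the `max_iqr` threshold, and the 1.5 × IQR acceptance band are all hypothetical choices made for illustration.

```python
from statistics import quantiles


def completeness(rows, column, condition):
    """Fraction of non-null `column` values among rows satisfying `condition`."""
    selected = [r for r in rows if condition(r)]
    if not selected:
        return None  # metric is undefined when no row matches the condition
    return sum(r.get(column) is not None for r in selected) / len(selected)


def derive_test(history, column, condition, max_iqr=0.05, k=1.5):
    """Return a unit test for the conditional metric, or None if it is unstable.

    Assumption: stability is judged by the IQR of the metric over `history`.
    """
    values = [completeness(batch, column, condition) for batch in history]
    if any(v is None for v in values):
        return None
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr > max_iqr:
        return None  # metric fluctuates too much to serve as a test
    low, high = q1 - k * iqr, q3 + k * iqr

    def unit_test(batch):
        m = completeness(batch, column, condition)
        return m is not None and low <= m <= high

    return unit_test


def make_batch(n_departed, n_cancelled, n_missing=0):
    """Hypothetical batch: cancelled trains never carry a delay value."""
    rows = [{"status": "departed", "delay": 3}] * (n_departed - n_missing)
    rows += [{"status": "departed", "delay": None}] * n_missing
    rows += [{"status": "cancelled", "delay": None}] * n_cancelled
    return rows


history = [make_batch(90, 10), make_batch(60, 40), make_batch(80, 20),
           make_batch(50, 50), make_batch(95, 5)]

# Unconditional completeness of `delay` varies with the share of cancellations.
always = lambda r: True
# Conditioned on departed trains, completeness of `delay` is constant.
departed = lambda r: r["status"] == "departed"

unconditional_test = derive_test(history, "delay", always)
conditional_test = derive_test(history, "delay", departed)
```

In this toy history the unconditional metric is rejected as unstable, while the conditional one yields a test that a clean batch passes and a batch with missing delays for departed trains fails. Which conditions to enumerate, and how failing tests are ranked and filtered, are left out of the sketch.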

Notes

  1. The techniques in this paper can easily be generalized to multiple relations. We focus on one relation, however, to keep the presentation simple.

  2. https://en.wikipedia.org/wiki/Interquartile_range

  3. http://www.nmbs.be

Acknowledgements.

We thank Kris Luyten for many helpful discussions on this paper. S. Vansummeren was supported by the Bijzonder Onderzoeksfonds (BOF) of Hasselt University under Grant No. BOF20ZAP02. This work is partially funded by the Research Foundation - Flanders (FWO-grant G055219N). The computing resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.

Corresponding author

Correspondence to Niels Bylois.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bylois, N., Neven, F., Vansummeren, S. (2023). Data Ingestion Validation through Stable Conditional Metrics with Ranking and Filtering. In: Abelló, A., Vassiliadis, P., Romero, O., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2023. Lecture Notes in Computer Science, vol 13985. Springer, Cham. https://doi.org/10.1007/978-3-031-42914-9_15

  • DOI: https://doi.org/10.1007/978-3-031-42914-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42913-2

  • Online ISBN: 978-3-031-42914-9

  • eBook Packages: Computer Science (R0)
