Automating Data Quality Monitoring with Reference Data Profiles

  • Conference paper
Data Management Technologies and Applications (DATA 2022, DATA 2021)

Abstract

Data quality is of central importance for the qualitative evaluation of decisions taken by AI-based applications. In practice, data from several heterogeneous data sources is integrated, but complete, global domain knowledge is often not available. In such heterogeneous scenarios, it is particularly difficult to monitor data quality (e.g., completeness, accuracy, timeliness) over time. In this paper, we formally introduce a new data-centric method for automated data quality monitoring, which is based on reference data profiles. A reference data profile is a set of data profiling statistics that is learned automatically to model the target quality of the data. In contrast to most existing data quality approaches that require domain experts to define rules, our method can be fully automated from initialization to continuous monitoring. This data-centric method has been implemented in our data quality tool DQ-MeeRKat and evaluated with six real-world telematic device data streams.
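
The idea of learning a reference data profile and then checking later data against it can be illustrated with a minimal sketch. This is not the DQ-MeeRKat implementation: the class `ReferenceProfile`, the functions `learn_profile` and `conformance`, the chosen statistics, and the `tolerance` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class ReferenceProfile:
    """Hypothetical profile of one numeric attribute (illustrative only)."""
    null_fraction: float
    minimum: float
    maximum: float
    mean: float


def learn_profile(values: Sequence[Optional[float]]) -> ReferenceProfile:
    """Compute simple profiling statistics; None marks a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot profile a batch that contains no values")
    return ReferenceProfile(
        null_fraction=1 - len(present) / len(values),
        minimum=min(present),
        maximum=max(present),
        mean=mean(present),
    )


def conformance(
    reference: ReferenceProfile,
    batch: Sequence[Optional[float]],
    tolerance: float = 0.1,  # assumed threshold, not taken from the paper
) -> Dict[str, bool]:
    """Profile a new batch and compare each statistic with the reference."""
    current = learn_profile(batch)
    # Allowed deviation, scaled by the value range seen during learning.
    slack = tolerance * (reference.maximum - reference.minimum)
    return {
        "null_fraction": current.null_fraction <= reference.null_fraction + tolerance,
        "minimum": current.minimum >= reference.minimum - slack,
        "maximum": current.maximum <= reference.maximum + slack,
        "mean": abs(current.mean - reference.mean) <= slack,
    }
```

In this sketch, a profile would be learned once from data that is assumed to be of acceptable quality, for example `learn_profile([20.0, 21.0, 22.0, 23.0, 24.0])`, and `conformance` would then be called for every new batch of the stream. No rules have to be written by a domain expert, which is the property the abstract emphasizes.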

Notes

  1. As pointed out by Lauren Gartner in her keynote at the 2020 MIT CDOIQ Symposium: https://www.youtube.com/watch?v=-LOvwtJvIZM (Apr. 2022).

  2. The requirements were later refined in [22], but here we use the original source to comply with [6].

  3. http://dqm.faw.jku.at/ontologies/dsd (Apr. 2022).

  4. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225 (Apr. 2022).

  5. https://www.w3.org/TR/2012/REC-owl2-primer-20121211 (Apr. 2022).

  6. http://graphdb.ontotext.com (Apr. 2022).

  7. https://db-engines.com/de/ranking/time+series+dbms (Apr. 2022).

  8. https://grafana.com (Apr. 2022).

  9. https://www.tributech.io (Apr. 2022).

  10. https://github.com/lisehr/dq-meerkat/tree/master/documentation/TributechRDPs/ (Apr. 2022).

  11. https://github.com/lisehr/dq-meerkat (Apr. 2022).

  12. https://www.influxdata.com (Apr. 2022).

  13. https://grafana.com (Apr. 2022).

  14. http://graphdb.ontotext.com (Apr. 2022).

  15. Visualizations of the KG in GraphDB are provided at https://github.com/lisehr/dq-meerkat/tree/master/documentation/kg-visualization/ (Apr. 2022).

References

  1. Abadi, D., et al.: The Seattle report on database research. ACM SIGMOD Rec. 48(4), 44–53 (2019)

  2. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015)

  3. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data profiling. Synth. Lect. Data Manag. 10(4), 1–154 (2019)

  4. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000)

  5. Bronselaer, A.: Data quality management: an overview of methods and challenges. In: Andreasen, T., De Tré, G., Kacprzyk, J., Legind Larsen, H., Bordogna, G., Zadrożny, S. (eds.) FQAS 2021. LNCS (LNAI), vol. 12871, pp. 127–141. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86967-0_10

  6. Bronselaer, A., De Mol, R., De Tré, G.: A measure-theoretic foundation for data quality. IEEE Trans. Fuzzy Syst. 26(2), 627–639 (2018)

  7. Dell’Aglio, D., Della Valle, E., van Harmelen, F., Bernstein, A.: Stream reasoning: a survey and outlook. Data Sci. 1(1–2), 59–83 (2017)

  8. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20(5), 533–534 (2020)

  9. Ehrlinger, L., Gindlhumer, A., Huber, L., Wöß, W.: DQ-MeeRKat: automating data quality monitoring with a reference-data-profile-annotated knowledge graph. In: Proceedings of the 10th International Conference on Data Science, Technology and Applications - DATA, pp. 215–222. SciTePress (2021)

  10. Ehrlinger, L., Werth, B., Wöß, W.: Automated continuous data quality measurement with QuaIIe. Int. J. Adv. Softw. 11(3 & 4), 400–417 (2018)

  11. Ehrlinger, L., Wöß, W.: Semi-automatically generated hybrid ontologies for information integration. In: SEMANTiCS (Posters & Demos). CEUR Workshop Proceedings, vol. 1481, pp. 100–104. RWTH, Aachen (2015)

  12. Ehrlinger, L., Wöß, W.: Towards a definition of knowledge graphs. In: Martin, M., Cuquet, M., Folmer, E. (eds.) Joint Proceedings of the Posters and Demos Track of 12th International Conference on Semantic Systems - SEMANTiCS2016 and 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS16). CEUR Workshop Proceedings, vol. 1695, pp. 13–16. Technical University of Aachen (RWTH), Aachen, Germany (2016)

  13. Ehrlinger, L., Wöß, W.: Automated data quality monitoring. In: Talburt, J.R. (ed.) Proceedings of the 22nd MIT International Conference on Information Quality (MIT ICIQ), UA Little Rock, Arkansas, USA, pp. 15.1–15.9 (2017)

  14. Ehrlinger, L., Wöß, W.: A survey of data quality measurement and monitoring tools. Front. Big Data 5(850611) (2022). https://doi.org/10.3389/fdata.2022.850611

  15. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  16. Fensel, D., et al.: Knowledge Graphs. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-37439-6

  17. Fischer, L., et al.: AI system engineering-key challenges and lessons learned. Mach. Learn. Knowl. Extr. 3(1), 56–83 (2021)

  18. Giebler, C., Gröger, C., Hoos, E., Schwarz, H., Mitschang, B.: Leveraging the data lake: current state and challenges. In: Ordonez, C., Song, I.-Y., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) DaWaK 2019. LNCS, vol. 11708, pp. 179–188. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27520-4_13

  19. Haegemans, T., Snoeck, M., Lemahieu, W.: Towards a precise definition of data accuracy and a justification for its measure. In: Proceedings of the International Conference on Information Quality (ICIQ 2016), Ciudad Real, Spain, pp. 16.1–16.13. Alarcos Research Group (UCLM) (2016)

  20. Heidari, A., McGrath, J., Ilyas, I.F., Rekatsinas, T.: Holodetect: few-shot learning for error detection. In: International Conference on Management of Data (SIGMOD 2019), New York, NY, USA, pp. 829–846. ACM (2019)

  21. Heine, F., Kleiner, C., Oelsner, T.: Automated detection and monitoring of advanced data quality rules. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) DEXA 2019. LNCS, vol. 11706, pp. 238–247. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27615-7_18

  22. Heinrich, B., Hristova, D., Klier, M., Schiller, A., Szubartowicz, M.: Requirements for data quality metrics. J. Data Inf. Qual. 9(2), 12:1–12:32 (2018)

  23. Hogan, A., et al.: Knowledge Graphs. CoRR (2020). https://arxiv.org/abs/2003.02320

  24. Kaiser, M., Klier, M., Heinrich, B.: How to measure data quality? - a metric-based approach. In: International Conference on Information Systems, Montreal, Canada, pp. 1–15. AIS Electronic Library (AISeL) (2007)

  25. Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Vancouver, BC, Canada. ACM (2011)

  26. Kiryakov, A., Ognyanov, D., Manov, D.: OWLIM – a pragmatic semantic repository for OWL. In: Dean, M., et al. (eds.) WISE 2005. LNCS, vol. 3807, pp. 182–192. Springer, Heidelberg (2005). https://doi.org/10.1007/11581116_19

  27. Klein, A., Lehner, W.: Representing data quality in sensor data streaming environments. J. Data Inf. Qual. (JDIQ) 1(2), 1–28 (2009)

  28. Laranjeiro, N., Soydemir, S.N., Bernardino, J.: A survey on data quality: classifying poor data. In: Proceedings of the 21st Pacific Rim International Symposium on Dependable Computing (PRDC), Zhangjiajie, China, pp. 179–188. IEEE (2015)

  29. Ledvinka, M., Křemen, P.: A comparison of object-triple mapping libraries. Semant. Web 1–43 (2019)

  30. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE (2008)

  31. Naqvi, S.N.Z., Yfantidou, S., Zimányi, E.: Time Series Databases and InfluxDB. Technical report, Université Libre de Bruxelles (2017)

  32. Pipino, L., Wang, R., Kopcso, D., Rybolt, W.: Developing measurement scales for data-quality dimensions. Inf. Qual. 1, 37–52 (2005)

  33. Redyuk, S., Kaoudi, Z., Markl, V., Schelter, S.: Automating data quality validation for dynamic data ingestion. In: Proceedings of the 24th International Conference on Extending Database Technology (EDBT) (2021)

  34. Scannapieco, M., Catarci, T.: Data quality under a computer science perspective. Archivi Comput. 2, 1–15 (2002)

  35. Sebastian-Coleman, L.: Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework. Elsevier, Waltham, MA, USA (2013)

  36. Stonebraker, M., et al.: Data curation at scale: the data tamer system. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013), Asilomar, California, USA (2013)

  37. Stonebraker, M., Ilyas, I.F.: Data integration: the current status and the way forward. Bull. IEEE Comput. Soc. Tech. Committee Data Eng. 41(2), 3–9 (2018)

  38. Sturges, H.A.: The choice of a class interval. J. Am. Stat. Assoc. 21(153), 65–66 (1926)

  39. Talburt, J.R., Al Sarkhi, A.K., Pullen, D., Claassens, L., Wang, R.: An iterative, self-assessing entity resolution system: first steps toward a data washing machine. Int. J. Adv. Comput. Sci. Appl. 11(12) (2020)

  40. Wang, R.Y., Strong, D.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)

  41. Zenisek, J., Holzinger, F., Affenzeller, M.: Machine learning based concept drift detection for predictive maintenance. Comput. Ind. Eng. 137, 106031 (2019)

Acknowledgements

The research reported in this paper has been funded by BMK, BMDW, and the Province of Upper Austria in the frame of the COMET Programme managed by FFG. The authors also thank Patrick Lamplmair of Tributech Solutions GmbH for providing the data streams as well as Alexander Gindlhumer and Lisa-Marie Huber for their support in the implementation of DQ-MeeRKat.

Author information

Corresponding author

Correspondence to Lisa Ehrlinger.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ehrlinger, L., Werth, B., Wöß, W. (2023). Automating Data Quality Monitoring with Reference Data Profiles. In: Cuzzocrea, A., Gusikhin, O., Hammoudi, S., Quix, C. (eds) Data Management Technologies and Applications. DATA 2022, DATA 2021. Communications in Computer and Information Science, vol 1860. Springer, Cham. https://doi.org/10.1007/978-3-031-37890-4_2

  • DOI: https://doi.org/10.1007/978-3-031-37890-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37889-8

  • Online ISBN: 978-3-031-37890-4

  • eBook Packages: Computer Science, Computer Science (R0)
