Abstract
Data quality is of central importance for the qualitative evaluation of decisions taken by AI-based applications. In practice, data from several heterogeneous data sources is integrated, but complete, global domain knowledge is often not available. In such heterogeneous scenarios, it is particularly difficult to monitor data quality (e.g., completeness, accuracy, timeliness) over time. In this paper, we formally introduce a new data-centric method for automated data quality monitoring, which is based on reference data profiles. A reference data profile is a set of data profiling statistics that is learned automatically to model the target quality of the data. In contrast to most existing data quality approaches that require domain experts to define rules, our method can be fully automated from initialization to continuous monitoring. This data-centric method has been implemented in our data quality tool DQ-MeeRKat and evaluated with six real-world telematic device data streams.
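To make the idea of a reference data profile concrete, the following minimal sketch learns a handful of profiling statistics from a batch that is assumed to have the target quality, then checks a later batch against them. It is an illustration under stated assumptions, not the DQ-MeeRKat implementation: the names `ReferenceProfile`, `learn_profile`, and `conformance`, the choice of statistics, and the single `tolerance` parameter are all hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class ReferenceProfile:
    """Hypothetical profile of one numeric attribute (not the DQ-MeeRKat schema)."""
    null_ratio: float
    minimum: float
    maximum: float
    mean_value: float
    std_dev: float


def learn_profile(values: Sequence[Optional[float]]) -> ReferenceProfile:
    """Compute profiling statistics for a batch; None marks a missing value."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        raise ValueError("need at least two non-null values")
    return ReferenceProfile(
        null_ratio=1 - len(present) / len(values),
        minimum=min(present),
        maximum=max(present),
        mean_value=statistics.mean(present),
        std_dev=statistics.stdev(present),
    )


def conformance(reference: ReferenceProfile,
                batch: Sequence[Optional[float]],
                tolerance: float = 0.1) -> Dict[str, bool]:
    """Check each statistic of a new batch against the reference profile.

    Assumption: deviations are allowed up to `tolerance` times the
    reference value range (and `tolerance` itself for the null ratio).
    """
    current = learn_profile(batch)
    slack = tolerance * (reference.maximum - reference.minimum)
    return {
        "null_ratio": current.null_ratio <= reference.null_ratio + tolerance,
        "minimum": current.minimum >= reference.minimum - slack,
        "maximum": current.maximum <= reference.maximum + slack,
        "mean_value": abs(current.mean_value - reference.mean_value) <= slack,
        "std_dev": abs(current.std_dev - reference.std_dev) <= slack,
    }


# Learn the profile once from a trusted batch, then monitor later batches.
reference = learn_profile([20.0, 21.0, 22.0, 23.0, 24.0])
good = conformance(reference, [20.0, 21.5, 22.0, 22.5, 24.0])
bad = conformance(reference, [21.0, None, None, 80.0])
```

In this sketch no expert-defined rule is involved: the reference batch alone determines which values count as conforming, which is the property the abstract emphasizes.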
Notes
- 1. As pointed out by Lauren Gartner in her keynote at the 2020 MIT CDOIQ Symposium https://www.youtube.com/watch?v=-LOvwtJvIZM (Apr. 2022).
- 2.
- 3. http://dqm.faw.jku.at/ontologies/dsd (Apr. 2022).
- 4.
- 5. https://www.w3.org/TR/2012/REC-owl2-primer-20121211 (Apr. 2022).
- 6. http://graphdb.ontotext.com (Apr. 2022).
- 7. https://db-engines.com/de/ranking/time+series+dbms (Apr. 2022).
- 8. https://grafana.com (Apr. 2022).
- 9. https://www.tributech.io (Apr. 2022).
- 10.
- 11. https://github.com/lisehr/dq-meerkat (Apr. 2022).
- 12. https://www.influxdata.com (Apr. 2022).
- 13. https://grafana.com (Apr. 2022).
- 14. http://graphdb.ontotext.com (Apr. 2022).
- 15. Visualizations of the KG in GraphDB are provided on: https://github.com/lisehr/dq-meerkat/tree/master/documentation/kg-visualization/ (Apr. 2022).
Acknowledgements
The research reported in this paper has been funded by BMK, BMDW, and the Province of Upper Austria in the frame of the COMET Programme managed by FFG. The authors also thank Patrick Lamplmair of Tributech Solutions GmbH for providing the data streams as well as Alexander Gindlhumer and Lisa-Marie Huber for their support in the implementation of DQ-MeeRKat.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ehrlinger, L., Werth, B., Wöß, W. (2023). Automating Data Quality Monitoring with Reference Data Profiles. In: Cuzzocrea, A., Gusikhin, O., Hammoudi, S., Quix, C. (eds) Data Management Technologies and Applications. DATA 2021, DATA 2022. Communications in Computer and Information Science, vol 1860. Springer, Cham. https://doi.org/10.1007/978-3-031-37890-4_2
DOI: https://doi.org/10.1007/978-3-031-37890-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37889-8
Online ISBN: 978-3-031-37890-4
eBook Packages: Computer Science, Computer Science (R0)