Abstract
Data is of high quality if it is fit for its intended use in operations, decision-making, and planning. A vast amount of linked data is available on the web. However, defects in the data make it difficult to judge how well it fits a given modeling task. Faults in linked data propagate widely and affect every service built on it. Addressing linked data quality deficiencies requires identifying quality problems, assessing quality, and refining the data to improve its quality. This study aims to identify existing end-to-end frameworks for the assessment and improvement of linked data quality. One important finding is that most existing work addresses only one of these aspects rather than combining them. Another finding is that most frameworks aim at solving problems related to DBpedia. Therefore, a standard, scalable system is required that integrates the identification of quality issues with the evaluation and improvement of linked data quality. This survey contributes to understanding the state of the art in data quality evaluation and improvement. An ontology-based solution is also proposed for building an end-to-end system that analyzes the root causes of quality violations.
Acknowledgements
This work was funded by Science Foundation Ireland through the SFI Centre for Research Training in Machine Learning (18/CRT/6183).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Nayak, A., Božić, B., Longo, L. (2022). Linked Data Quality Assessment: A Survey. In: Xu, C., Xia, Y., Zhang, Y., Zhang, LJ. (eds) Web Services – ICWS 2021. ICWS 2021. Lecture Notes in Computer Science, vol 12994. Springer, Cham. https://doi.org/10.1007/978-3-030-96140-4_5
DOI: https://doi.org/10.1007/978-3-030-96140-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96139-8
Online ISBN: 978-3-030-96140-4
eBook Packages: Computer Science (R0)