Linked Data Quality Assessment: A Survey

  • Conference paper
Web Services – ICWS 2021 (ICWS 2021)

Abstract

Data is of high quality if it is fit for its intended use in operations, decision-making, and planning. A vast amount of linked data is available on the web, but defects in the data make it difficult to judge how well it suits a given modeling task. Faults in linked data propagate widely and affect every service built on it. Addressing linked data quality deficiencies requires identifying quality problems, assessing quality, and refining the data to improve it. This study aims to identify existing end-to-end frameworks for the assessment and improvement of linked data quality. One important finding is that most existing work addresses only one of these aspects rather than combining them. Another is that most frameworks target problems specific to DBpedia. A standard, scalable system is therefore needed that integrates the identification of quality issues, their evaluation, and the improvement of linked data quality. This survey contributes to understanding the state of the art in data quality evaluation and improvement. An ontology-based solution is also proposed for building an end-to-end system that analyzes the root causes of quality violations.

Acknowledgements

This work was funded by Science Foundation Ireland through the SFI Centre for Research Training in Machine Learning (18/CRT/6183).

Author information

Corresponding author

Correspondence to Aparna Nayak.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Nayak, A., Božić, B., Longo, L. (2022). Linked Data Quality Assessment: A Survey. In: Xu, C., Xia, Y., Zhang, Y., Zhang, LJ. (eds) Web Services – ICWS 2021. ICWS 2021. Lecture Notes in Computer Science, vol 12994. Springer, Cham. https://doi.org/10.1007/978-3-030-96140-4_5

  • DOI: https://doi.org/10.1007/978-3-030-96140-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96139-8

  • Online ISBN: 978-3-030-96140-4

  • eBook Packages: Computer Science, Computer Science (R0)
