
Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs

Published: 24 April 2020


RDF Knowledge Graphs (or Datasets) contain valuable information that can be exploited for a variety of real-world tasks. However, due to the enormous size of the available RDF datasets, it is difficult to discover the most valuable datasets for a given task. To improve dataset Discoverability, Interlinking, and Reusability, there is a trend toward Dataset Search systems. Such systems are mainly based on metadata and ignore the contents; however, in tasks related to data integration and enrichment, the contents of datasets have to be considered. For instance, dataset owners quite often want to enrich the content of their dataset by selecting datasets that provide complementary information. These tasks require content-based union and complement metrics between any subset of datasets; however, there is a lack of such approaches. To make the computation of such metrics feasible at very large scale, we propose an approach relying on (a) a set of pre-constructed (and periodically refreshed) semantics-aware indexes, and (b) “lattice-based” incremental algorithms that exploit the posting lists of such indexes, as well as set theory properties, to enable efficient responses at query time. Finally, we discuss the efficiency of the proposed methods by presenting comparative results, and we report measurements for 400 real RDF datasets (containing over 2 billion triples) obtained with the proposed metrics.
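
To make the idea of content-based union and complement metrics over posting lists more concrete, the following is a minimal sketch under stated assumptions. It is not the authors' implementation: the toy index, the dataset IDs, and the function names `union_size`, `complement_size`, and `best_union` are all hypothetical, and the measured items are assumed to be entities. The subset search is plain brute force, whereas the abstract describes lattice-based incremental algorithms that avoid recomputing each subset from scratch.

```python
from itertools import combinations

# Hypothetical toy index: entity -> posting list (IDs of datasets containing it).
index = {
    "e1": {"D1", "D2"},
    "e2": {"D1"},
    "e3": {"D2", "D3"},
    "e4": {"D3"},
    "e5": {"D1", "D2", "D3"},
}

def union_size(index, subset):
    """Count entities that occur in at least one dataset of `subset`."""
    subset = set(subset)
    return sum(1 for posting in index.values() if posting & subset)

def complement_size(index, target, subset):
    """Count entities offered by `subset` that `target` does not contain."""
    subset = set(subset)
    return sum(
        1
        for posting in index.values()
        if target not in posting and posting & subset
    )

def best_union(index, datasets, k):
    """Brute-force search for the k datasets with the largest union
    (an assumption for illustration, not the lattice-based method)."""
    return max(
        combinations(sorted(datasets), k),
        key=lambda s: union_size(index, s),
    )

if __name__ == "__main__":
    print(union_size(index, {"D1", "D2"}))
    print(complement_size(index, "D1", {"D2", "D3"}))
    print(best_union(index, {"D1", "D2", "D3"}, 2))
```

Each metric is answered by a single pass over the posting lists, without touching the datasets themselves, which is the role the abstract assigns to the pre-constructed indexes.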


Published In

Journal of Data and Information Quality  Volume 12, Issue 2
Special Issue on Quality Assessment of Knowledge Graphs and On the Horizon
June 2020
105 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 April 2020
Accepted: 01 November 2019
Revised: 01 August 2019
Received: 01 March 2019
Published in JDIQ Volume 12, Issue 2



Author Tags

  1. Dataset search
  2. contextual connectivity
  3. data integration
  4. dataset quality
  5. discoverability
  6. enrichment
  7. interlinking
  8. lattice of measurements
  9. linked data
  10. relevancy
  11. reusability


  • Research-article
  • Research
  • Refereed

Funding Sources

  • Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under the HFRI PhD Fellowship grant


Cited By

  • (2023) JQPro: Join Query Processing in a Distributed System for Big RDF Data Using the Hash-Merge Join Technique. Mathematics 11:5 (1275). DOI: 10.3390/math11051275. Online publication date: 6-Mar-2023
  • (2023) Using Multiple RDF Knowledge Graphs for Enriching ChatGPT Responses. Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track (324-329). DOI: 10.1007/978-3-031-43430-3_24. Online publication date: 18-Sep-2023
  • (2022) Schema and content aware classification for predicting the sources containing an answer over corpus and knowledge graphs. PeerJ Computer Science 8 (e846). DOI: 10.7717/peerj-cs.846. Online publication date: 3-Mar-2022
  • (2022) A Two-Phase Method for Optimization of the SPARQL Query. Journal of Sensors 2022 (1-12). DOI: 10.1155/2022/4624856. Online publication date: 25-Aug-2022
  • (2022) ACORDAR. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2981-2991). DOI: 10.1145/3477495.3531729. Online publication date: 6-Jul-2022
  • (2022) Modular framework for similarity-based dataset discovery using external knowledge. Data Technologies and Applications 56:4 (506-535). DOI: 10.1108/DTA-09-2021-0261. Online publication date: 15-Feb-2022
  • (2022) Open dataset discovery using context-enhanced similarity search. Knowledge and Information Systems 64:12 (3265-3291). DOI: 10.1007/s10115-022-01751-z. Online publication date: 1-Dec-2022
  • (2022) How Your Cultural Dataset is Connected to the Rest Linked Open Data? Trandisciplinary Multispectral Modelling and Cooperation for the Preservation of Cultural Heritage (136-148). DOI: 10.1007/978-3-031-20253-7_12. Online publication date: 24-Nov-2022
  • (2022) LODChain: Strengthen the Connectivity of Your RDF Dataset to the Rest LOD Cloud. The Semantic Web – ISWC 2022 (537-555). DOI: 10.1007/978-3-031-19433-7_31. Online publication date: 23-Oct-2022
  • (2021) Linking Entities from Text to Hundreds of RDF Datasets for Enabling Large Scale Entity Enrichment. Knowledge 2:1 (1-25). DOI: 10.3390/knowledge2010001. Online publication date: 24-Dec-2021
