
1 Introduction

In the Web of Data, applications such as Link Discovery or Data Integration frameworks need to know where a specific URI is located. However, due to the decentralized architecture of the Web of Data and the limited reliability and availability of Linked Data services, locating such URIs is not a trivial task. Locating a URI from a well-known data dump might be easy. For example, it is trivial to know that the URI http://dbpedia.org/resource/Leipzig belongs to the DBpedia dataset. However, locating the dataset where the URI http://citeseer.rkbexplorer.com/id/resource-CS116606 was first defined is a time-consuming task. Consequently, this can greatly affect the scalable and time-efficient deployment of many Semantic Web applications such as link discovery, Linked Data enrichment, and federated query processing. On the other hand, such provenance information about URIs can be used to regenerate and validate links across datasets.

The limited availability of services that provide such information is unfortunately one of the key issues in the Semantic Web and Linked Data communities. It has been shown that around 90% of the information published as Linked Open Data is available as data dumps only and that more than 60% of endpoints are offline [14]. This availability problem is mostly due to the cost associated with storing the data and providing querying services over it.

To this end, we propose Where is my URI? (WIMU), a low-cost Semantic Web service that determines the RDF data sources in which a URI occurs, along with information about its use. If a single URI is provided by multiple data sources, we rank these sources using a scoring function. Currently, our service has processed more than 58 billion unique triples from more than 660,000 datasets obtained from LODStats [1] and LOD Laundromat [2]. For each URI, our service provides the corresponding datasets and the number of literals attached to that URI in each dataset. The service is available through a web interface and can also be queried from client applications using the standard HTTP protocol. We believe our service can be used in multiple Linked Data related problems such as devising fast link discovery frameworks, efficient source selection, and distributed query processing.

Our main contributions are as follows:

  • We provide a regularly updatedFootnote 1 database index of more than 660K datasets from LODStats and LOD Laundromat.

  • We provide an efficient, low-cost, and scalable web service that shows which dataset most likely defines a URI.

  • We provide various statistics of datasets indexed from LODStats and LOD Laundromat.

The service is available from https://w3id.org/where-is-my-uri/ under the GNU Affero General Public License 3.0 and the source code is available onlineFootnote 2.

The rest of the paper is organized as follows: We first provide a brief overview of the state-of-the-art. We then discuss the proposed approach in detail, including the index creation, the web interface, and the data processing. We finally present our evaluation results and conclude.

2 Related Work

The work presented in [4] shows how to set up a Linked Data repository called DataTank and publish data as Turtle files or through a SPARQL endpoint. The difference with WIMU is that we provide a RESTful service instead of a setup for configuring a Linked Data repository.

The work in [7] is based on an approach developed for the 3store RDF triple store and describes a technique for indexing RDF documents that allows the ranking and retrieval of their contents. Their index contained \(10^7\) triples, which was remarkable for the early years of the Semantic Web. However, their system is no longer available for tests. Similarly to our work, the authors claim that, for a given URI from an RDF document, the system retrieves the URLs of the documents containing that URI.

The approach called LOD-A-LOT [5], a queryable dump file of the LOD CloudFootnote 3, differs from WIMU in several respects. First, it is not possible to retrieve the provenance of a URI, i.e., to know in which dataset the URI was defined. Instead, a huge dump fileFootnote 4 containing all the data from LOD LaundromatFootnote 5 is provided. LOD Laundromat itself provides an endpoint to an inverted index of their dataFootnote 6. However, finding the original document a URI was defined in is not trivial, as the returned metadata only describe the datasets themselves [2]. Moreover, as the primary aim of LOD Laundromat is to “clean” Linked Data, most dumps are possibly not continuously monitored once cleaned.

Compared with all the approaches above, the main advantage of WIMU is that the datasets a URI likely belongs to are ranked using a score. Our index also has a larger coverage, as it includes data from the two largest Linked Data hubs, i.e., LODStats [1] and LOD Laundromat [2], as well as the most recently updated SPARQL endpoints. Finally, WIMU is able to process RDF files containing more than one URI at the same timeFootnote 7.

3 The Approach

WIMU uses the number of literals as a heuristic to identify the dataset which most likely defines a URI. The intuition behind this can be explained in two points: (1) literal values are the raw data that can disambiguate a URI node in the most direct way, and (2) the Semantic Web architecture expects that datasets reusing a URI only refer to it, without defining additional literal values. A further reason for point (1) is that it is straightforward to determine whether two literal values are different, whereas disambiguating URIs usually requires more effort.

We store the collected data in an Apache LuceneFootnote 8 index. For reasons of runtime performance and complexity, we found storing the information in Lucene more convenient than using a traditional triple store such as VirtuosoFootnote 9. The rationale behind this choice is that a tuple such as (URI, Dataset, Score) would be expressed using at least three triples; for instance:

figure a

where the subject is an index record URI. Therefore, materializing all records would have substantially increased the space complexity of our index.
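To illustrate this design choice, the following is a minimal sketch of how one such record could be stored as a single Lucene document instead of three triples. The field names (uri, dataset, literals), the index path, and the example values are our own assumptions for illustration, not necessarily those used by WIMU.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexRecordSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("wimu-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // One Lucene document per (URI, Dataset, Score) record,
            // instead of (at least) three triples in a triple store.
            Document doc = new Document();
            doc.add(new StringField("uri", "http://dbpedia.org/resource/Leipzig", Field.Store.YES));
            doc.add(new StringField("dataset", "http://example.org/datasets/dbpedia-dump.nt.gz", Field.Store.YES));
            doc.add(new StoredField("literals", 42)); // heuristic score: number of literals (illustrative value)
            writer.addDocument(doc);
        }
    }
}
```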

3.1 The Index Creation

The index creation, the core of our work, is shown in Fig. 1 and consists of the following five steps:

  1. Retrieve the list of datasets from the sources (i.e., LODStats and LOD Laundromat).

  2. Retrieve the data from the datasets (i.e., dump files, endpoints, and HDT files).

  3. Build three indexes from the dump files, endpoints, and HDT files.

  4. Merge the indexes into one.

  5. Make the index available and browsable via a web application and an API service.

Fig. 1. Creation workflow of the WIMU index.

For each processed dataset, we keep its URI as provenance. After downloading and extracting a dump file, we process it by counting, for each subject, the literals occurring as objects. For endpoints and HDT files, we use a SPARQL query:

figure b
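The exact query is given in the listing above (figure b); as a hedged reconstruction, a query of this kind counts the literal objects per subject. The sketch below shows how such a query could be executed against a SPARQL endpoint with Apache Jena; the endpoint URL is a placeholder.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class LiteralCountSketch {
    public static void main(String[] args) {
        // Counts literal objects per subject; a reconstruction of the query
        // referenced as figure b, not necessarily the exact one used by WIMU.
        String query =
            "SELECT ?s (COUNT(?o) AS ?literals) " +
            "WHERE { ?s ?p ?o . FILTER(isLiteral(?o)) } " +
            "GROUP BY ?s";
        try (QueryExecution qexec =
                 QueryExecutionFactory.sparqlService("http://example.org/sparql", query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("s") + "\t" + row.getLiteral("literals").getInt());
            }
        }
    }
}
```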

We process the data in parallel, distributing the datasets among the CPU cores. If a dataset is too large for a single core, we split it into smaller chunks. To save space, dump files are deleted after being processed. The index was generated on an Intel Xeon Core i7 machine with 64 cores and 128 GB RAM, running Ubuntu 14.04.5 LTS and Java SE Development Kit 8.
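A minimal sketch of this parallel processing step, assuming a hypothetical processDataset method that downloads a dump, counts literals per subject, writes the records to the index, and then deletes the dump:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexingSketch {

    public static void indexAll(List<String> datasetUrls) throws InterruptedException {
        // One worker per available CPU core; each dataset is processed independently.
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (String datasetUrl : datasetUrls) {
            pool.submit(() -> processDataset(datasetUrl));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS); // the full run reportedly took about three days
    }

    // Hypothetical: download the dump, count literals per subject,
    // write the records to the index, then delete the dump to save space.
    private static void processDataset(String datasetUrl) {
        // ...
    }
}
```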

3.2 The Web Interface and the API Service

In order to simplify the access to our service, we created a web interface where it is possible to visualize all the data from the service, as Fig. 2 shows.

Fig. 2. Web interface.

The web interface allows the user to query a URI and view the results in an HTML web browser, while the API service allows the user to work with the output in JSON format. Fig. 2 shows an example of usage of the service, where WIMU is asked for the dataset in which the URI dbpedia:Leipzig was defined. Figure 4 shows the generic usage of WIMU.

4 Use Cases

In this section, we present three use cases to show how our heuristic supports the proposed tasks.

4.1 Data Quality in Link Repositories

The first use-case is about quality assurance in a link repository by re-applying link discovery algorithms on the stored links. This task concerns important steps of the Linked Data Lifecycle, in particular Data Interlinking and Quality. Link repositories contain sets of links that connect resources belonging to different datasets. Unfortunately, the subject and the object URIs of a link often do not have metadata available, hence their Concise Bounded Descriptions (CBDs) are hard to obtain. In Fig. 3, in the notation \((D_1,...,D_n|x)\), \(D_n\) represents a dataset and x the number of literals. The input for our service in this use-case is S; the output is \(\{(D_1,3),(D_2,1),(D_3,2)\}\), where \(D_1\) most likely defines S due to the highest number of literals. In the same way, the dataset that most likely defines T is \(D_4\) with 7 literals. Once we have this information, the entire CBDs of the two resources S and T can be extracted and a Link Discovery algorithm can check whether the owl:sameAs link between them should subsist (Fig. 3).
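As an illustrative sketch of this selection step (assuming the service's ranked output has already been parsed into a map from dataset URI to literal count; retrieveCBD and linkDiscoveryCheck are hypothetical helpers):

```java
import java.util.Map;

public class UseCaseOneSketch {

    /** Returns the dataset with the highest literal count, e.g. D1 for {(D1,3),(D2,1),(D3,2)}. */
    static String mostLikelyDefiningDataset(Map<String, Integer> literalCounts) {
        return literalCounts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no candidate datasets"));
    }

    static void checkLink(String s, String t,
                          Map<String, Integer> candidatesForS,
                          Map<String, Integer> candidatesForT) {
        String datasetS = mostLikelyDefiningDataset(candidatesForS);
        String datasetT = mostLikelyDefiningDataset(candidatesForT);
        // Hypothetical follow-up: extract the CBDs of s and t from the selected
        // datasets and re-run the link discovery check on the owl:sameAs link.
        // Model cbdS = retrieveCBD(datasetS, s);
        // Model cbdT = retrieveCBD(datasetT, t);
        // boolean keepLink = linkDiscoveryCheck(cbdS, cbdT);
    }
}
```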

Fig. 3. First use-case.

4.2 Finding Class Axioms for Link Discovery

A class axiom is needed by the link discovery algorithm to reduce the number of comparisons. Here, the aim is to find two class axioms for each mapping in the link repository.

To this end, we use real data including a mappingFootnote 10 from the LinkLion repository [9] between http://citeseer.rkbexplorer.com/id/resource-CS65161 (S) and http://citeseer.rkbexplorer.com/id/resource-CS65161 (T). Our service shows that S was defined in four datasets, among which the one with the most literals was http://km.aifb.kit.edu/projects/btc-2009/btc-2009-chunk-039.gzFootnote 11. Thus, we can deduce where the URI S was most likely defined. Knowing the datasets allows us to extract the axioms of the classes our URIs belong to. The techniques to decrease the complexity vary from choosing the most specific class to using an ontology learning tool such as DL-Learner [8].

Fig. 4. Usage.

4.3 Federated Query Processing

Federated queries, which aim to collect information from more than one dataset, are of central importance for many Semantic Web and Linked Data applications [11, 12]. One of the key steps in federated query processing is source selection. The goal of source selection is to find relevant sources (i.e., datasets) for the given user query. In the next step, the federated query processing engine makes use of the source selection information to generate an optimized query execution plan. WIMU can be used by federated SPARQL engines to find the relevant sources for the individual triple patterns of a given SPARQL query. In particular, our service can be helpful during source selection and query planning in cost-based SPARQL federation engines such as SPLENDID [6], SemaGrow [3], HiBISCuS [13], and CostFed [10].
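As a sketch of this idea (the wimuDatasetsFor helper standing in for a call to the WIMU API is hypothetical, as is the simple triple-pattern representation), a federation engine could restrict the candidate sources of each triple pattern to the datasets WIMU returns for its bound subject URI:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SourceSelectionSketch {

    /** Hypothetical wrapper around the WIMU HTTP API: returns the ranked datasets for a URI. */
    static List<String> wimuDatasetsFor(String uri) {
        // e.g. GET https://w3id.org/where-is-my-uri/Find?top=5&uri=<uri> and parse the JSON response
        return Collections.emptyList();
    }

    /** Maps each triple pattern to its candidate sources based on its bound subject URI. */
    static Map<String, List<String>> selectSources(List<String[]> triplePatterns) {
        Map<String, List<String>> candidates = new HashMap<>();
        for (String[] tp : triplePatterns) {      // tp = {subject, predicate, object}
            String subject = tp[0];
            if (subject.startsWith("http")) {     // only bound URIs can be looked up
                candidates.put(String.join(" ", tp), wimuDatasetsFor(subject));
            }
        }
        return candidates;
    }
}
```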

4.4 Usage Examples

The service API provides JSON as output, allowing users to use WIMU from any programming language that can handle JSON. Here we give some examples; for more details, please check the manualFootnote 12.

Service: https://w3id.org/where-is-my-uri/Find.

Parameters (see Table 1):

Table 1. Parameters

Input (Single URI example):

https://w3id.org/where-is-my-uri/Find?top=5&uri=http://dbpedia.org/resource/Leipzig.

Output:

figure c

Java example using the GsonFootnote 13 API:

figure d
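The exact client code is given in the listing above (figure d); as a hedged sketch of such a client, the following fetches the JSON response for a URI and pretty-prints it with Gson, without assuming the exact structure of the response:

```java
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WimuClientSketch {
    public static void main(String[] args) throws Exception {
        String uri = "http://dbpedia.org/resource/Leipzig";
        String request = "https://w3id.org/where-is-my-uri/Find?top=5&uri="
                + URLEncoder.encode(uri, "UTF-8");
        try (Reader reader = new InputStreamReader(
                new URL(request).openStream(), StandardCharsets.UTF_8)) {
            // Parse the JSON response without assuming its exact field names.
            JsonElement response = new JsonParser().parse(reader);
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            System.out.println(gson.toJson(response));
        }
    }
}
```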

5 Statistics About the Datasets

To the best of our knowledge, LODStats [1] is the only project dedicated to monitoring dump files; however, its last update dates back to 2016. Observing Table 2, we can see that not all datasets from LODStats are ready to use. In particular, more than 58% are offline, 14% are empty, 8% of the triples that have literals as objects involve blank nodes, and 35% of the online datasets present some error when using the Apache Jena parserFootnote 14. A large part of these data was processed and cleaned by LOD Laundromat [2].

Table 2. Datasets.

The algorithm took three days and seven hours to complete the task; thus, we will create a scheduled job to update our database index once a month. With respect to the information presented in Fig. 5, we can observe that the majority of files from LODStats are in RDF/XML format. Moreover, endpoints are represented in greater numbers (78.6%), the dominant file format is RDF (84.1% of the cases), and 56.2% of the errors occurred because Apache Jena was not able to perform SPARQL queries. Among the HDT files from LOD Laundromat, 2.3% could not be processed due to parsing errors. Another relevant point is that 99.2% of the URIs indexed by WIMU come from LOD Laundromat, since 69.8% of the datasets from LODStats contain parser errors that prevented WIMU from processing the data.

Fig. 5. Dump files and Apache Jena parsing error.

Finally, we validated our heuristic by assessing whether a URI really belongs to the dataset with the most literals. To this end, we took a sample of 100 URIsFootnote 15 that belong to at least two datasets and manually checked whether the results are correct. As a result, the dataset containing the correct information was ranked first for 90% of the URIs and among the top three for 95% of the URIs.

6 Conclusion and Future Work

We provide a database index of URIs and their respective datasets built upon large Linked Data hubs such as LODStats and LOD Laundromat. In order to make these data available and easy to use, we developed a Semantic Web service. For a given URI, it is possible to know the dataset the URI was most likely defined in, using a heuristic based on the number of literals. We presented three use cases and carried out a preliminary evaluation to facilitate the understanding of our work. As future work, we will integrate the service into version 2.0 of the LinkLion repository, so as to perform linkset quality assurance by applying link discovery algorithms to the stored links.