
1 Introduction

In the Web of Data, applications such as Link Discovery or Data Integration frameworks need to know where a specific URI is located. However, due to the decentralized architecture of the Web of Data and the limited reliability and availability of Linked Data services, locating such URIs is not a trivial task. Locating a URI from a well-known data dump might be easy. For example, it is trivial to know that the URI http://dbpedia.org/resource/Leipzig belongs to the DBpedia dataset. However, locating the dataset where the URI http://citeseer.rkbexplorer.com/id/resource-CS116606 was first defined is a time-consuming task. Consequently, this can greatly affect the scalable and time-efficient deployment of many Semantic Web applications such as link discovery, Linked Data enrichment, and federated query processing. On the other hand, such provenance information about URIs can be used to regenerate and validate links across datasets.

The limited availability of services that provide such information is unfortunately one of the key issues in the Semantic Web and Linked Data communities. It has been shown that around 90% of the information published as Linked Open Data is available as data dumps only and that more than 60% of endpoints are offline [14]. This availability problem is mostly due to the cost associated with storing the data and providing querying services over it.

To this end, we propose Where is my URI? (WIMU), a low-cost Semantic Web service that determines the RDF data sources in which a URI occurs, along with information about its use. If a single URI is provided by multiple data sources, we rank these sources using a scoring function. Currently, our service has processed more than 58 billion unique triples from more than 660,000 datasets obtained from LODStats [1] and LOD Laundromat [2]. For each URI, our service provides the corresponding datasets and the number of literals attached to that URI in each dataset. The service is available through a web interface and can also be queried from client applications using the standard HTTP protocol. We believe our service can be used in multiple Linked Data related problems such as devising fast link discovery frameworks, efficient source selection, and distributed query processing.

Our main contributions are as follows:

  • We provide a regularly updatedFootnote 1 database index of more than 660K datasets from LODStats and LOD Laundromat.

  • We provide an efficient, low-cost, and scalable web service that shows which dataset most likely defines a URI.

  • We provide various statistics of datasets indexed from LODStats and LOD Laundromat.

The service is available from https://w3id.org/where-is-my-uri/ under the GNU Affero General Public License 3.0 and the source code is available onlineFootnote 2.

The rest of the paper is organized as follows: We first provide a brief overview of the state-of-the-art. We then discuss the proposed approach in detail, including the index creation, the web interface, and the data processing. We finally present our evaluation results and conclude.

2 Related Work

The work presented in [4] shows how to set up a Linked Data repository called DataTank and publish data as Turtle files or through a SPARQL endpoint. The difference with WIMU is that we provide a RESTful service instead of a setup for configuring a Linked Data repository.

The work in [7] is based on an approach developed for the 3store RDF triple store and describes a technique for indexing RDF documents that allows the ranking and retrieval of their contents. Their index contained \(10^7\) triples, which was remarkable for the early years of the Semantic Web. However, their system is no longer available for tests. Similarly to our work, the authors claim that, for a given URI from an RDF document, the system retrieves the URLs of the documents containing that URI.

The approach called LOD-A-LOT [5], a queryable dump file of the LOD CloudFootnote 3, differs from WIMU in several respects. First, it is not possible to retrieve the provenance of a URI, i.e., to know in which dataset the URI was defined. Instead, a huge dump fileFootnote 4 containing all the data from LOD LaundromatFootnote 5 is provided. LOD Laundromat itself provides an endpoint to an inverted index of their dataFootnote 6. However, finding the original document a URI was defined in is not trivial, as the returned metadata only describe the datasets themselves [2]. Moreover, as the primary aim of LOD Laundromat is to “clean” Linked Data, most dumps are possibly not continuously monitored once cleaned.

Compared with all the approaches above, the main advantage of WIMU is that the datasets a URI likely belongs to are ranked using a score. Our index also has a larger coverage, as it includes data from the two largest Linked Data hubs, i.e., LODStats [1] and LOD Laundromat [2], as well as the most recently updated SPARQL endpoints. Finally, WIMU is able to process RDF files containing more than one URI at the same timeFootnote 7.

3 The Approach

WIMU uses the number of literals as a heuristic to identify the dataset which most likely defines a URI. The intuition behind this can be explained in two points: (1) literal values are the raw data that can disambiguate a URI node in the most direct way, and (2) the Semantic Web architecture expects that datasets reusing a URI only refer to it, without defining additional literal values. A further reason for point (1) is that it is straightforward to determine whether two literal values are different, whereas disambiguating URIs usually requires more effort.

We store the collected data in an Apache LuceneFootnote 8 index. For reasons of runtime performance and complexity, we found storing the information in Lucene more convenient than using a traditional triple store such as VirtuosoFootnote 9. The rationale behind this choice is that a tuple such as (URI, Dataset, Score) would be expressed using at least three triples; for instance:

figure a

where the subject is an index record URI. Therefore, materializing all records would have substantially increased the space complexity of our index.
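To illustrate this design choice, the following is a minimal sketch of how one such record could be stored as a single Lucene document instead of three triples. The field names (uri, dataset, literals), the index path, and the example values are our own assumptions for illustration, not necessarily those used by WIMU.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexRecordSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("wimu-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // One Lucene document per (URI, Dataset, Score) record,
            // instead of (at least) three triples in a triple store.
            Document doc = new Document();
            doc.add(new StringField("uri", "http://dbpedia.org/resource/Leipzig", Field.Store.YES));
            doc.add(new StringField("dataset", "http://example.org/datasets/dbpedia-dump.nt.gz", Field.Store.YES));
            doc.add(new StoredField("literals", 42)); // heuristic score: number of literals (illustrative value)
            writer.addDocument(doc);
        }
    }
}
```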

3.1 The Index Creation

The index creation, the core of our work, is shown in Fig. 1 and consists of the following five steps:

  1. Retrieve the list of datasets from the sources (i.e., LODStats and LOD Laundromat).

  2. Retrieve the data from the datasets (i.e., dump files, endpoints, and HDT files).

  3. Build three indexes from the dump files, endpoints, and HDT files.

  4. Merge the indexes into one.

  5. Make the index available and browsable via a web application and an API service.

Fig. 1. Creation workflow of the WIMU index.

For each processed dataset, we keep its URI as provenance. After downloading and extracting a dump file, we process it by counting, for each subject, the literals occurring as objects. For endpoints and HDT files, we use a SPARQL query:

figure b
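The exact query is given in the listing above (figure b); as a hedged reconstruction, a query of this kind counts the literal objects per subject. The sketch below shows how such a query could be executed against a SPARQL endpoint with Apache Jena; the endpoint URL is a placeholder.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class LiteralCountSketch {
    public static void main(String[] args) {
        // Counts literal objects per subject; a reconstruction of the query
        // referenced as figure b, not necessarily the exact one used by WIMU.
        String query =
            "SELECT ?s (COUNT(?o) AS ?literals) " +
            "WHERE { ?s ?p ?o . FILTER(isLiteral(?o)) } " +
            "GROUP BY ?s";
        try (QueryExecution qexec =
                 QueryExecutionFactory.sparqlService("http://example.org/sparql", query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("s") + "\t" + row.getLiteral("literals").getInt());
            }
        }
    }
}
```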

We process the data in parallel, distributing the datasets among the CPU cores. If a dataset is too large for a single core, we split it into smaller chunks. To save space, dump files are deleted after being processed. The index was generated on an Intel Xeon Core i7 machine with 64 cores and 128 GB RAM, running Ubuntu 14.04.5 LTS and Java SE Development Kit 8.
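A minimal sketch of this parallel processing step, assuming a hypothetical processDataset method that downloads a dump, counts literals per subject, writes the records to the index, and then deletes the dump:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexingSketch {

    public static void indexAll(List<String> datasetUrls) throws InterruptedException {
        // One worker per available CPU core; each dataset is processed independently.
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (String datasetUrl : datasetUrls) {
            pool.submit(() -> processDataset(datasetUrl));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS); // the full run reportedly took about three days
    }

    // Hypothetical: download the dump, count literals per subject,
    // write the records to the index, then delete the dump to save space.
    private static void processDataset(String datasetUrl) {
        // ...
    }
}
```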

3.2 The Web Interface and the API Service

In order to simplify the access to our service, we created a web interface where it is possible to visualize all the data from the service, as Fig. 2 shows.

Fig. 2. Web interface.

The web interface allows the user to query a URI and view the results in an HTML web browser, while the API service allows the user to work with the output in JSON format. Fig. 2 shows an example of usage of the service, where WIMU is asked for the dataset in which the URI dbpedia:Leipzig was defined. Figure 4 shows the generic usage of WIMU.

4 Use Cases

In this section, we present three use cases to show how our heuristic supports the proposed tasks.

4.1 Data Quality in Link Repositories

The first use-case is about quality assurance in a link repository by re-applying link discovery algorithms on the stored links. This task concerns important steps of the Linked Data Lifecycle, in particular Data Interlinking and Quality. Link repositories contain sets of links that connect resources belonging to different datasets. Unfortunately, the subject and the object URIs of a link often do not have metadata available, hence their Concise Bounded Descriptions (CBDs) are hard to obtain. In Fig. 3, in the notation \((D_1,...,D_n|x)\), \(D_n\) represents a dataset and x the number of literals. The input for our service in this use-case is S; the output is \(\{(D_1,3),(D_2,1),(D_3,2)\}\), where \(D_1\) most likely defines S due to the highest number of literals. In the same way, the dataset that most likely defines T is \(D_4\) with 7 literals. Once we have this information, the entire CBDs of the two resources S and T can be extracted and a Link Discovery algorithm can check whether the owl:sameAs link between them should subsist (Fig. 3).
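As an illustrative sketch of this selection step (assuming the service's ranked output has already been parsed into a map from dataset URI to literal count; retrieveCBD and linkDiscoveryCheck are hypothetical helpers):

```java
import java.util.Map;

public class UseCaseOneSketch {

    /** Returns the dataset with the highest literal count, e.g. D1 for {(D1,3),(D2,1),(D3,2)}. */
    static String mostLikelyDefiningDataset(Map<String, Integer> literalCounts) {
        return literalCounts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no candidate datasets"));
    }

    static void checkLink(String s, String t,
                          Map<String, Integer> candidatesForS,
                          Map<String, Integer> candidatesForT) {
        String datasetS = mostLikelyDefiningDataset(candidatesForS);
        String datasetT = mostLikelyDefiningDataset(candidatesForT);
        // Hypothetical follow-up: extract the CBDs of s and t from the selected
        // datasets and re-run the link discovery check on the owl:sameAs link.
        // Model cbdS = retrieveCBD(datasetS, s);
        // Model cbdT = retrieveCBD(datasetT, t);
        // boolean keepLink = linkDiscoveryCheck(cbdS, cbdT);
    }
}
```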

Fig. 3. First use-case.

4.2 Finding Class Axioms for Link Discovery

A class axiom is needed by the link discovery algorithm to reduce the number of comparisons. Here, the aim is to find two class axioms for each mapping in the link repository.

To this end, we use real data including a mappingFootnote 10 from the LinkLion repository [9] between http://citeseer.rkbexplorer.com/id/resource-CS65161 (S) and http://citeseer.rkbexplorer.com/id/resource-CS65161 (T). Our service shows that S was defined in four datasets, among which the one with the most literals was http://km.aifb.kit.edu/projects/btc-2009/btc-2009-chunk-039.gzFootnote 11. Thus, we can deduce where the URI S was most likely defined. Knowing the datasets allows us to extract the axioms of the classes our URIs belong to. The techniques to decrease the complexity vary from choosing the most specific class to using an ontology learning tool such as DL-Learner [8].

Fig. 4. Usage.

4.3 Federated Query Processing

Federated queries, which aim to collect information from more than one dataset, are of central importance for many Semantic Web and Linked Data applications [11, 12]. One of the key steps in federated query processing is source selection. The goal of source selection is to find relevant sources (i.e., datasets) for the given user query. In the next step, the federated query processing engine makes use of the source selection information to generate an optimized query execution plan. WIMU can be used by federated SPARQL engines to find the relevant sources for the individual triple patterns of a given SPARQL query. In particular, our service can be helpful during source selection and query planning in cost-based SPARQL federation engines such as SPLENDID [6], SemaGrow [3], HiBISCuS [13], and CostFed [10].
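As a sketch of this idea (the wimuDatasetsFor helper standing in for a call to the WIMU API is hypothetical, as is the simple triple-pattern representation), a federation engine could restrict the candidate sources of each triple pattern to the datasets WIMU returns for its bound subject URI:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SourceSelectionSketch {

    /** Hypothetical wrapper around the WIMU HTTP API: returns the ranked datasets for a URI. */
    static List<String> wimuDatasetsFor(String uri) {
        // e.g. GET https://w3id.org/where-is-my-uri/Find?top=5&uri=<uri> and parse the JSON response
        return Collections.emptyList();
    }

    /** Maps each triple pattern to its candidate sources based on its bound subject URI. */
    static Map<String, List<String>> selectSources(List<String[]> triplePatterns) {
        Map<String, List<String>> candidates = new HashMap<>();
        for (String[] tp : triplePatterns) {      // tp = {subject, predicate, object}
            String subject = tp[0];
            if (subject.startsWith("http")) {     // only bound URIs can be looked up
                candidates.put(String.join(" ", tp), wimuDatasetsFor(subject));
            }
        }
        return candidates;
    }
}
```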

4.4 Usage Examples

The service API provides JSON as output, allowing users to use WIMU from any programming language that can handle JSON. Here we give some examples; for more details, please check the manualFootnote 12.

Service: https://w3id.org/where-is-my-uri/Find.

Parameters (see Table 1):

Table 1. Parameters

Input (Single URI example):

https://w3id.org/where-is-my-uri/Find?top=5&uri=http://dbpedia.org/resource/Leipzig.

Output:

figure c

Java example using the GsonFootnote 13 API:

figure d
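The exact client code is given in the listing above (figure d); as a hedged sketch of such a client, the following fetches the JSON response for a URI and pretty-prints it with Gson, without assuming the exact structure of the response:

```java
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class WimuClientSketch {
    public static void main(String[] args) throws Exception {
        String uri = "http://dbpedia.org/resource/Leipzig";
        String request = "https://w3id.org/where-is-my-uri/Find?top=5&uri="
                + URLEncoder.encode(uri, "UTF-8");
        try (Reader reader = new InputStreamReader(
                new URL(request).openStream(), StandardCharsets.UTF_8)) {
            // Parse the JSON response without assuming its exact field names.
            JsonElement response = new JsonParser().parse(reader);
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            System.out.println(gson.toJson(response));
        }
    }
}
```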

5 Statistics About the Datasets

To the best of our knowledge, LODStats [1] is the only project dedicated to monitoring dump files; however, its last update dates back to 2016. Observing Table 2, we can see that not all datasets from LODStats are ready to use. In particular, more than 58% are offline, 14% are empty, 8% of the triples that have literals as objects involve blank nodes, and 35% of the online datasets present some error when using the Apache Jena parserFootnote 14. A large part of these data was processed and cleaned by LOD Laundromat [2].

Table 2. Datasets.

The algorithm took three days and seven hours to complete the task; thus, we will create a scheduled job to update our database index once a month. With respect to the information presented in Fig. 5, we can observe that the majority of files from LODStats are in RDF/XML format. Moreover, endpoints are represented in greater numbers (78.6%), the dominant file format is RDF (84.1% of the cases), and 56.2% of the errors occurred because Apache Jena was not able to perform SPARQL queries. Among the HDT files from LOD Laundromat, 2.3% could not be processed due to parsing errors. Another relevant point is that 99.2% of the URIs indexed by WIMU come from LOD Laundromat, since 69.8% of the datasets from LODStats contain parser errors that prevented WIMU from processing the data.

Fig. 5. Dump files and Apache Jena parsing error.

Finally, we validated our heuristic by assessing whether a URI really belongs to the dataset with the most literals. To this end, we took a sample of 100 URIsFootnote 15 that belong to at least two datasets and manually checked whether the results are correct. As a result, the dataset containing the correct information was ranked first for 90% of the URIs and among the top three for 95% of the URIs.

6 Conclusion and Future Work

We provide a database index of URIs and their respective datasets built upon large Linked Data hubs such as LODStats and LOD Laundromat. In order to make these data available and easy to use, we developed a Semantic Web service. For a given URI, it is possible to know the dataset the URI was most likely defined in, using a heuristic based on the number of literals. We presented three use cases and carried out a preliminary evaluation to facilitate the understanding of our work. As future work, we will integrate the service into version 2.0 of the LinkLion repository, so as to perform linkset quality assurance by applying link discovery algorithms to the stored links.