1 Introduction

A Knowledge Base (KB) stores descriptions of entities and their relationships, often in the form of a very large RDF graph, such as DBpedia or Wikidata. A relationship path between an entity pair is a path in an RDF graph that connects the nodes that represent the entities. The entity relatedness problem refers to the question of computing the relationship paths that best describe the connectivity between a given entity pair.

Several approaches [1,2,3,4,5,6] have been proposed to address the entity relatedness problem. They apply a simple strategy: (1) search for relationship paths between the given entity pair (the larger the number of paths found, the stronger the connectivity between the entities is likely to be); and (2) sort the paths found and select the relevant ones. However, there are currently no adequate benchmarks to measure the effectiveness of such approaches. In some cases, expert users evaluate the results, and an apparently reliable method to judge the effectiveness of the approach is introduced. In others, a ground truth is created, which is a difficult and time-consuming task, and the authors rarely make the resources available. Thus, an open challenge is: How to evaluate and compare approaches that address the entity relatedness problem?

The major contribution of this paper is a dataset created to support the evaluation of approaches that address the entity relatedness problem, which we refer to as the Entity Relatedness Test Dataset. The dataset contains entities and relationship paths extracted from DBpedia that pertain to two familiar domains, music and movies, and additional data extracted from the Internet Movie Database – IMDb and last.fm, which are popular reference datasets in these domains. The dataset and resources are available at [17,18,19,20,21].

The paper describes in detail the major steps and design decisions behind the construction of the dataset. The first design decision was to select DBpedia as the reference knowledge base, from which we extracted relationship paths. The second design decision was to select the movies and music domains, which are backed up by two well-known datasets, IMDb and last.fm, from which we extracted reliable domain-specific knowledge. The dataset construction process involved three major steps. The first step consisted of selecting a set of entity pairs from the music and movies domains. The second step was the extraction of a set of relationship paths from DBpedia for each entity pair. The final step was to rank the paths, based on information extracted from IMDb and last.fm, and to select the top-k ones.

This paper is structured as follows. Section 2 summarizes related work. Section 3 introduces a generic strategy to find and rank relationship paths. Section 4 describes the construction of the dataset. Finally, Sect. 5 presents the conclusions.

2 Related Work

Finding and Ranking Relationship Paths Between a Given Entity Pair in a Knowledge Base.

RECAP [3], EXPLASS [4] and DBpedia Profiler [6] implemented path finding processes in an RDF knowledge base with the help of SPARQL queries [9]. REX [2] used two breadth-first searches on the RDF graph to enumerate relationship paths between two entities, and considered the degree of a node as an activation criterion to prioritize nodes. Likewise, the work in [15] used the Jaccard similarity to compute an approximate minimal distance between the start and the end nodes, and to discover meaningful connections between the nodes.

Evaluating Relationship-Path Ranking in a Knowledge Base.

Path-ranking measures were proposed in [3, 6, 8] to rank relationship paths in knowledge bases. Some approaches [3, 4, 6] evaluated relationship path rankings with the help of user experiments. However, the evaluation methods did not clearly define the capabilities of the approaches analyzed. The work proposed in [10] argued that entity similarity heuristics increase the relevance of the links between nodes. The authors compared and measured the effectiveness of different search strategies through user experiments.

In this paper, we describe a dataset containing entity pairs and relationship paths in two entertainment domains, music and movies, to compare approaches that address the entity relatedness problem.

3 A Generic Relationship Path Finding and Ranking Process

An RDF graph G is a set of RDF triples of the form G = {(s, p, o)} = (V, E), where the subject is an entity s ∈ V, and it has a property p ∈ E whose value is an object o ∈ V, which is either another entity or a literal. In particular, p is seen as the edge that links the entities s and o in an RDF graph. We will use the terms entity and node of G interchangeably.

A relationship path in G between nodes \( w_0 \) and \( w_k \) is an expression of the form \( (w_0, p_1, w_1, p_2, w_2, \ldots, p_{k-1}, w_{k-1}, p_k, w_k) \), where: k is the length of the path; \( w_i \) is a node of G such that \( w_i \) and \( w_j \) are different, for \( 0 \le i \ne j \le k \); and either \( (w_i, w_{i+1}) \) or \( (w_{i+1}, w_i) \) is an edge of G labeled with \( p_{i+1} \), for \( 0 \le i < k \). Note that, since a relationship path is an undirected path, but G is a directed graph, we allow either \( (w_i, w_{i+1}) \) or \( (w_{i+1}, w_i) \) to participate in the path. Alternatively, one may assume that each property p has an inverse, denoted “^p”, using SPARQL notation.
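The definition above can be read as a small validity check. The following is a minimal sketch, assuming the graph is held in memory as a set of (subject, property, object) tuples and that a path has at least one edge; the function name `is_relationship_path` is hypothetical and does not come from the paper.

```python
from typing import Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_relationship_path(path: Sequence[str], graph: Set[Triple]) -> bool:
    """Check a flat sequence (w0, p1, w1, ..., pk, wk) against the definition."""
    # A path with k >= 1 edges has 2k + 1 elements (assumption: k = 0 is rejected).
    if len(path) < 3 or len(path) % 2 == 0:
        return False
    nodes = path[0::2]
    props = path[1::2]
    # All nodes on the path must be pairwise different.
    if len(set(nodes)) != len(nodes):
        return False
    for i, p in enumerate(props):
        a, b = nodes[i], nodes[i + 1]
        # The path is undirected: accept the edge in either direction.
        if (a, p, b) not in graph and (b, p, a) not in graph:
            return False
    return True
```

The check uses plain property labels; the alternative reading with inverse properties would instead store a label such as `^p` on the path and require the edge in the reversed direction only.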

To construct the dataset, we adopt a generic path finding and ranking process, briefly described as follows.

The path finding algorithm receives an RDF graph G, two target entities, \( v_{start} \) and \( v_{end} \), a maximum distance k, and an activation function \( \tau \). It implements two breadth-first searches (BFS), executed in parallel, to find paths in G between the target entities [7, 10, 14]. A BFS is started from each target entity (line 6). Subpaths are generated in the expansion step, and full paths are created when one of the target entities is reached, or when subpaths in \( S_{left} \) and \( S_{right} \) share a common entity (line 7). The activation function \( \tau \) optimizes the traversal of G; only entities that comply with the activation criteria are considered. The output of the algorithm is a set of RDF paths between \( v_{start} \) and \( v_{end} \).

The path ranking algorithm receives a set of paths Paths and a path ranking function f, and outputs a ranked subset of Paths.

The final algorithm calls the path finding algorithm and then the path ranking algorithm. It outputs a ranked list of paths.

figure a
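As an illustration of the bidirectional search described in this section, the sketch below expands subpaths from both target entities and joins them where they share an entity. It is not the algorithm of the figure (its line numbers do not correspond to this code) and rests on several assumptions: the graph is an in-memory set of triples, the two searches run one after the other rather than in parallel, the target entities are distinct, and the activation function is a Boolean filter on intermediate entities. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Path = Tuple[str, ...]  # (w0, p1, w1, ..., pk, wk)


def build_adjacency(graph: Set[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Undirected view of G: inverse edges are labelled with '^', as in SPARQL."""
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in graph:
        adj[s].append((p, o))
        adj[o].append(("^" + p, s))
    return adj


def expand(adj, source: str, depth: int, tau: Callable[[str], bool],
           targets: Set[str]) -> List[List[Path]]:
    """BFS from source; levels[d] holds the simple subpaths with d edges."""
    levels: List[List[Path]] = [[(source,)]]
    for _ in range(depth):
        nxt: List[Path] = []
        for path in levels[-1]:
            for label, node in adj.get(path[-1], []):
                if node in path[0::2]:
                    continue  # keep subpaths simple
                if node not in targets and not tau(node):
                    continue  # entity rejected by the activation function
                nxt.append(path + (label, node))
        levels.append(nxt)
    return levels


def invert(label: str) -> str:
    return label[1:] if label.startswith("^") else "^" + label


def reverse_path(path: Path) -> Path:
    """Reverse a subpath, inverting each property label."""
    nodes = path[0::2][::-1]
    labels = [invert(l) for l in path[1::2][::-1]]
    out = [nodes[0]]
    for label, node in zip(labels, nodes[1:]):
        out += [label, node]
    return tuple(out)


def find_paths(graph: Set[Triple], v_start: str, v_end: str, k: int,
               tau: Callable[[str], bool]) -> Set[Path]:
    adj = build_adjacency(graph)
    targets = {v_start, v_end}
    left = expand(adj, v_start, (k + 1) // 2, tau, targets)
    right = expand(adj, v_end, k // 2, tau, targets)
    paths: Set[Path] = set()
    for b, right_level in enumerate(right):
        by_end = defaultdict(list)
        for r in right_level:
            by_end[r[-1]].append(r)
        # Joining lengths (b, b) and (b + 1, b) yields each full path exactly once.
        for a in (b, b + 1):
            if a + b == 0 or a >= len(left):
                continue
            for l in left[a]:
                for r in by_end.get(l[-1], []):
                    # The two halves may share only the meeting entity.
                    if len(set(l[0::2]) & set(r[0::2])) == 1:
                        paths.add(l + reverse_path(r)[1:])
    return paths
```

Splitting the budget k between the two searches keeps each frontier shallow, which is the usual motivation for a bidirectional search on graphs with high-degree nodes.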

4 Constructing the Entity Relatedness Test Dataset

The construction of the Entity Relatedness Test Dataset poses three major challenges: (1) how to select entity pairs; (2) how to find relationship paths for the entity pairs selected; and (3) how to rank the relationship paths. We addressed these challenges in the movies and music domains.

The dataset and resources are available at [17,18,19,20,21]. Examples and a more detailed evaluation of how to use this dataset can be found in [16].

4.1 Selecting Entity Pairs

We focused on best-selling music artists (Footnote 1), in the music domain, and on famous classic actors and actresses (Footnote 2), in the movies domain. We considered the box office sales and the actor’s fame as relevance criteria for the music and movies domains.

After selecting a list of entities from each of these two domains, we submitted each entity to Google Search to select a set of related entities. Then, for each possible entity pair, we computed its semantic connectivity score (Footnote 3) [11] in DBpedia to discover entity pairs with high connectivity. The maximum path length between two entities was set to 4, since this value is backed up by the small world phenomenon [12], which says that a pair of nodes is separated by a small number of connections, and since it was confirmed in previous experiments [15].

4.2 Finding Relationship Paths

For each of the 40 entity pairs of our dataset, we used the path finding algorithm described in Sect. 3 (and introduced in [16]) to create 40 sets, each with 50 relationship paths. We applied the algorithm to the RDF graph of DBpedia, and used an activation function that prioritizes entities that are instances of classes of the DBpedia ontology that pertain to the domain in question. The classes or types of an entity in DBpedia are defined through the rdf:type property. The classes of the DBpedia ontology in the music and movies domains are listed in Tables 7 and 8 in Sect. 5 of [16]. Entities that belong to these classes are considered in the generation of relationship paths in DBpedia. The path finding algorithm uses membership in these domain classes as its single activation function: the expansion process analyzes the types of each entity and, if the entity belongs to a class of the domain ontology, prioritizes it when generating relationship paths.

To define which classes of the DBpedia ontology pertain to each of the domains in question, we adopted as reference the Music Ontology, for the music domain, and the Movie Ontology, for the movies domain. Then, we manually selected classes of the DBpedia ontology that could be paired with the major classes of each reference ontology.

4.3 Mapping Entities

In preparation for the path ranking process, we mapped entities in DBpedia to entities in the reference datasets, as explained in this section.

Music Domain.

To map DBpedia entities to last.fm, we used the keyword search API of last.fm (Footnote 4): api:artist.getInfo, api:album.getInfo and api:track.getInfo.

We first determined whether the entity represented an artist or a musical content by analyzing the rdf:type property, as in [6]. For example, the entity dbr:Michael_Jackson has type dbo:Artist. If the entity represented an artist, we extracted keywords from its URI (such as “Michael + Jackson”) and submitted them to api:artist.getInfo (Footnote 5) to search for the entity. If the search was successful, we had an exact mapping; otherwise we used other keywords. If the entity represented musical content (an album, song or single), we had to identify its main artist in DBpedia, through the property dbp:artist. For example, the main artist of dbr:Thriller_(album) is dbr:Michael_Jackson. If the entity represented a musical album, we called api:album.getInfo (Footnote 6) to search for the entity. Similarly, if the entity represented a song or a single, we called api:track.getInfo.
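The decision procedure above can be sketched as a function that picks the last.fm method and builds its keyword parameters, without performing any network call. This is only an illustration under stated assumptions: the classes dbo:Album, dbo:Song and dbo:Single, the parameter names `artist`, `album` and `track`, and the helper names are assumptions made here; the text names only dbo:Artist and the three API methods.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Assumed pairing of DBpedia classes with last.fm methods; only dbo:Artist
# appears in the text, the other class names are illustrative.
TYPE_TO_METHOD = (
    ("dbo:Artist", "artist.getInfo", None),
    ("dbo:Album", "album.getInfo", "album"),
    ("dbo:Song", "track.getInfo", "track"),
    ("dbo:Single", "track.getInfo", "track"),
)


def keywords_from_uri(uri: str) -> str:
    """Turn 'dbr:Thriller_(album)' into the keywords 'Thriller'."""
    local = uri[len("dbr:"):] if uri.startswith("dbr:") else uri
    local = re.sub(r"_\([^)]*\)$", "", local)  # drop a trailing disambiguation suffix
    return local.replace("_", " ")


def choose_lookup(uri: str, types: Set[str],
                  main_artist_uri: Optional[str] = None
                  ) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (method name, keyword parameters), or None if no lookup applies."""
    for dbo_class, method, content_key in TYPE_TO_METHOD:
        if dbo_class not in types:
            continue
        if content_key is None:  # the entity is an artist
            return method, {"artist": keywords_from_uri(uri)}
        if main_artist_uri is None:  # musical content needs its main artist
            return None
        return method, {
            "artist": keywords_from_uri(main_artist_uri),
            content_key: keywords_from_uri(uri),
        }
    return None
```

When the parameters are URL-encoded for a request, the space in "Michael Jackson" would typically become "+", which matches the keyword form quoted in the text.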

Movies Domain.

In DBpedia, we used the property rdf:type to decide if an entity was a movie. In any other case, we considered the entity as a participant of a movie. We identified the immediate type of an entity using the method proposed in [6].

To map DBpedia entities to IMDb, we imported the IMDb (Footnote 7) database to a local PostgreSQL database and re-created data about names, movies and casts (people who worked in a movie). Usually, the URI of a DBpedia entity is self-descriptive. For example, the URI dbr:Cleopatra_(1963_film) indicates the name of a movie, “Cleopatra”, and its release year, “1963”. We used this basic description to find the same entity in IMDb through classic SQL queries. For those cases where the queries returned more than one result, we used the Levenshtein Distance [13] to choose the IMDb entity most similar to the DBpedia entity.
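The two ingredients of this mapping, reading the title and year from the URI and breaking ties with the Levenshtein distance, can be sketched as follows. The SQL lookup itself is not shown; `candidates` stands for a hypothetical list of titles returned by such a query, and the URI pattern handled here (`Title_(YYYY_film)` or `Title_(film)`) is an assumption based on the single example in the text.

```python
import re
from typing import List, Optional, Tuple


def parse_film_uri(uri: str) -> Tuple[str, Optional[int]]:
    """Extract (title, release year) from a URI such as dbr:Cleopatra_(1963_film)."""
    local = uri[len("dbr:"):] if uri.startswith("dbr:") else uri
    m = re.match(r"^(.*?)(?:_\((?:(\d{4})_)?film\))?$", local)
    title = m.group(1).replace("_", " ")
    year = int(m.group(2)) if m.group(2) else None
    return title, year


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pick_best(title: str, candidates: List[str]) -> Optional[str]:
    """Choose the candidate title closest to the DBpedia title."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: levenshtein(title.lower(), c.lower()))
```

Comparing lower-cased strings is a choice made for this sketch; the text does not say whether the comparison was case-sensitive.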

4.4 Ranking the Relationship Paths

We ranked the paths in each of the 40 sets using semantic information extracted from IMDb and last.fm to compute entity ratings, and information extracted from DBpedia to compute property relevance scores.

To obtain the ranked lists, we first computed the score of each path π as the average of the rating of the entities involved in the path. Recall that π is a path in the DBpedia graph. Each entity e used in π was first mapped to an equivalent entity e’ in IMDb or last.fm, as explained in Sect. 4.3; the rating of e’ was computed from data in IMDb or last.fm, as described below, and assigned to e. Finally, the score of π was computed as the average of the ratings of the entities that occur in π.

For each entity pair, we ranked the paths using their scores and retained the top 50 paths. However, since the path score ignores the relevance of the properties, paths that involve the same entities will have the same score. As a further step, we inspected each ranked list and used the relevance scores of the properties, computed in DBpedia, to help rank the paths with the same entities.

This ranking process is justified for two basic reasons. On one hand, we intended to create a dataset that would help evaluate approaches that address the entity relatedness problem, which typically involve a path ranking measure. Therefore, it would not be reasonable to adopt a path ranking measure from the literature (which would create ranked lists biased to that measure). On the other hand, it would be infeasible to manually rank the relationship paths that connect two entities (in DBpedia), whose number is typically very high [16]. Hence, we opted to: (1) select two domains – music and movies – for which specialized data were available; (2) filter the paths in DBpedia so that they traverse only entities in each of these domains; (3) use specialized domain data to pre-rank the paths found; (4) manually inspect and sanction the pre-ranking, which proved to be a feasible task. The computation of entity ratings and property relevance scores is detailed below.

Entity Rating in the Music Domain.

In last.fm, each artist and musical content has two relevance scores: the listeners score and the play count score. This information can be accessed through the search API of last.fm. The listeners score represents the number of different users who listened to a song, and the play count score is the number of times the song was listened to. An album receives as play count score (or listeners score) the sum of the play count scores (or listeners scores) of the songs in the album. Similarly, an artist receives a play count score and a listeners score. We used the play count score to create an entity rating in the music domain; if the entity was not identified in the mapping, we assigned a zero score.

Entity Rating in the Movies Domain.

IMDb publishes user-generated ratings for movies; an IMDb registered user can cast a vote (from 1 to 10) for every released movie in the database. Users can vote as many times as they want, but each vote overwrites the previous one. In the case of people (actors, directors, writers) involved in a movie, we computed the rating of a person as the average rating of the movies in which the person participated. We imported the movie ratings to our local database and, with the table Cast, we related movies and actors to compute the artist rating. Again, if the entity was not identified in the mapping, we assigned a zero score.
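A minimal sketch of the person rating described above, assuming the movie ratings and the Cast table have been loaded into memory as a dictionary and a list of (movie, person) pairs. The function names are hypothetical, and skipping movies without a rating is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def person_ratings(movie_rating: Dict[str, float],
                   cast: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average, per person, the ratings of the movies the person took part in."""
    seen: Dict[str, List[float]] = defaultdict(list)
    for movie, person in cast:
        if movie in movie_rating:  # assumption: unrated movies are ignored
            seen[person].append(movie_rating[movie])
    return {person: sum(r) / len(r) for person, r in seen.items()}


def entity_rating(entity: str, movie_rating: Dict[str, float],
                  people: Dict[str, float]) -> float:
    """Rating of a mapped entity; unmapped entities receive a zero score."""
    if entity in movie_rating:
        return movie_rating[entity]
    return people.get(entity, 0.0)
```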

Property Relevance Score in DBpedia.

We used the inverse triple frequency (ITF) [3] as the property relevance score, defined as \( itf(p, G) = \log\frac{|G|}{|G_p|} \), where \( |G| \) is the number of triples in the knowledge base G and \( |G_p| \) is the number of triples in G whose property is p.
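The ITF definition translates directly into a frequency count over the triples. The sketch below assumes an in-memory collection of triples and uses the natural logarithm, since the base is not stated in the definition; the choice of base does not change the relative order of the properties. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def itf_scores(graph: Iterable[Triple]) -> Dict[str, float]:
    """itf(p, G) = log(|G| / |G_p|) for every property p that occurs in G."""
    triples = list(graph)
    total = len(triples)                        # |G|
    counts = Counter(p for _, p, _ in triples)  # |G_p| for each p
    return {p: math.log(total / c) for p, c in counts.items()}
```

A property that labels every triple gets a score of 0, and rarer properties get higher scores.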

Example:

Consider the following paths of the DBpedia RDF graph:

  • P1. Elizabeth_Taylor ^producer The_Taming_of_the_Shrew starring Richard_Burton

  • P2. Elizabeth_Taylor ^starring The_Taming_of_the_Shrew starring Richard_Burton

where “Elizabeth_Taylor”, “Richard_Burton” and “The_Taming_of_the_Shrew” actually are abbreviations for the URIs of these DBpedia entities, and likewise for the properties.

The first step is to compute the entity rating of these entities using information from IMDb, which involves finding these DBpedia entities in IMDb. The path scores are computed as the average of the rating of the entities in the path. Since these two paths involve the same entities, they will have the same score. The second step is then to compute the ITF in DBpedia of the properties “^starring” and “^producer” to help disambiguate the ranking. Since “^producer” is less frequent in DBpedia than “^starring”, it has a higher ITF. Path P1 should then be ranked before P2. However, this is subject to manual inspection to confirm the preference of P1 over P2, which was the final decision in this case, on the grounds that P1 is perhaps more informative to the user than P2.
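The automatic part of this example can be sketched as a two-level sort: first by the average entity rating, then by property relevance. The ratings and ITF values used in the usage note are made up for illustration and are not taken from IMDb or DBpedia. Two further assumptions are made here: the relevance of a path is the sum of the ITF scores of its properties, and an inverse property such as ^producer shares the ITF of producer. The final manual inspection is outside the scope of the sketch, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # (w0, p1, w1, ..., pk, wk)


def path_score(path: Path, rating: Dict[str, float]) -> float:
    """Average rating of the entities on the path; unmapped entities count as 0."""
    nodes = path[0::2]
    return sum(rating.get(n, 0.0) for n in nodes) / len(nodes)


def property_relevance(path: Path, itf: Dict[str, float]) -> float:
    """Assumed aggregate: sum of the ITF scores of the properties on the path."""
    return sum(itf.get(p.lstrip("^"), 0.0) for p in path[1::2])


def rank_paths(paths: List[Path], rating: Dict[str, float],
               itf: Dict[str, float], top_k: int = 50) -> List[Path]:
    """Sort by path score, break ties by property relevance, keep the top k."""
    key = lambda p: (path_score(p, rating), property_relevance(p, itf))
    return sorted(paths, key=key, reverse=True)[:top_k]
```

For instance, with hypothetical ratings `{"Elizabeth_Taylor": 7.0, "The_Taming_of_the_Shrew": 7.2, "Richard_Burton": 6.8}` and hypothetical ITF scores `{"producer": 5.0, "starring": 3.0}`, P1 and P2 obtain the same path score and the tie-break places P1 first.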

5 Conclusions and Future Work

In this paper, we described a dataset created to support the evaluation of approaches that address the entity relatedness problem. The dataset contains entity pairs in the movies and music domains, and lists of relationship paths in DBpedia, ranked based on information about their entities found in IMDb and last.fm, and on information about their properties computed from DBpedia.

The dataset can be used to test activation functions, based on entity similarity measures, and path ranking measures directly on the DBpedia graph. To use the dataset in the context of another knowledge base K, one should remap the entities and properties used in our reference dataset to K, much as we described in Sect. 4.

The construction process can be replicated in other domains where, intuitively: (1) entities with high reputation help select “meaningful” paths; (2) less frequent properties, or more discriminatory properties, also help select “meaningful” paths. In fact, the construction process described in Sect. 4 is as interesting as the resulting dataset. Therefore, as future work, we plan to focus on other domains, such as Sports, Video Games and Academic Publication, to increase the size of the Entity Relatedness Test Dataset described in the paper.