Abstract
With the growth of the number and the size of RDF datasets comes an increasing need for scalable solutions to support the linking of resources. Most Link Discovery frameworks rely on complex link specifications for this purpose. We address the scalability of the execution of link specifications by presenting the first dynamic planning approach for Link Discovery dubbed Condor. In contrast to the state of the art, Condor can re-evaluate and reshape execution plans for link specifications during their execution. Thus, it achieves significantly better runtimes than existing planning solutions while retaining an F-measure of 100%. We quantify our improvement by evaluating our approach on 7 datasets and 700 link specifications. Our results suggest that Condor is up to 2 orders of magnitude faster than the state of the art and requires less than 0.1% of the total runtime of a given specification to generate the corresponding plan.
1 Introduction
The provision of links between knowledge bases is one of the core principles of Linked Data.Footnote 1 Hence, the growth of knowledge bases on the Linked Data Web in size and number has led to a significant body of work which addresses the two key challenges of Link Discovery (LD): efficiency and accuracy (see [1] for a survey). In this work, we focus on the first challenge, i.e., on the efficient computation of links between knowledge bases. Most LD frameworks use combinations of atomic similarity measures by means of specification operators and thresholds to compute link candidates. These combinations are often called linkage rules [2] or link specifications (short LSs; see Fig. 1 for an example and Sect. 2 for a formal definition) and are used to compute links [1]. So far, most approaches for improving the execution of LSs have focused on reducing the runtime of the atomic similarity measures used in LSs (see, e.g., [3,4,5]). While these algorithms have led to significant runtime improvements, they fail to exploit global knowledge about the LSs to be executed. With Condor, we build upon these solutions and tackle the problem of executing link specifications efficiently.
Condor makes use of a minute but significant change in the planning and execution of LSs. So far, the execution of LSs has been modeled as a linear process (see [1]), where a LS is commonly rewritten, planned and finally executed.Footnote 2 While this architecture has its merits, it fails to use a critical piece of information: the execution engine knows more about runtimes than the planner once it has executed a portion of the specification. The core idea behind our work is to make use of the information generated by the execution engine at runtime to re-evaluate the plans generated by the planner. To this end, we introduce an architectural change to LD frameworks by enabling a flow of information from the execution engine back to the planner. While this change might appear negligible, it has a significant effect on the performance of LD systems as shown by our evaluation (see Sect. 4).
The contributions of this work are hence as follows: (1) We propose the first planner for link specifications which is able to re-plan steps of an input LS L based on the outcome of partial executions of L. By virtue of this behavior, we dub Condor a dynamic planner. (2) In addition to being dynamic, Condor goes beyond the state of the art by ensuring that duplicated steps are executed exactly once. Moreover, our planner can also make use of subsumptions between result sets to further reuse previous results of the execution engine. (3) We evaluate our approach on 700 LSs and 7 datasets and show that we outperform the state of the art significantly.
2 Preliminaries
The formal framework underlying our preliminaries is derived from [6, 7]. LD frameworks aim to compute the set \(M=\{(s, t) \in S \times T: R(s, t)\}\) where S and T are sets of RDF resources and R is a binary relation. Given that M is generally difficult to compute directly, declarative LD frameworks compute an approximation \(M' \subseteq S \times T \times \mathbb {R}\) of M by executing a link specification (LS), which we define formally in the following.
An atomic LS L is a pair \(L=(m, \theta )\), where m is a similarity measure that compares properties of pairs (s, t) from \(S \times T\) and \(\theta \) is a similarity threshold. LSs can be combined by means of operators and filters. Here, we consider the binary operators \(\sqcup \), \(\sqcap \) and \(\backslash \), which stand for the union, intersection and difference of specifications, respectively. Filters are pairs \((f, \tau )\), where f is either empty (denoted \(\epsilon \)), a similarity measure or a combination of similarity measures, and \(\tau \) is a threshold.
A complex LS L is a triple \((f, \tau , \omega (L_1, L_2))\) where \(\omega \) is a specification operator and \((f, \tau )\) is a filter. An example of a LS is given in Fig. 1. Note that an atomic specification can be regarded as a filter \((f, \tau , X)\) with \(X = S \times T\). Thus, we will use the same graphical representation for filters and atomic specifications. We call \((f, \tau )\) the filter of L and denote it with \(\varphi (L)\). For our example, \(\varphi (L) = (\epsilon , 0.5)\). The operator of a LS L will be denoted op(L). For \(L = (f, \tau , \omega (L_1, L_2))\), \(op(L) = \omega \). In our example the operator of the LS is \(\backslash \). The size of L, denoted |L|, is defined as follows: If L is atomic, then \(|L| = 1\). For complex LSs \(L = (f, \tau , \omega (L_1, L_2))\), we set \(|L| = |L_1| + |L_2| + 1\). The LS shown in Fig. 1 has a size of 7. For \(L = (f, \tau , \omega (L_1, L_2))\), we call \(L_1\) resp. \(L_2\) the left resp. right direct child of L.
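The definitions above can be captured in a short sketch. The following Python fragment is illustrative only; the names `LS` and `size` are ours and not part of any LD framework. It models atomic and complex LSs as a recursive structure and computes |L|:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LS:
    """A link specification; atomic iff no operator is set.
    For atomic LSs, f holds the measure m and tau the threshold theta;
    for complex LSs, (f, tau) is the filter and op the operator."""
    f: Optional[str]
    tau: float
    op: Optional[str] = None   # 'union' | 'intersection' | 'difference'
    left: Optional["LS"] = None
    right: Optional["LS"] = None

def size(ls: LS) -> int:
    """|L| = 1 for atomic LSs, |L1| + |L2| + 1 otherwise."""
    if ls.op is None:
        return 1
    return size(ls.left) + size(ls.right) + 1
```

A LS shaped like the one in Fig. 1, i.e., a difference of a union and an intersection, each over two atomic LSs, then has size 1 + 3 + 3 = 7.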
We denote the semantics (i.e., the results of a LS for given sets of resources S and T) of a LS L by \([[L]]\) and call it a mapping. We begin by assuming the natural semantics of the combinations of measures. The semantics of LSs are then as shown in Table 1. To compute the mapping \([[L]]\) (which corresponds to the output of L for a given pair (S, T)), LD frameworks implement (at least parts of) a generic architecture consisting of an execution engine, an optional rewriter and a planner (see [1] for more details). The rewriter performs algebraic operations to transform the input LS L into a LS \(L'\) (with \([[L]] = [[L']]\)) that is potentially faster to execute. The most common planner is the canonical planner (dubbed Canonical), which simply traverses L in post-order and has its results computed in that order by the execution engine.Footnote 3 For the LS shown in Fig. 1, the execution plan returned by Canonical would thus first compute the mapping \(M_1 = [[(\texttt {cosine}(label, label), 0.4)]]\) of pairs of resources whose property label has a cosine similarity equal to, or greater than 0.4. The computation of \(M_2 = [[(\texttt {trigrams}(name, name), 0.8)]]\) would follow. Step 3 would be to compute \(M_3 = M_1 \sqcup M_2\) while abiding by the semantics described in Table 1. Step 4 would be to filter the results by only keeping pairs that have a similarity above 0.5 and so on. Given that there is a 1–1 correspondence between a LS and the plan generated by the canonical planner, we will reuse the representation of a LS devised above for plans. The sequence of steps for such a plan is then to be understood as the sequence of steps that would be derived by Canonical for the LS displayed.
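Since Table 1 is not reproduced here, the following sketch illustrates one plausible rendering of the mapping semantics, assuming the common choices of the maximum similarity for union and the minimum for intersection; it is an illustration under these assumptions, not the exact semantics of any particular framework. Mappings are modeled as dictionaries from (s, t) pairs to similarity scores:

```python
def union(m1, m2):
    """Keep every pair from either mapping; take the higher similarity."""
    out = dict(m1)
    for pair, sim in m2.items():
        out[pair] = max(sim, out.get(pair, 0.0))
    return out

def intersection(m1, m2):
    """Keep pairs present in both mappings; take the lower similarity."""
    return {p: min(s, m2[p]) for p, s in m1.items() if p in m2}

def difference(m1, m2):
    """Keep pairs of m1 that do not occur in m2."""
    return {p: s for p, s in m1.items() if p not in m2}

def filter_mapping(m, tau):
    """An (epsilon, tau) filter: keep pairs with similarity >= tau."""
    return {p: s for p, s in m.items() if s >= tau}
```

Under these semantics, steps 3 and 4 of the canonical plan above amount to `filter_mapping(union(M1, M2), 0.5)`.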
3 CONDOR
The goal of Condor is to improve the overall execution time of LSs. To this end, Condor aims to derive a time-efficient execution plan for a given input LS L. The basic idea behind state-of-the-art planners for LD (see [7]) is to approximate the costs of possible plans for L, and to simply select the least costly (i.e., the presumably fastest) plan so as to improve the execution costs. The selected plan is then forwarded to the execution engine and executed. We call this type of planning static planning because the selected plan is never changed. Condor addresses the planning and execution of LSs differently: Given an input LS L, Condor’s planner uses an initial cost function to generate initial plans P, of which each consists of a sequence of steps that are to be executed by Condor’s execution engine to compute L. The planner chooses the least costly plan and forwards it to the engine. After the execution of each step, the execution engine overwrites the planner’s cost function by replacing the estimated costs of the executed step with its real costs. The planner then re-evaluates the previously generated alternative plans and alters the remaining steps to be executed if the updated cost function suggests better expected runtimes for this alteration of the remaining steps. We call this novel paradigm for planning the execution of LSs dynamic planning.
3.1 Planning
Algorithm 1 summarizes the dynamic planning approach implemented by Condor. The algorithm (dubbed plan) takes a LS L as input and returns the plan P(L) with the smallest expected runtime. The core of the approach consists of (1) a cost function r which computes expected runtimes and (2) a recursive cost evaluation scheme. Condor’s planner begins by checking whether the input L has already been executed within the current run (line 2). If L has already been executed, there is no need to re-plan the LS. Instead, plan returns the known plan P(L). If L has not yet been executed, we proceed by first checking whether L is atomic. If L is atomic, we return \(P = run(m,\theta )\) (line 6), which simply computes \([[L]]\) on \(S \times T\). Here, we make use of existing scalable solutions for computing such mappings [1].
If \(L = (f, \tau , \omega (L_1, L_2))\), plan derives a plan for \(L_1\) and \(L_2\) (lines 10 and 11), computes the possible plans given op(L), and then decides on the least costly plan based on the cost function. The possible plans generated by Condor depend on the operator of L. For example, if \(op(L) = \sqcap \), then plan evaluates three alternative plans: (1) The canonical plan (lines 21, 23, 27, 31), which consists of executing \(P(L_1)\) and \(P(L_2)\), performing an intersection between the resulting mappings and then filtering the final mapping using \((f, \tau )\); (2) The filter-right plan (lines 24, 32), where the best plan \(P_1\) for \(L_1\) is executed, followed by a run of a filtering operation on the results of \(P_1\) using \((f_2, \tau _2) = \varphi (L_2)\) and then filtering the final mapping using \((f, \tau )\); (3) The filter-left plan (lines 28, 32), which is a filter-right plan with the roles of \(L_1\) and \(L_2\) reversed.
As mentioned in Sect. 1, Condor’s planning function re-uses results of previously executed LSs and plans. Hence, if both \(P_1\) and \(P_2\) have already been executed (\(r(P_1) = r(P_2) = 0\)), then the best plan is the canonical plan, where Condor will only need to retrieve the mappings of the two plans and then perform the intersection and the filtering operation (line 20). If \(P_1\) resp. \(P_2\) has already been executed (see line 22 resp. 26), then the algorithm decides between the canonical and the filter-right resp. filter-left plan. If no information is available, then the costs of the different alternatives are calculated based on our cost function described in Sect. 3.2 and the least costly plan is chosen. Similar approaches are implemented for \(op(L) = \backslash \) (lines 12–18). In particular, in line 17, the plan algorithm implements the filter-right plan by first executing the plan \(P_1\) for the left child and then constructing a “reverse filter” from \((f_2, \tau _2) = \varphi (L_2)\) by calling the getReverseFilter function. The resulting filter is responsible for allowing only links of the retrieved mapping of \(L_1\) that are not returned by \(L_2\). For \(op(L) = \sqcup \) (line 36) the plan always consists of merging the results of \(P(L_1)\) and \(P(L_2)\) by using the semantics described in Table 1.
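The choice between the three alternatives for \(op(L) = \sqcap \) can be condensed into a few lines. The sketch below is a simplification under our own naming (`plan_intersection`, `cost_filter` are hypothetical); it mirrors only the selection logic, not the construction of the plans themselves:

```python
def plan_intersection(r, P1, P2, cost_filter):
    """Choose among the canonical, filter-left and filter-right plans for
    L = (f, tau, L1 ⊓ L2), given a cost function r over the child plans
    and an estimated cost for the filtering/merging operations."""
    candidates = {
        'canonical':    r(P1) + r(P2) + cost_filter,  # run both children, intersect, filter
        'filter-right': r(P1) + cost_filter,          # run P1, filter its result with phi(L2)
        'filter-left':  r(P2) + cost_filter,          # run P2, filter its result with phi(L1)
    }
    # Executed plans have cost 0, so once both children are cached the
    # canonical plan wins: it only merges the stored mappings.
    if r(P1) == 0 and r(P2) == 0:
        return 'canonical'
    return min(candidates, key=candidates.get)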
3.2 Plan Evaluation
As explained in the first paragraphs of Sect. 2, one important component of Condor is the cost function required to estimate the cost of executing a given plan. Based on [8], we used the linear plan evaluation scheme introduced in [7]. A plan P is characterized by one basic component, r(P), the approximated runtime of executing P.
Approximation of r(P) for Atomic LSs. We compute r(P(L)) by assuming that the runtime of \(L=(m, \theta )\) can be approximated in linear time for each measure m using the following equation:

$$r(P(L)) = \gamma _0 + \gamma _1 |S| + \gamma _2 |T| + \gamma _3 \theta ,$$

where |S| is the size of the source KB, |T| is the size of the target KB and \(\theta \) is the threshold of the specification. We used a linear model with these parameters since the experiments in [7, 8] suggested that they are sufficient to produce accurate approximations. The next step of our plan evaluation approach was to estimate the parameters \(\gamma _0, \gamma _1, \gamma _2 \,\text{ and }\, \gamma _3\). However, the sizes of the source and target KBs are unknown prior to the linking task. Therefore, we used a sampling method, where we generated source and target datasets of sizes \(1000, 2000, \ldots , 10000\) by sampling data from the English labels of DBpedia 3.8, and stored the runtime of the measures implemented by our framework for different thresholds \(\theta \) between 0.5 and 1. Then, we computed the \(\gamma _i\) parameters as the solution of the linear regression problem \(\varGamma = (R^TR)^{-1}R^TY\), where \(\varGamma = (\gamma _0, \gamma _1, \gamma _2, \gamma _3)^T\), Y is a vector whose i-th row contains the runtime measured in the i\(^{th}\) experiment and R is a four-column matrix whose i-th row stores the corresponding experimental parameters \((1, |S|, |T|, \theta )\).
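The least-squares estimation of \(\varGamma \) can be reproduced with a few lines of NumPy. This is a generic sketch of the regression step (the function names are ours); we solve the normal equations directly rather than forming the inverse explicitly, which is numerically safer but equivalent:

```python
import numpy as np

def fit_runtime_model(experiments, runtimes):
    """Least-squares fit of Gamma = (R^T R)^{-1} R^T Y, where each row
    of R is (1, |S|, |T|, theta) and Y holds the measured runtimes."""
    R = np.array([(1.0, s, t, theta) for s, t, theta in experiments])
    Y = np.array(runtimes)
    # Solve the normal equations instead of computing the inverse.
    return np.linalg.solve(R.T @ R, R.T @ Y)

def predict_runtime(gamma, s, t, theta):
    """Approximate r(P(L)) for an atomic LS on KBs of sizes s and t."""
    return gamma @ np.array([1.0, s, t, theta])
```

Given runtimes recorded over the sampled dataset sizes and thresholds, `fit_runtime_model` returns the vector \((\gamma _0, \gamma _1, \gamma _2, \gamma _3)\).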
Approximation of r(P) for Complex LSs. For the canonical plan, r(P) is estimated by summing up the r(P) values of all plans that correspond to the child specifications of the complex LS. For the filter-right and filter-left plans, r(P) is estimated by summing the r(P) of the child LS whose plan is going to be executed and the approximated runtime of the filtering function derived from the other child LS. To estimate the runtime of a filtering function, we compute the approximation analogously to the computation of the runtime of an atomic LS.
Additionally, we define a set of rules for \(\omega = \sqcap \) and \(\omega = \backslash \): (1) r(P) includes only the sums of the child LSs that have not yet been executed. (2) If both children of the LS have been executed, then r(P) is set to 0. Thereby, we force the algorithm to choose the canonical plan over the other two options, since it creates the smallest overhead in the total runtime of Condor.
3.3 Execution
Algorithm 2 describes the execution of the plan that Algorithm 1 returned. The execute algorithm takes as input a LS L and returns the corresponding mapping M once all steps of P(L) have been executed. The algorithm begins in line 2, where execute returns the mapping M of L if L has already been executed and its result cached. If L has not been executed before, we proceed by checking whether a LS \(L'\) with \([[L]] \subseteq [[L']]\) has already been executed (line 7). If such an \(L'\) exists, then execute retrieves \(M' = [[L']]\) and runs \((f, \tau , [[L']])\) where \((f, \tau ) = \varphi (L)\) (line 9). If no such \(L'\) exists, the algorithm checks whether L is atomic. If this is the case, then \(P(L) = run(m,\theta )\) computes \([[L]]\). If \(L = (f, \tau , \omega (L_1, L_2))\), execute calls the plan function described previously.
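The caching and subsumption logic of execute can be sketched as follows. The helpers `run_atomic`, `apply_filter` and `is_subsumed` are hypothetical stand-ins for the components described above, and the recursion into plan for complex LSs is omitted; only the reuse behavior is modeled:

```python
def execute(ls, cache, run_atomic, apply_filter, is_subsumed):
    """Sketch of Condor's execute: reuse a cached result, derive the result
    from a subsuming LS, or run the (atomic) plan and cache the mapping."""
    if ls in cache:                          # L executed before: reuse its mapping
        return cache[ls]
    for other, mapping in list(cache.items()):
        if is_subsumed(ls, other):           # found L' with [[L]] ⊆ [[L']]
            result = apply_filter(mapping, ls)  # run (f, tau) of L on [[L']]
            cache[ls] = result
            return result
    result = run_atomic(ls)                  # atomic case; complex LSs re-enter plan()
    cache[ls] = result
    return result
```

The second call for the same LS hits the cache and triggers no further computation, which is exactly the behavior exploited in the example run below.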
3.4 Example Run
To elucidate the workings of Condor further, we use the LS described in Fig. 1 as a running example. Table 2 shows the cost function r(P) of each possible plan that can be produced for the specifications included in L, for the different calls of the plan function for L. The runtime value of a plan for a complex LS additionally includes a value for the filtering or set operations, wherever present. Recall that plan is a recursive function (lines 10, 11) and plans L in post-order (bottom-up, left-to-right). Condor produces a plan equivalent to the canonical plan for the left child due to the \(\sqcup \) operator. Then, it proceeds to find the least costly plan for the right child. For the right child, plan has to choose between the three alternatives described in Sect. 3.1. Table 2 shows the approximation r(P) of each plan for \((\sqcap ((\texttt {cosine}(label, label), 0.4),(\texttt {trigrams}(name, name), 0.8)), 0.5)\). The least costly plan for the right child is the filter-left plan, where \(L' = (\texttt {trigrams}(name,name), 0.8)\) is executed and \([[L']]\) is then filtered using \((\texttt {cosine}(label,label),0.4)\) and \((\epsilon ,0.5)\). Before proceeding to discover the best plan for L, Condor assigns an approximate runtime r(P) to each child plan of L: 3.5 s for the left child and 1.5 s for the right child.
Once Condor has identified the best plans for both children of L, it proceeds to find the most efficient plan for L. Since neither child has been executed previously, plan goes to line 15. There, it has to choose between two alternative plans, i.e., the canonical plan with \(r(P)=6.2\,\text {s}\) and the filter-right plan with \(r(P)=5.2\,\text {s}\). plan therefore selects the filter-right plan as the least costly plan for L. Note that this plan overwrites the filter-left plan of the right child and instead uses the right child as a filter.
Once the plan is finalized, the plan function returns and assigns the plan shown in Fig. 2a to P(L) in line 14. For the next step, execute retrieves the left child \((\sqcup ((\texttt {cosine}(label, label), 0.4),(\texttt {trigrams}(name, name), 0.8)), 0.5)\) and assigns it to \(L_1\) (line 15). Then, the algorithm calls execute for \(L_1\). execute repeats the plan procedure for \(L_1\) recursively and returns the plan illustrated in Fig. 3. The plan is executed and finally (line 16) the resulting mapping is assigned to \(M_1\). Remember that all intermediate mappings as well as the final mapping along with the corresponding LSs are stored for future use (line 29). Additionally, we replace the cost value estimations of each executed plan by their real values in line 28. Now, the cost value of \((\texttt {cosine}(label, label), 0.4)\) is set to 2.0 s, the cost value of \((\texttt {trigrams}(name, name), 0.8)\) is set to 1.0 s and, finally, the cost value of the left child is replaced by 4.0 s.
Now, given the runtimes from the execution engine, the algorithm re-plans the further steps of L. Within this second call of plan (line 17), Condor does not re-plan the sub-specification that corresponds to \(L_1\), since its plan (Fig. 3) has been executed previously. Initially, plan had decided to use the right child as a filter. However, both \((\texttt {cosine}(label, label), 0.4)\) and \((\texttt {trigrams}(name, name), 0.8)\) have already been executed. Hence, the new total cost of executing the right child is set to 0.0. Consequently, plan changes the remaining steps of the initial plan of L, since the cost of executing the canonical plan is now set to 0.0. The final plan is illustrated in Fig. 2b.
Once the new plan P(L) is constructed, execute checks if P(L) includes any operators. In our example, \(op(L)=\backslash \). Thus, we execute the second direct child of L as described in P(L), \(L_2 = (\sqcap ((\texttt {cosine}(label,label),0.4),(\texttt {trigrams}(name,name),0.8)),0.5)\). Algorithm 2 calls the execute function for \(L_2\), which calls plan. Condor’s planning algorithm then returns a plan for \(L_2\), which is obtained from the plan for the left child illustrated in Fig. 3 by replacing the \(\sqcup \) operator with the \(\sqcap \) operator, with \(r(P(L_2))=0\,\text {s}\).
When the algorithm proceeds to executing \(P(L_2)\), it discovers that the atomic LSs of \(L_2\) have already been executed. Thus, it retrieves the corresponding mappings, performs the intersection between the results of \((\texttt {cosine}(label,label),0.4)\) and \((\texttt {trigrams}(name,name),0.8)\), filters the resulting mapping of the intersection with \((\epsilon , 0.5)\) and stores the resulting mapping for future use (line 29). Returning to our initial LS L, the algorithm has now retrieved results for both \(L_1\) and \(L_2\) and proceeds to perform the steps described in lines 21 and 27. The final plan constructed by Condor is presented in Fig. 2b.
If the second call of the plan function for L in line 17 had not altered the initial P(L), then execute would have proceeded to apply a reverse filter (i.e., the implementation of the difference of mappings) on \(M_1\) by using \((\sqcap ((\texttt {cosine}(label,label),0.4),(\texttt {trigrams}(name,name),0.8)),0.5)\) (line 24). Similar operations would have been carried out if \(op(L)=\sqcap \) in line 26.
Overall, the complexity of Condor can be derived as follows: For each node of a LS L, Condor generates a constant number of possible plans. Hence, the complexity of each iteration of Condor is O(|L|). The execution engine executes at least one node in each iteration, meaning that it needs at most O(|L|) iterations to execute L completely. Hence, Condor’s worst-case runtime complexity is \(O(|L|^2)\).
4 Evaluation
4.1 Experimental Setup
The aim of our evaluation was to address the following questions: (\(Q_1\)) Does Condor achieve better runtimes for LSs? (\(Q_2\)) How much time does Condor spend planning? (\(Q_3\)) How do the different sizes of LSs affect Condor’s runtime? To address these questions, we evaluated our approach on seven datasets. The first four are the benchmark datasets for LD dubbed Abt-Buy, Amazon-Google Products, DBLP-ACM and DBLP-Scholar described in [9]. These are manually curated benchmark datasets collected from real data sources such as the publication sites DBLP and ACM as well as the Amazon and Google product websites. To assess the scalability of Condor, we created three additional datasets (MOVIES, TOWNS and VILLAGES, see Table 3) from the datasets DBpedia, LinkedGeoData and LinkedMDB.Footnote 4\(^,\)Footnote 5 Table 3 describes their characteristics and presents the properties used when linking the retrieved resources. The mapping properties were provided to the link discovery algorithms underlying our results. We generated 100 LSs for each dataset by using the unsupervised version of Eagle, a genetic programming approach for learning LSs [10]. We used this algorithm because it can detect LSs of high accuracy on the datasets at hand. We configured Eagle by setting the number of generations and the population size to 20, and the mutation and crossover rates to 0.6. All experiments were carried out on a 20-core Linux Server running OpenJDK 64-Bit Server 1.8.0.66 on Ubuntu 14.04.3 LTS on Intel Xeon CPU E5-2650 v3 processors clocked at 2.30 GHz. Each experiment was repeated three times. We report the average runtimes of each of the algorithms. Note that all three planners return the same set of links and that they hence all achieve 100% F-measure w.r.t. the LS to be executed.Footnote 6
4.2 Results
We compared the execution time of Condor with that of the state-of-the-art algorithm for planning (Helios [7]) as well as with the canonical planner implemented in Limes. We chose Limes because it is a state-of-the-art declarative framework for link discovery which ensures result completeness. Figure 4 shows the runtimes achieved by the different algorithms in different settings. As shown in Fig. 4a, Condor outperforms Canonical and Helios on all datasets. A Wilcoxon signed-rank test on the cumulative runtimes of the approaches (significance level = 99%) confirms that the differences in performance between Condor and the other approaches are statistically significant on all datasets. This observation and the statistical test clearly answer question \(Q_1\).
Figure 4a shows that our approach performs best on AMAZON-GP, where it can reduce the average runtime of the set of specifications by \(78\%\) compared to Canonical, making Condor 4.6 times faster. Moreover, for the same dataset, dynamic planning is 8.04 times more efficient than Helios. Note that finding a better plan than the canonical plan on this particular dataset is non-trivial (as shown by the Helios results). Here, our dynamic planning approach pays off by being able to revise the original plan at runtime early enough to achieve better results than the Canonical planner and Helios. The highest absolute difference is achieved on DBLP-Scholar, where Condor reduces the overall execution time of the Canonical planner on the 100 LSs by approximately 600 s per specification on average. On the same dataset, the difference between Condor and Helios is approximately 110 s per LS.
The answer to our second question is that the benefits of the dynamic planning strategy far outweigh the time required by the re-planning scheme (as shown in Fig. 4). Condor spends between \(0.0005\%\) (DBLP-SCHOLAR) and \(0.1\%\) (AMAZON-GP) of the overall runtime on planning. The specifications computed for the AMAZON-GP dataset have the largest average size of all datasets. On this particular dataset, Condor spends less than 10 ms on planning. We regard this result as particularly good, as using Condor brings larger benefits with growing specifications.
To answer \(Q_3\), we also computed the runtime of LSs depending on their size (see Figs. 4b and c). For LSs of size 1, the execution times achieved by all three planners are mostly comparable (difference of average runtimes = 0.02 s), since the plans produced are straightforward and leave no room for improvement. For specifications of size 3, Condor is already capable of generating plans that are 7.5% faster than the canonical plans on average. The gap between Condor and the state of the art increases with the size of the specifications. For specifications of size 7 and more, the plans generated by Condor require only 30.5% resp. 55.7% of the time needed by the plans generated by Canonical resp. Helios. A careful study of the plans generated by Condor reveals that the re-use of previously executed portions of a LS and the use of subsumption are clearly beneficial to the execution runtime of large LSs. However, the study also shows that in a few cases, Condor creates a filter-right or filter-left plan where a canonical plan would have been better. This is due to sub-optimal runtime approximations produced by the r(P) function.
5 Related Work
This paper addresses the creation of better plans for scalable link discovery. A large number of frameworks such as SILK [2], Limes [11] and KnoFuss [12] were developed to support the link discovery process. These frameworks commonly rely on scalable approaches for computing simple and complex specifications. For example, a lossless framework that uses blocking is SILK [2], a tool relying on rough index pre-matching. KnoFuss [12], on the other hand, implements classical blocking approaches derived from databases. These approaches are not guaranteed to achieve result completeness. Zhishi.links [13] is another framework that scales (through an indexing-based approach) but is not guaranteed to retrieve all links. The completeness of results is guaranteed by the Limes framework, which combines time-efficient algorithms such as Ed-Join and PPJoin+ with a set-theoretical combination strategy. The execution of LSs in Limes is carried out by means of the Canonical [11] and Helios [7] planners. Given that Limes was shown to outperform SILK in [7], we chose to compare our approach with Limes. The survey of Nentwig et al. [1] and the results of the Ontology Alignment Evaluation Initiative (OAEI) [14]Footnote 7 provide an overview of further link discovery systems.
Condor is the first dynamic planner for link discovery. The problem we tackled in this work bears some resemblance to the task of query optimization in databases [15]. There have been numerous advances which pertain to addressing this question, including strategies based on genetic programming [16], cost-based and heuristic optimizers [17], and dynamic approaches [18]. Dynamic approaches for query planning were the inspiration for the work presented herein.
6 Conclusion and Future Work
We presented Condor, a dynamic planner for link discovery. We showed how our approach combines dynamic planning with subsumption and result caching to outperform the state of the art by up to two orders of magnitude. Our results open up a number of questions. First, they suggest that Condor’s runtimes can be improved further by refining the cost function underlying the approach. Hence, we will study the use of more complex regression approaches for approximating the runtime of measures. Moreover, the parallel execution of plans will be studied in future work.
Notes
- 1.
- 2.
Note that some systems implement the rewriting and planning in an implicit manner.
- 3.
Note that the planner and engine are not necessarily distinct in existing implementations.
- 4.
- 5.
The new data and a description of how they were constructed are available at http://titan.informatik.uni-leipzig.de/kgeorgala/DATA/.
- 6.
Our complete experimental results can be found at http://titan.informatik.uni-leipzig.de/kgeorgala/condor_results.zip. Our open source code can be found at http://limes.sf.net.
- 7.
Ontology Alignment Evaluation Initiative: http://ontologymatching.org.
References
1. Nentwig, M., Hartung, M., Ngonga Ngomo, A.-C., Rahm, E.: A survey of current link discovery frameworks. Semant. Web 8, 1–18 (2015)
2. Isele, R., Jentzsch, A., Bizer, C.: Efficient multidimensional blocking for link discovery without losing recall. In: Marian, A., Vassalos, V. (eds.) WebDB (2011)
3. Ngonga Ngomo, A.-C., Auer, S.: LIMES - a time-efficient approach for large-scale link discovery on the web of data. In: Proceedings of IJCAI (2011)
4. Wang, J., Feng, J., Li, G.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. Proc. VLDB Endow. 3(1–2), 1219–1230 (2010)
5. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, pp. 131–140. ACM, New York (2008)
6. Sherif, M.A., Ngonga Ngomo, A.-C., Lehmann, J.: Wombat – a generalization approach for automatic link discovery. In: Blomqvist, E., Maynard, D., Gangemi, A., Hoekstra, R., Hitzler, P., Hartig, O. (eds.) ESWC 2017. LNCS, vol. 10249, pp. 103–119. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-58068-5_7
7. Ngonga Ngomo, A.-C.: HELIOS – execution optimization for link discovery. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 17–32. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_2
8. Georgala, K., Hoffmann, M., Ngonga Ngomo, A.-C.: An evaluation of models for runtime approximation in link discovery. In: Proceedings of the International Conference on Web Intelligence, WI 2017, pp. 57–64. ACM, New York (2017)
9. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1–2), 484–493 (2010)
10. Ngonga Ngomo, A.-C., Lyko, K.: EAGLE: efficient active learning of link specifications using genetic programming. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 149–163. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30284-8_17
11. Ngonga Ngomo, A.-C.: On link discovery using a hybrid approach. J. Data Semant. 1(4), 203–217 (2012)
12. Nikolov, A., d’Aquin, M., Motta, E.: Unsupervised learning of link discovery configuration. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 119–133. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30284-8_15
13. Niu, X., Rong, S., Zhang, Y., Wang, H.: Zhishi.links results for OAEI 2011. Ontol. Matching 184, 220 (2011)
14. Achichi, M., Cheatham, M., Dragisic, Z., Euzenat, J., Faria, D., Ferrara, A., Flouris, G., Fundulaki, I., Harrow, I., Ivanova, V., Jiménez-Ruiz, E., Kuss, E., Lambrix, P., Leopold, H., Li, H., Meilicke, C., Montanelli, S., Pesquita, C., Saveta, T., Shvaiko, P., Splendiani, A., Stuckenschmidt, H., Todorov, K., Trojahn, C., Zamazal, O.: Results of the ontology alignment evaluation initiative 2016. In: Proceedings of the 11th International Workshop on Ontology Matching, OM 2016, Co-located with the 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, 18 October 2016, vol. 1766, pp. 73–129. RWTH, Aachen (2016)
15. Silberschatz, A., Korth, H., Sudarshan, S.: Database Systems Concepts, 5th edn. McGraw-Hill Inc., New York (2006)
16. Bennett, K., Ferris, M.C., Ioannidis, Y.E.: A genetic algorithm for database query optimization. In: Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 400–407 (1991)
17. Kanne, C.C., Moerkotte, G.: Histograms reloaded: the merits of bucket diversity. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, pp. 663–674. ACM, New York (2010)
18. Ng, K.W., Wang, Z., Muntz, R.R., Nittel, S.: Dynamic query re-optimization. In: Eleventh International Conference on Scientific and Statistical Database Management, pp. 264–273. IEEE (1999)
Acknowledgments
This work has been supported by H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227) as well as the DFG project LinkingLOD (project no. NG 105/3-2) and the BMWI Projects SAKE (project no. 01MD15006E) and GEISER (project no. 01MD16014).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Georgala, K., Obraczka, D., Ngonga Ngomo, AC. (2018). Dynamic Planning for Link Discovery. In: Gangemi, A., et al. The Semantic Web. ESWC 2018. Lecture Notes in Computer Science(), vol 10843. Springer, Cham. https://doi.org/10.1007/978-3-319-93417-4_16