Keywords

1 Introduction

Keyword search is the most popular in information retrieval systems, where queries often are boolean or regular expressions over keywords, and information sources are text documents [4]. Lately, we observe a widespread application of keyword search paradigm to structured and semistructured data sources [68], in particular to relational databases [2, 11, 13].

In this paper, we focus on answering keyword queries issued against a P2P system populated with relational databases. The user expects that an answer will contain all relevant data stored in databases reachable from the chosen peer directly or indirectly. A keyword query is propagated in the P2P system and the peers (agents) translate it according to their database schemes into a relational database query. Partial answers are appropriately merged and sent back to a target peer. A crucial issue is then the strategy of query propagation and the way of merging partial answers. A peer can apply a partial merge (involving only answers returned to the peer) or a total merge (besides answers also the entire database stored on the peer is taken into account). The decision influence both the cost of the execution and the information contents in the final answer. Application of the total merge may infer some missing values, which otherwise were not discovered in the final answer. We propose and discuss some necessary condition that specifies whether it will be beneficial to apply the total merge.

In Sect. 2, we review some information about relational databases and keyword queries. Motivation and basic concepts, as well as the running example, are included in Sect. 3. P2P keyword query answering is discussed in Sect. 4. A necessary condition concerning making decision about total merge is formulated in Sect. 5. Section 6 concludes the paper.

2 Preliminaries

In this paper, to avoid overwhelming notation, we will consider only one-attribute integrity constraints in relational databases. Let Att be a set of attributes. We assume that a relational database schema is a triple \({\mathcal R}=(\mathbf{R},att,\mathbf{IC})\), where:

  • \(\mathbf{R}=\{R_1,\dots ,R_p\}\) – a set of relation names,

  • \(att:\mathbf{R}\rightarrow 2^\mathbf{Att}\) – a function assigning to each relational name \(R\in \mathbf{R}\) a set \(att(R)\subseteq \mathbf{Att}\) of attributes referred to as the type of R,

  • \(\mathbf{IC}=\mathbf{FD}\, \cup \, \mathbf{PK}\, \cup \, \mathbf{FK}\) – a set of integrity constraints, where: \(\mathbf{FD}\) is a set of functional dependencies, \(\mathbf{FD}=\{R.A\rightarrow R.A'\ \vert \ A,A'\in att(R)\}\); \(\mathbf{PK}\) is a set of primary keys (for each relation name there is at most one primary key), \(\mathbf{PK}=\{R.A\ \vert \ A\in att(R)\}\); \(\mathbf{FK}\) is a set of foreign keys, \(\mathbf{FK}=\{R.A\rightarrow R'.A'\ \vert \ A\in att(R), R'.A'\in \mathbf{PK}\}\).

Let \(\mathbf{Const}\) and \(\mathbf{Id}\) be, respectively, sets of constants (strings, for simplicity) and identifiers (or surrogate keys). A tuple of type R is a set \(r=\{(A_1:a_1),\dots ,(A_n:a_n)\}\), where \(\{A_1,\dots ,A_n\}=att(R)\), \(a_i\in \mathbf{Const}\cup \mathbf{Id}\). A fact is an expression R(r), where r is a tuple of type R. A database with schema \({\mathcal R}\) is a finite set of facts, \(DB=\{R(r)\ \vert \ R\in \mathbf{R},\ r\text { is a tuple of type }R\}\). A database is consistent if satisfies all integrity constraints in \(\mathbf{IC}\) [1]. Further on, we will consider only consistent databases.

A keyword query is a set of keywords, \(Q=\{q_1,\dots ,q_k\}\), where each \(q\in Q\) is a constant. A keyword searches tuples interconnected by means of foreign keys and selects tuples containing the keywords (all or some of them). A formula contain(r.Aq) is true for a tuple r if the value of attribute A contains a keyword q. In general, a keyword can occur in any string-valued column as well as any subset of keyword from Q can be taken into account. We will restrict ourselves to the situation described below.

We assume that for a keyword \(q\in Q\), there is a pair \((R_q,A_q)\) such that q can be searched in the column \(R_q.A_q\). A keyword query Q is translated into a tuple relational calculus [1] query trc(Q):

$$\begin{aligned} trc(Q)=\{(r_1,\dots ,r_m)\ \vert \ \varPhi (r_1,\dots ,r_m)\wedge \varPsi (r_1,\dots ,r_l) \}, \end{aligned}$$
(1)

where: (1) \(l\le m\); (2) for each tuple variable \(r_i\), a range formula \(R_i(r_i)\) is in \(\varPhi \); (3) \(\varPsi \) is a conjunction of formulas of the form \(contain(r_j.A_q,q)\), where \(R_q=R_j\); (4) for each i, \(1\le i< m\), \(R_i.A\rightarrow R_{i+1}.A'\in \mathbf{FK}\), for some \(A\in att(R_i)\), \(A'\in att(R_{i+1})\), and the join condition \(R_i.A= R_{i+1}.A'\) is in \(\varPhi \).

3 Motivation and Basic Concepts

Further on in this paper, we will present databases either in a tabular form or as sets of facts. Let us consider three relational databases \(DB_1\), \(DB_2\) and \(DB_3\) (Fig. 1), storing information about films and actors playing in those films. These databases are semantically homogeneous, but their structures differ slightly (in fact, only in presence or absence of some columns). Moreover, film identifiers (FilmId) have only local meaning and may be different in different databases (e.g., X1 in \(DB_2\) and Y1 in \(DB_3\)). So, values of FilmId are from \(\mathbf{Id}\), values of other attributes are from \(\mathbf{Const}\).

Fig. 1
figure 1

Databases \(DB_1\) (with the empty instance), \(DB_2\) and \(DB_3\)

Let us consider a keyword query of size 2, \(Q=\{Brody,Polanski\}\). We expect that an answer to the query should contain all relevant data from \(DB_1\), \(DB_2\) and \(DB_3\). Let DB be a database created from \(DB_1\) by renaming Film1 to Film, and PlaysIn1 to PlaysIn. Then the translation of Q to trc(Q) with respect to a DB is:

$$\begin{aligned} \begin{array}{l} trc(Q)=\{(p,f)\ \vert \ PlaysIn(p)\wedge Film(f)\wedge p.FilmId=f.FilmId\\ \wedge contain(p.Name, {\text {``}}Brody{\text {''}}) \wedge contain(f.Director,{\text {``}}Polanski{\text {''}})\}. \end{array} \end{aligned}$$
(2)

Now, let us consider a naive way of answering Q by applying trc(Q) to DB. First, data from \(DB_1\), \(DB_2\), and \(DB_3\) are transformed to DB, as follows:

$$\begin{aligned} \begin{array}{l} Film=Film1\cup Film2\cup Film3, \\ PlaysIn=PlaysIn1\cup PlaysIn2\cup PlaysIn3. \end{array} \end{aligned}$$
(3)

It is easy to see that the answer produced by Q(DB) (or by trc(Q)(DB)) is empty. It contradicts our intuition since we expect that the answer should be: (R.Brody, USA, 2002, “The Pianist, R.Polanski), that can be deduced from the database. To this order, the following problems must be solved: (a) dealing with identifiers: X1 and Y1 denote in fact the same entity but they are different; (b) during the transformation some values in target tables are missing, i.e., can be represented by \(\mathtt{NULL}\)s; (c) constraints from \(\mathbf{IC}\) may be used; (d) direct transformation of entire databases i inefficient and should be avoided.

4 P2P Keyword Query Answerin

In this section, we discuss a way for solving problems enumerated in the end of Sect. 3. First, we will show how the theory of data exchange [10] may be used to deal with identifiers, null values and constraints. Next, a way utilizing P2P environment will be shown.

4.1 Query Answering Using Data Exchange with Constraints

We apply the theory of data exchange developed by Fagin et al. [9, 10]. In relational database data exchange, a source database with a schema \({\mathcal R}_s\) is transformed to a target database with a schema \({\mathcal R}_t\), by means of a set \(\Sigma \) of source-to-target dependencies (STDs) and a set \(\varDelta \) of equality-generating-dependencies (EGDs).

We assume that in a target database: (1) a source constant value from \(\mathbf{Const}\) different than \(\mathtt{NULL}\), is represented by itself; (2) a source identifier (such as a value of FilmId) is mapped to a variable (a labeled nulls) in a set \(\mathsf {Var}_\mathsf{{Id}}\); (3) a missing value and \(\mathtt{NULL}\) value is mapped to a variable in a set \(\mathsf {Var}_\mathsf{{V}}\). \(\mathbf{Const}\), \(\mathsf {Var}_\mathsf{{Id}}\) and \(\mathsf {Var}_\mathsf{{V}}\) are pairwise disjoint. Mapping between identifiers and variables in \(\mathsf {Var}_\mathsf{{Id}}\) is made by means of a Skolem function SkId(), that assigns to an identifier a unique and fresh variable from \(\mathsf {Var}_\mathsf{{Id}}\). SkId() is defined for any source database separately, i.e., equal identifiers in the same databases are mapped to the same variable, but equal identifiers in different databases are mapped to different variables.

The following set \(\Sigma \) of STDs defines transformation of union of source databases \(DB_1\), \(DB_2\) and \(DB_3\) to a target database DB:

$$\begin{aligned} \begin{array}{clll} \Sigma =\{ &{} Film1(x,y,z)\wedge v=SkId_1(x) &{}\Rightarrow &{} Film(v,y,z), \\ &{} PlaysIn1(x,y,z)\wedge w=SkId_1(z) &{}\Rightarrow &{} PlaysIn(x,y,w),\\ &{} Film2(x,y)\wedge v=SkId_2(x) &{}\Rightarrow &{} \exists z Film(v,y,z), \\ &{} PlaysIn2(x,y,z)\wedge w=SkId_2(z) &{}\Rightarrow &{} PlaysIn(x,y,w),\\ &{} Film3(x,y,z)\wedge v=SkId_3(x) &{}\Rightarrow &{} Film(v,y,z), \\ &{} PlaysIn2(x,z)\wedge w=SkId_3(z) &{}\Rightarrow &{} \exists y PlaysIn(x,y,w)\}. \end{array} \end{aligned}$$
(4)

A transformation process in data exchange is referred to as chase [3, 10]. In our case, in the result of chasing \(\{DB_1,DB_2,DB_3\}\) with respect to \(\Sigma \), and into a database with the schema of DB, denoted \(chase_{\Sigma }^{DB}(\{DB_1,DB_2,DB_3\})\), we obtain a database:

figure a

In the schema of \(DB'\), we have the following set of functional dependencies:

$$\begin{array}{l} FD_{Film}=\{FilmId\rightarrow Title,Director; Title\rightarrow FilmId,Director\}\\ FD_{PlaysIn}=\{Name\rightarrow Country; FilmId\rightarrow Year\} \end{array}$$

FD implies the following set of equality-generating-dependencies (EGDs):

$$\begin{aligned} \begin{array}{cl} \varDelta =\{&{} Film(x,y_1,z_1)\wedge Film(x,y_2,z_2)\Rightarrow y_1=y_2\wedge z_1=z_2\\ &{} Film(x_1,y,z_1)\wedge Film(x_2,y,z_2)\Rightarrow x_1=x_2\wedge z_1=z_2\\ &{} PlaysIn(x,y_1,z_1)\wedge PlaysIn(x,y_2,z_2)\Rightarrow y_1=y_2\wedge z_1=z_2\}. \end{array} \end{aligned}$$
(5)

Now, we chase \(DB'\) with respect to \(\varDelta \), and obtain \(DB''=chase_{\varDelta }^{DB}(DB')\), where EGDs generates equalities: \(X_2=X_1\), \(V_1\)=“R. Polanski”, \(V_2\)=“England”:

figure b

Then the answer to Q is (variables in tuples belonging to the result are discarded):

$$\begin{aligned} \begin{array}{c} Q(\{DB_1,DB_2,DB_3\})=Q(chase_{\Sigma \cup \varDelta }^{DB}(\{DB_1,DB_2,DB_3\})) = Q(DB'')=\\ \{(A.Brody, USA, 2002, The\ Pianist, R. Polanski)\}. \end{array} \end{aligned}$$
(6)

4.2 Query Answering in P2P System

Now we assume that databases depicted in Fig. 1 are stored in three peers forming a P2P data integration system (Fig. 2). We will discuss how the keyword query Q can be propagated and answered in the system, and how the way of the propagation influence the contents of the answer to the query.

Fig. 2
figure 2

A sample P2P integration architecture with three peers (agents) and three local databases stored in peers

Fig. 3
figure 3

Query answering in P2P system with different propagation strategies and with partial and total merge

In rewriting procedures in P2P query answering, we will need the following translation. Let \(Q=\{q_1,\dots ,q_k\}\) be a keyword query, then

$$\begin{aligned} dlSet(Q)=\{dl(q_1),\dots ,dl(q_k)\}, \end{aligned}$$
(7)

where dl(q) is a datalog query over a database schema \({\mathcal R}\) and returning a database with a schema being a subset of \({\mathcal R}\).

Query Q can be answered using different strategies. We will consider three strategies depicted in Fig. 3. Moreover, in strategies (2) and (3), two different ways of merge are applied: in strategy (2) \(Peer_2\) applies either partial or total merge, similarly – \(Peer_3\) uses partial or total merge in strategy (3).

First, a keyword query \(Q=\{Brody, Polanski\}\) is sent to \(Peer_1\) that rewrites it to \(Q_1\) and translates to \(dlSet(Q_1)\) using the knowledge included in the schema of its database. Next, \(Peer_1\) propagates Q to its partners and waits for answers. Finally, because \(Peer_1\) is the target peer, it is responsible for producing the final answer. This is done by executing the query \(trc(Q_1)\) against the result of the partial or total merge over collected answers at \(Peer_1\).

Every partner, \(Peer_2\) and \(Peer_3\) in this case, behaves in the same way, i.e., rewrites and translates the query, propagates it to its partners and waits for answers. Additionally, a peer decides about the way of merging answers obtained from partners: in the partial merge only answers are taken into account, in the case of total merge besides of answers also the peer’s entire database is taken into account. The decision on this stage is significant since influence both the cost and information contents of the final answer. In rest of this section, we will illustrate this issue.

We have the following translations:

$$ \begin{array}{l} Q_1=\{Brody, Polanski\},\ \ Q_2=\{Brody\},\ \ Q_3=\{Brody, Polanski\};\\ dlSet(Q_1)= dlSet(Q_3)=\{dl(Brody), dl(Polanski)\},\\ dlSet(Q_2)=\{dl(Brody)\},\\ \end{array} $$

where, for example (prefixes .tgt and .src distinguish target tables from the source ones):

$$ \begin{array}{l} dl_{DB_2}(Brody)=\\ \; tgt.PlaysId2(n,c,f)\leftarrow src.PlaysIn2(n,c,f), contain(n,{\text {``}}Brody{\text {''}})\\ \; tgt.Film(f,t)\leftarrow tgt.PlaysIn(n,c,f), src.Film2(f,t);\\ dl_{DB_3}(Polanski)= \\ \; tgt.Film3(f,t,d)\leftarrow src.Film3(f,t,d)\},contain(d,{\text {``}}Polanski{\text {''}}). \end{array} $$

Analogously, for \(dl_{DB_1}(Brody)\), \(dl_{DB_1}(Polanski)\), and \(dl_{DB_3}(Brody)\).

We have:

$$ \begin{array}{l} Q_1(DB_1)=dlSet(Q_1)(DB_1)=\emptyset ;\\ Q_2(DB_2)=dlSet(Q_2)(DB_2)=\{PlaysIn2(Name:A.Brody,Country:USA,\\ \qquad \qquad \;\;\, FilmId:X1),Film2(FilmId:X1,Title:The\ Pianist)\},\\ Q_3(DB_3)=dlSet(Q_3)(DB_3)=\{Film3(FilmId:Y1,Title:The\ Pianist,\\ \qquad \qquad \;\;\, Director:R.Polanski)\}. \end{array} $$

In the following, we denote: \(Ans_j(DB_i)\) – a result of a merge by means of the chase procedure restricted to the schema of \(DB_i\); \(Ans_j\) – the final result obtained by executing trc(Q) over \(Ans_j(DB_1)\) (since in our example \(Peer_1\) is the final peer). Attribute names are omitted. Additionally, in the final result all variables (labeled nulls) are discarded.

Strategy (1)

Q is sent to \(Peer_1\), and \(Peer_1\) propagates it to \(Peer_2\) and \(Peer_3\) (Fig. 3(1)). All received answers: \(Q_1(DB_1)\), \(Q_2(DB_2)\), and \(Q_3(DB_3)\), are merged:

$$ \begin{array}{l} Ans_1(DB_1)=chase_{\Sigma \cup \varDelta }^{DB_1}(Q_1(DB_1),Q_2(DB_2),Q_3(DB_3))\\ =\{PlaysIn1(A.Brady,USA,X_1,V_1),Film1(X_1,The\ Pianist,R.Polanski)\}.\\ \end{array} $$

The final result is (note that a value of Year is not given):

$$Ans_1=(A. Brady,USA,The\ Pianist,R.Polanski).$$

Strategy (2)

Now, \(Peer_2\) propagates Q to \(Peer_3\) and waits for an answer \(Q_3(DB_3)\), Fig. 3(2). Then decides about partial or total merge:

  • Partial merge:

    $$ \begin{array}{l} Ans_2(DB_2)=chase_{\Sigma \cup \varDelta }^{DB_2}(Q_2(DB_2),Q_3(DB_3))\\ =\{PlaysIn2(A.Brady,USA,X_1),Film2(X_1,The\ Pianist)\}.\\ \end{array} $$

    Next, \(Ans_2(DB_2)\) is sent to \(Peer_1\), and merged there with \(Q_1(DB_1)\) and \(Q_3(DB_3)\). The final answer \(Ans_2\) is equal to \(Ans_1\).

  • Total merge: The entire database \(DB_2\) is merged with \(Q_3(DB_3)\). However, the result is the same as by the partial merge, i.e.,

    $$ \begin{array}{l} Ans_3(DB_2)=chase_{\Sigma \cup \varDelta }^{DB_2}(DB_2,Q_3(DB_3))=Ans_2(DB_2). \end{array} $$

    The final answer \(Ans_3\) is again equal to \(Ans_1\).

Strategy (3)

Now, \(Peer_3\) propagates Q to \(Peer_2\) and waits for \(Q_2(DB_2)\), Fig. 3(3). Then decides about partial or total merge:

  • Partial merge:

    $$ \begin{array}{l} Ans_4(DB_3)=chase_{\Sigma \cup \varDelta }^{DB_3}(Q_3(DB_3),Q_2(DB_2))\\ =\{PlaysIn3(A.Brady,X_1,V_1),Film3(X_1,The\ Pianist,R.Polanski)\}.\\ \end{array} $$

    Again, the final answer \(Ans_4\) is equal to \(Ans_1\).

  • Total merge:

    $$ \begin{array}{l} Ans_5(DB_3)=Q_3(chase_{\Sigma \cup \varDelta }^{DB_3}(Q_2(DB_2),DB_3))=\\ =\{PlaysIn3(A.Brady,X_1,2002),Film3(X_1,The\ Pianist,R.Polanski)\}.\\ \end{array} $$

    Now, the final answer has additionally value 2002 of Year inferred in the integration process:

    $$ Ans_5=(A. Brady,USA,2002,The\ Pianist,R.Polanski). $$

5 Strategy Choice Criterion

In the previous section, we discussed some ways (strategies) for propagation, and answering keyword queries in P2P systems. It turns out that both costs of the execution and information contents of the final result may be different. Here we focus on the way of merging (partial or total) that is to be decided in a peer. It follows from the considered example in Sect. 4.2 that:

  1. 1.

    In the case of strategy (2), in the both ways of merging the results were the same, i.e., \(Ans_2=Ans_3\). Of course, applying of the total merge is much more costly than applying the partial merge. So, the total merge should be applied only if there is a chance to discover some missing values (i.e., to replace variables by constants).

  2. 2.

    In strategy (3), applying of the total merge was profitable. We were able to discover (infer) a value of Year, i.e., that Brody played in “The Pianist” in year 2002. Thus, \(Ans_5\supset Ans_4\).

Now, we will discuss a necessary condition for deciding the total merge in a peer.

Definition 1

Let Ans be an answer to a query Q, \(R(r)\in Ans\), and \((A:v)\in r\), where \(v\in \mathsf {Var}_\mathsf{{V}}\) and A is an attribute of R. Then we call v a missing value of A in Ans.

For example, \(V_1\) is a missing value of Year in \(Ans_4(DB_3)\), however, \(X_1\) in this answer is not a missing value since \(X_1\) is in \(\mathsf {Var}_\mathsf{{Id}}\) – not in \(\mathsf {Var}_\mathsf{{V}}\).

Theorem 1

Let v be a missing value of A in Ans, where \((A:v)\in r\) and \(R(r)\in Ans\). The necessary condition for discovering the value of v in a DB with a schema \((\mathbf{R},\mathbf{IC})\):

  • there is \(A'\in att(R)\) such that \(A'\rightarrow A\in \mathbf{IC}\), and

  • there is \(q\in Q\) such that: \(R=R_q\), \(A_q\in att(R)\), \(contains(A_q,q)\) occurs in trc(Q), and \(A'\rightarrow A_q\notin \mathbf{IC}\).

Then we say that the missing value of v can be discovered by means of the constraint (functional dependency) \(A'\rightarrow A\).

For example, \(Brody\in Q\), and \(contains(Name,``Brody'')\) occurs in trc(Q). Moreover, \(FilmId\rightarrow Year\in \mathbf{IC}\) and \(FilmId\rightarrow Name\notin \mathbf{IC}\). Thus, we see that the promise of Theorem 1 is satisfied for discovering the missing value of \(V_1\) in \(Ans_4(DB_3)\) by means of the constraint \(FilmId\rightarrow Year\) (referring to \(DB_3\)).

Proof

(of Theorem 1) Assume that the promise of the theorem is satisfied. Let \((A:v)\in r\), where: \(R(r)\in Ans\), \(R\in \mathbf{R}\). Additionally, let us assume that \((A':c_1)\in r\) and there is a tuple \(r'\) of type att(R), such that:

  • \(R(r')\in DB\),

  • \((A':c_1)\in r'\),

  • \((A:c_2)\in r'\),

  • \((A_q:c_3)\in r'\),

  • \(contains(A_q:q)\) is false in \(r'\).

By the assumption \(A'\rightarrow A\in \mathbf{IC}\), so it follows from the corresponding EGD:

$$r.A'=r'.A'\Rightarrow r.A=r'.A,$$

that \(v=c_2\). Moreover, since \(contains(A_q:q)\) is false in \(r'\), the above substitution cannot be inferred by means of the partial merge. \(\square \)

6 Conclusions

We discussed the problem of answering keyword queries issued against a P2P-connected relational databases. In such a system, there are many autonomous peers (agents), which collaborates in the process of producing answers to queries. The system is flexible and peers can enter and leave the system dynamically. A query posed to a peer is propagated to its partners along semantic paths defined by mappings, those partners propagate the query to their partners, etc. Answers to the query flow back in the opposite directions. The target peer merge the answers producing the expected answer. We focused on translating and propagating keyword queries, as well as on the way of merging partial answers by a peer. The proper decision significantly influence both the quality of the answer as well as efficiency of processing. The discussed strategy was inspired by the SixP2P system originally designed for integrating XML data [5, 12].