Abstract
We discuss the way of using keyword search in the case of P2P connected relational databases. Each peer is then an agent deciding about translation and propagation of a keyword query, as well as about the way of merging partial answers. The way of merging (partial or total) is essential for both efficiency of execution and information contents in the final answer. We prove a theorem stating a necessary condition for deciding about the strategy of merging made by a peer.
Similar content being viewed by others
Keywords
1 Introduction
Keyword search is the most popular in information retrieval systems, where queries often are boolean or regular expressions over keywords, and information sources are text documents [4]. Lately, we observe a widespread application of keyword search paradigm to structured and semistructured data sources [6–8], in particular to relational databases [2, 11, 13].
In this paper, we focus on answering keyword queries issued against a P2P system populated with relational databases. The user expects that an answer will contain all relevant data stored in databases reachable from the chosen peer directly or indirectly. A keyword query is propagated in the P2P system and the peers (agents) translate it according to their database schemes into a relational database query. Partial answers are appropriately merged and sent back to a target peer. A crucial issue is then the strategy of query propagation and the way of merging partial answers. A peer can apply a partial merge (involving only answers returned to the peer) or a total merge (besides answers also the entire database stored on the peer is taken into account). The decision influence both the cost of the execution and the information contents in the final answer. Application of the total merge may infer some missing values, which otherwise were not discovered in the final answer. We propose and discuss some necessary condition that specifies whether it will be beneficial to apply the total merge.
In Sect. 2, we review some information about relational databases and keyword queries. Motivation and basic concepts, as well as the running example, are included in Sect. 3. P2P keyword query answering is discussed in Sect. 4. A necessary condition concerning making decision about total merge is formulated in Sect. 5. Section 6 concludes the paper.
2 Preliminaries
In this paper, to avoid overwhelming notation, we will consider only one-attribute integrity constraints in relational databases. Let Att be a set of attributes. We assume that a relational database schema is a triple \({\mathcal R}=(\mathbf{R},att,\mathbf{IC})\), where:
-
\(\mathbf{R}=\{R_1,\dots ,R_p\}\) – a set of relation names,
-
\(att:\mathbf{R}\rightarrow 2^\mathbf{Att}\) – a function assigning to each relational name \(R\in \mathbf{R}\) a set \(att(R)\subseteq \mathbf{Att}\) of attributes referred to as the type of R,
-
\(\mathbf{IC}=\mathbf{FD}\, \cup \, \mathbf{PK}\, \cup \, \mathbf{FK}\) – a set of integrity constraints, where: \(\mathbf{FD}\) is a set of functional dependencies, \(\mathbf{FD}=\{R.A\rightarrow R.A'\ \vert \ A,A'\in att(R)\}\); \(\mathbf{PK}\) is a set of primary keys (for each relation name there is at most one primary key), \(\mathbf{PK}=\{R.A\ \vert \ A\in att(R)\}\); \(\mathbf{FK}\) is a set of foreign keys, \(\mathbf{FK}=\{R.A\rightarrow R'.A'\ \vert \ A\in att(R), R'.A'\in \mathbf{PK}\}\).
Let \(\mathbf{Const}\) and \(\mathbf{Id}\) be, respectively, sets of constants (strings, for simplicity) and identifiers (or surrogate keys). A tuple of type R is a set \(r=\{(A_1:a_1),\dots ,(A_n:a_n)\}\), where \(\{A_1,\dots ,A_n\}=att(R)\), \(a_i\in \mathbf{Const}\cup \mathbf{Id}\). A fact is an expression R(r), where r is a tuple of type R. A database with schema \({\mathcal R}\) is a finite set of facts, \(DB=\{R(r)\ \vert \ R\in \mathbf{R},\ r\text { is a tuple of type }R\}\). A database is consistent if satisfies all integrity constraints in \(\mathbf{IC}\) [1]. Further on, we will consider only consistent databases.
A keyword query is a set of keywords, \(Q=\{q_1,\dots ,q_k\}\), where each \(q\in Q\) is a constant. A keyword searches tuples interconnected by means of foreign keys and selects tuples containing the keywords (all or some of them). A formula contain(r.A, q) is true for a tuple r if the value of attribute A contains a keyword q. In general, a keyword can occur in any string-valued column as well as any subset of keyword from Q can be taken into account. We will restrict ourselves to the situation described below.
We assume that for a keyword \(q\in Q\), there is a pair \((R_q,A_q)\) such that q can be searched in the column \(R_q.A_q\). A keyword query Q is translated into a tuple relational calculus [1] query trc(Q):
where: (1) \(l\le m\); (2) for each tuple variable \(r_i\), a range formula \(R_i(r_i)\) is in \(\varPhi \); (3) \(\varPsi \) is a conjunction of formulas of the form \(contain(r_j.A_q,q)\), where \(R_q=R_j\); (4) for each i, \(1\le i< m\), \(R_i.A\rightarrow R_{i+1}.A'\in \mathbf{FK}\), for some \(A\in att(R_i)\), \(A'\in att(R_{i+1})\), and the join condition \(R_i.A= R_{i+1}.A'\) is in \(\varPhi \).
3 Motivation and Basic Concepts
Further on in this paper, we will present databases either in a tabular form or as sets of facts. Let us consider three relational databases \(DB_1\), \(DB_2\) and \(DB_3\) (Fig. 1), storing information about films and actors playing in those films. These databases are semantically homogeneous, but their structures differ slightly (in fact, only in presence or absence of some columns). Moreover, film identifiers (FilmId) have only local meaning and may be different in different databases (e.g., X1 in \(DB_2\) and Y1 in \(DB_3\)). So, values of FilmId are from \(\mathbf{Id}\), values of other attributes are from \(\mathbf{Const}\).
Let us consider a keyword query of size 2, \(Q=\{Brody,Polanski\}\). We expect that an answer to the query should contain all relevant data from \(DB_1\), \(DB_2\) and \(DB_3\). Let DB be a database created from \(DB_1\) by renaming Film1 to Film, and PlaysIn1 to PlaysIn. Then the translation of Q to trc(Q) with respect to a DB is:
Now, let us consider a naive way of answering Q by applying trc(Q) to DB. First, data from \(DB_1\), \(DB_2\), and \(DB_3\) are transformed to DB, as follows:
It is easy to see that the answer produced by Q(DB) (or by trc(Q)(DB)) is empty. It contradicts our intuition since we expect that the answer should be: (R.Brody, USA, 2002, “The Pianist”, R.Polanski), that can be deduced from the database. To this order, the following problems must be solved: (a) dealing with identifiers: X1 and Y1 denote in fact the same entity but they are different; (b) during the transformation some values in target tables are missing, i.e., can be represented by \(\mathtt{NULL}\)s; (c) constraints from \(\mathbf{IC}\) may be used; (d) direct transformation of entire databases i inefficient and should be avoided.
4 P2P Keyword Query Answerin
In this section, we discuss a way for solving problems enumerated in the end of Sect. 3. First, we will show how the theory of data exchange [10] may be used to deal with identifiers, null values and constraints. Next, a way utilizing P2P environment will be shown.
4.1 Query Answering Using Data Exchange with Constraints
We apply the theory of data exchange developed by Fagin et al. [9, 10]. In relational database data exchange, a source database with a schema \({\mathcal R}_s\) is transformed to a target database with a schema \({\mathcal R}_t\), by means of a set \(\Sigma \) of source-to-target dependencies (STDs) and a set \(\varDelta \) of equality-generating-dependencies (EGDs).
We assume that in a target database: (1) a source constant value from \(\mathbf{Const}\) different than \(\mathtt{NULL}\), is represented by itself; (2) a source identifier (such as a value of FilmId) is mapped to a variable (a labeled nulls) in a set \(\mathsf {Var}_\mathsf{{Id}}\); (3) a missing value and \(\mathtt{NULL}\) value is mapped to a variable in a set \(\mathsf {Var}_\mathsf{{V}}\). \(\mathbf{Const}\), \(\mathsf {Var}_\mathsf{{Id}}\) and \(\mathsf {Var}_\mathsf{{V}}\) are pairwise disjoint. Mapping between identifiers and variables in \(\mathsf {Var}_\mathsf{{Id}}\) is made by means of a Skolem function SkId(), that assigns to an identifier a unique and fresh variable from \(\mathsf {Var}_\mathsf{{Id}}\). SkId() is defined for any source database separately, i.e., equal identifiers in the same databases are mapped to the same variable, but equal identifiers in different databases are mapped to different variables.
The following set \(\Sigma \) of STDs defines transformation of union of source databases \(DB_1\), \(DB_2\) and \(DB_3\) to a target database DB:
A transformation process in data exchange is referred to as chase [3, 10]. In our case, in the result of chasing \(\{DB_1,DB_2,DB_3\}\) with respect to \(\Sigma \), and into a database with the schema of DB, denoted \(chase_{\Sigma }^{DB}(\{DB_1,DB_2,DB_3\})\), we obtain a database:
In the schema of \(DB'\), we have the following set of functional dependencies:
FD implies the following set of equality-generating-dependencies (EGDs):
Now, we chase \(DB'\) with respect to \(\varDelta \), and obtain \(DB''=chase_{\varDelta }^{DB}(DB')\), where EGDs generates equalities: \(X_2=X_1\), \(V_1\)=“R. Polanski”, \(V_2\)=“England”:
Then the answer to Q is (variables in tuples belonging to the result are discarded):
4.2 Query Answering in P2P System
Now we assume that databases depicted in Fig. 1 are stored in three peers forming a P2P data integration system (Fig. 2). We will discuss how the keyword query Q can be propagated and answered in the system, and how the way of the propagation influence the contents of the answer to the query.
In rewriting procedures in P2P query answering, we will need the following translation. Let \(Q=\{q_1,\dots ,q_k\}\) be a keyword query, then
where dl(q) is a datalog query over a database schema \({\mathcal R}\) and returning a database with a schema being a subset of \({\mathcal R}\).
Query Q can be answered using different strategies. We will consider three strategies depicted in Fig. 3. Moreover, in strategies (2) and (3), two different ways of merge are applied: in strategy (2) \(Peer_2\) applies either partial or total merge, similarly – \(Peer_3\) uses partial or total merge in strategy (3).
First, a keyword query \(Q=\{Brody, Polanski\}\) is sent to \(Peer_1\) that rewrites it to \(Q_1\) and translates to \(dlSet(Q_1)\) using the knowledge included in the schema of its database. Next, \(Peer_1\) propagates Q to its partners and waits for answers. Finally, because \(Peer_1\) is the target peer, it is responsible for producing the final answer. This is done by executing the query \(trc(Q_1)\) against the result of the partial or total merge over collected answers at \(Peer_1\).
Every partner, \(Peer_2\) and \(Peer_3\) in this case, behaves in the same way, i.e., rewrites and translates the query, propagates it to its partners and waits for answers. Additionally, a peer decides about the way of merging answers obtained from partners: in the partial merge only answers are taken into account, in the case of total merge besides of answers also the peer’s entire database is taken into account. The decision on this stage is significant since influence both the cost and information contents of the final answer. In rest of this section, we will illustrate this issue.
We have the following translations:
where, for example (prefixes .tgt and .src distinguish target tables from the source ones):
Analogously, for \(dl_{DB_1}(Brody)\), \(dl_{DB_1}(Polanski)\), and \(dl_{DB_3}(Brody)\).
We have:
In the following, we denote: \(Ans_j(DB_i)\) – a result of a merge by means of the chase procedure restricted to the schema of \(DB_i\); \(Ans_j\) – the final result obtained by executing trc(Q) over \(Ans_j(DB_1)\) (since in our example \(Peer_1\) is the final peer). Attribute names are omitted. Additionally, in the final result all variables (labeled nulls) are discarded.
Strategy (1)
Q is sent to \(Peer_1\), and \(Peer_1\) propagates it to \(Peer_2\) and \(Peer_3\) (Fig. 3(1)). All received answers: \(Q_1(DB_1)\), \(Q_2(DB_2)\), and \(Q_3(DB_3)\), are merged:
The final result is (note that a value of Year is not given):
Strategy (2)
Now, \(Peer_2\) propagates Q to \(Peer_3\) and waits for an answer \(Q_3(DB_3)\), Fig. 3(2). Then decides about partial or total merge:
-
Partial merge:
$$ \begin{array}{l} Ans_2(DB_2)=chase_{\Sigma \cup \varDelta }^{DB_2}(Q_2(DB_2),Q_3(DB_3))\\ =\{PlaysIn2(A.Brady,USA,X_1),Film2(X_1,The\ Pianist)\}.\\ \end{array} $$Next, \(Ans_2(DB_2)\) is sent to \(Peer_1\), and merged there with \(Q_1(DB_1)\) and \(Q_3(DB_3)\). The final answer \(Ans_2\) is equal to \(Ans_1\).
-
Total merge: The entire database \(DB_2\) is merged with \(Q_3(DB_3)\). However, the result is the same as by the partial merge, i.e.,
$$ \begin{array}{l} Ans_3(DB_2)=chase_{\Sigma \cup \varDelta }^{DB_2}(DB_2,Q_3(DB_3))=Ans_2(DB_2). \end{array} $$The final answer \(Ans_3\) is again equal to \(Ans_1\).
Strategy (3)
Now, \(Peer_3\) propagates Q to \(Peer_2\) and waits for \(Q_2(DB_2)\), Fig. 3(3). Then decides about partial or total merge:
-
Partial merge:
$$ \begin{array}{l} Ans_4(DB_3)=chase_{\Sigma \cup \varDelta }^{DB_3}(Q_3(DB_3),Q_2(DB_2))\\ =\{PlaysIn3(A.Brady,X_1,V_1),Film3(X_1,The\ Pianist,R.Polanski)\}.\\ \end{array} $$Again, the final answer \(Ans_4\) is equal to \(Ans_1\).
-
Total merge:
$$ \begin{array}{l} Ans_5(DB_3)=Q_3(chase_{\Sigma \cup \varDelta }^{DB_3}(Q_2(DB_2),DB_3))=\\ =\{PlaysIn3(A.Brady,X_1,2002),Film3(X_1,The\ Pianist,R.Polanski)\}.\\ \end{array} $$Now, the final answer has additionally value 2002 of Year inferred in the integration process:
$$ Ans_5=(A. Brady,USA,2002,The\ Pianist,R.Polanski). $$
5 Strategy Choice Criterion
In the previous section, we discussed some ways (strategies) for propagation, and answering keyword queries in P2P systems. It turns out that both costs of the execution and information contents of the final result may be different. Here we focus on the way of merging (partial or total) that is to be decided in a peer. It follows from the considered example in Sect. 4.2 that:
-
1.
In the case of strategy (2), in the both ways of merging the results were the same, i.e., \(Ans_2=Ans_3\). Of course, applying of the total merge is much more costly than applying the partial merge. So, the total merge should be applied only if there is a chance to discover some missing values (i.e., to replace variables by constants).
-
2.
In strategy (3), applying of the total merge was profitable. We were able to discover (infer) a value of Year, i.e., that Brody played in “The Pianist” in year 2002. Thus, \(Ans_5\supset Ans_4\).
Now, we will discuss a necessary condition for deciding the total merge in a peer.
Definition 1
Let Ans be an answer to a query Q, \(R(r)\in Ans\), and \((A:v)\in r\), where \(v\in \mathsf {Var}_\mathsf{{V}}\) and A is an attribute of R. Then we call v a missing value of A in Ans.
For example, \(V_1\) is a missing value of Year in \(Ans_4(DB_3)\), however, \(X_1\) in this answer is not a missing value since \(X_1\) is in \(\mathsf {Var}_\mathsf{{Id}}\) – not in \(\mathsf {Var}_\mathsf{{V}}\).
Theorem 1
Let v be a missing value of A in Ans, where \((A:v)\in r\) and \(R(r)\in Ans\). The necessary condition for discovering the value of v in a DB with a schema \((\mathbf{R},\mathbf{IC})\):
-
there is \(A'\in att(R)\) such that \(A'\rightarrow A\in \mathbf{IC}\), and
-
there is \(q\in Q\) such that: \(R=R_q\), \(A_q\in att(R)\), \(contains(A_q,q)\) occurs in trc(Q), and \(A'\rightarrow A_q\notin \mathbf{IC}\).
Then we say that the missing value of v can be discovered by means of the constraint (functional dependency) \(A'\rightarrow A\).
For example, \(Brody\in Q\), and \(contains(Name,``Brody'')\) occurs in trc(Q). Moreover, \(FilmId\rightarrow Year\in \mathbf{IC}\) and \(FilmId\rightarrow Name\notin \mathbf{IC}\). Thus, we see that the promise of Theorem 1 is satisfied for discovering the missing value of \(V_1\) in \(Ans_4(DB_3)\) by means of the constraint \(FilmId\rightarrow Year\) (referring to \(DB_3\)).
Proof
(of Theorem 1) Assume that the promise of the theorem is satisfied. Let \((A:v)\in r\), where: \(R(r)\in Ans\), \(R\in \mathbf{R}\). Additionally, let us assume that \((A':c_1)\in r\) and there is a tuple \(r'\) of type att(R), such that:
-
\(R(r')\in DB\),
-
\((A':c_1)\in r'\),
-
\((A:c_2)\in r'\),
-
\((A_q:c_3)\in r'\),
-
\(contains(A_q:q)\) is false in \(r'\).
By the assumption \(A'\rightarrow A\in \mathbf{IC}\), so it follows from the corresponding EGD:
that \(v=c_2\). Moreover, since \(contains(A_q:q)\) is false in \(r'\), the above substitution cannot be inferred by means of the partial merge. \(\square \)
6 Conclusions
We discussed the problem of answering keyword queries issued against a P2P-connected relational databases. In such a system, there are many autonomous peers (agents), which collaborates in the process of producing answers to queries. The system is flexible and peers can enter and leave the system dynamically. A query posed to a peer is propagated to its partners along semantic paths defined by mappings, those partners propagate the query to their partners, etc. Answers to the query flow back in the opposite directions. The target peer merge the answers producing the expected answer. We focused on translating and propagating keyword queries, as well as on the way of merging partial answers by a peer. The proper decision significantly influence both the quality of the answer as well as efficiency of processing. The discussed strategy was inspired by the SixP2P system originally designed for integrating XML data [5, 12].
References
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading, Massachusetts (1995)
Agrawal, S., Chaudhuri, S., Das, G.: Dbxplorer: a system for keyword-based search over relational databases. ICDE 2002, 5–16 (2002)
Arenas, M., Barceló, P., Libkin, L., Murlak, F.: Relational and XML Data Exchange. Morgan & Claypool Publishers, Synthesis Lectures on Data Management (2010)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley Publishing Company (1999)
Brzykcy, G., Bartoszek, J., Pankowski, T.: Schema mappings and agents’ actions in P2P data integration system. J. Univers. Comput. Sci. 14(7), 1048–1060 (2008)
Chaudhuri, S., Das, G.: Keyword querying and ranking in databases. PVLDB 2(2), 1658–1659 (2009)
Chen, Y., Wang, W., Liu, Z., Lin, X.: Keyword search on structured and semi-structured data. ACM SIGMOD 2009, 1005–1010 (2009)
Dalvi, B.B., Kshirsagar, M., Sudarshan, S.: Keyword search on external memory data graphs. PVLDB 1(1), 1189–1204 (2008)
Fagin, R.: Equality-generating dependencies. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 1009–1010. Springer, US (2009)
Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theor. Comput. Sci 336(1), 89–124 (2005)
Hristidis, V., Papakonstantinou, Y.: DISCOVER: Keyword Search in Relational Databases. VLDB
Pankowski, T.: Query propagation in a P2P data integration system in the presence of schema constraints. In: Data Management in Grid and P2P Systems DEXA/Globe’08, LNCS 5187, pp. 46–57 (2008)
Yu, J.X., Qin, L., Chang, L.: Keyword search in relational databases: a survey. IEEE Data Eng. Bull. 33(1), 67–78 (2010)
Acknowledgments
This research has been supported by Polish Ministry of Science and Higher Education under grant 04/45/DSPB/0136.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Pankowski, T. (2015). Keyword Search in P2P Relational Databases. In: Jezic, G., Howlett, R., Jain, L. (eds) Agent and Multi-Agent Systems: Technologies and Applications. Smart Innovation, Systems and Technologies, vol 38. Springer, Cham. https://doi.org/10.1007/978-3-319-19728-9_27
Download citation
DOI: https://doi.org/10.1007/978-3-319-19728-9_27
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19727-2
Online ISBN: 978-3-319-19728-9
eBook Packages: EngineeringEngineering (R0)