Efficient processing of keyword queries over graph databases for finding effective answers

https://doi.org/10.1016/j.ipm.2014.08.002

Highlights

  • We define a new measure of relevance of a node in the graph to a keyword query.

  • We propose an extended answer structure for a top-k query over graph databases.

  • We propose an inverted list index and search algorithm to find top-k answer trees.

  • We enhance the basic method for more efficient and scalable query processing.

  • Experiments show that the proposed method can find effective top-k answers efficiently.

Abstract

In this paper, we study effective and efficient processing of keyword-based queries over graph databases. To produce more relevant answers to a query than previous approaches, we suggest a new answer tree structure that places no constraint on the number of keyword nodes chosen for each keyword in the query. For efficient search of answer trees on large graph databases, we design an inverted list index that pre-computes and stores the connectivity and relevance information of nodes to keyword terms in the graph. We propose a query processing algorithm that aggregates the best keyword nodes and root nodes from the pre-constructed inverted lists to find the top-k answer trees most relevant to the given query. We also enhance the method by extending the structure of the inverted list and adopting a relevance lookup table, which enables more accurate estimation of the relevance scores of candidate root nodes and a more efficient search for top-k answer trees. Performance evaluation by experiments with real graph datasets shows that the proposed method finds more effective top-k answers than previous approaches and provides acceptable and scalable execution performance for various types of keyword queries on large graph databases.

Introduction

Graph-structured data are now widely used in many applications such as XML, bio-informatics, semantic web, ontologies, and social networking services. Keyword-based querying over graph-structured databases has attracted much attention because it lets users express their information need with only a set of keyword terms, without using a query language or understanding the underlying database schema (Chen et al., 2009, Dalvi et al., 2008, Golenberg et al., 2008, He et al., 2007, Kacholia et al., 2005, Kargar and An, 2011, Kim et al., 2011, Li et al., 2008, Park, 2013, Tran et al., 2009). Keyword-based query processing has also been studied extensively in the relational database literature, where relational data can be modeled as a directed graph of tuples based on the foreign-key relationships among tuples (Agrawal et al., 2002, Baid et al., 2010, Balmin et al., 2004, Bergamaschi et al., 2011, Bhalotia et al., 2002, Ding et al., 2007, Hristidis et al., 2003, Hristidis and Papakonstantinou, 2002, Li et al., 2009, Liu et al., 2006, Luo et al., 2007, Qin, Yu, Chang, 2009, Qin, Yu, Chang, Tao, 2009).

Keyword-based search on a graph-structured database usually returns a set of connected structures derived from the database, which show how the data containing the query keywords are interconnected. In most approaches, a sub-tree of the graph is used to describe an answer to a given query. Since a large graph database can contain a very large number of answer structures, a relevance scoring function is often used to rank the candidate answers and return the top-k answers most relevant to the query.

The problem of answering keyword-based queries over graph-structured databases is described as follows. Let G = (V, E) be a directed graph representing a graph-structured database, where each node is labeled with some text. The nodes and edges in G may have weights on them. Given a keyword query Q over G consisting of a set of keywords, denoted by Q = {k1, k2, …, kl}, an answer to Q is defined as a sub-tree T of G satisfying the following properties: there exists a set of nodes in T called keyword nodes, where each node contains at least one keyword in Q, and the leaf nodes of T only come from those keyword nodes. Given a relevance scoring function rel(T), which maps an answer sub-tree T to a numeric score value measuring goodness of T or relevance to Q, top-k processing of Q should find k best answers with the highest values of rel(T).

To evaluate and rank the answer sub-trees, various scoring functions have been proposed in the literature based on different semantics, which will be described in Section 2. In this paper, we adopt distinct root-based semantics, where the weight of a sub-tree is computed as the sum of the shortest distance from the root to each keyword node and at most one sub-tree rooted at each node is considered an answer to the query (Dalvi et al., 2008, He et al., 2007, Kacholia et al., 2005). This approach can deal with top-k query processing over very large graph databases more efficiently than the other approaches based on Steiner tree-based semantics. It also enables effective indexing on the graph (He et al., 2007).
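
The distinct root-based scoring described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes unweighted edges (distance is a hop count), a non-empty keyword set, and the earlier answer structure in which each root is paired with only its nearest keyword node per keyword. The function names and the toy graph in the usage note are hypothetical.

```python
from collections import deque


def reverse_graph(edges):
    """Map each node to the list of nodes that have an edge pointing to it."""
    rev = {}
    for u, v in edges:
        rev.setdefault(v, []).append(u)
        rev.setdefault(u, [])
    return rev


def dist_to_nearest(rev, sources):
    """Multi-source BFS over reversed edges.

    Returns, for every node that can reach a source by following forward
    edges, the hop count to its nearest source (assumption: unit weights).
    """
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for u in rev.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def distinct_root_answers(edges, keyword_nodes, k):
    """Return the k roots with the smallest summed distance to all keywords.

    keyword_nodes maps each query keyword to the set of nodes containing it.
    Each root appears at most once, which is the distinct root constraint.
    """
    rev = reverse_graph(edges)
    per_kw = [dist_to_nearest(rev, nodes) for nodes in keyword_nodes.values()]
    # A candidate root must be able to reach at least one node per keyword.
    roots = set.intersection(*(set(d) for d in per_kw))
    scored = sorted((sum(d[r] for d in per_kw), r) for r in roots)
    return scored[:k]
```

For a hypothetical graph with edges A→B, B→K, B→S, A→C, C→U, where K contains volcano and S and U contain ocean, `distinct_root_answers` ranks root B (weight 2) ahead of root A (weight 4). Smaller weight means a better answer under this simplified scoring.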

For example, suppose that a keyword query Q = {volcano, ocean} is given on the graph-structured database G in Fig. 1(a). As indicated in the figure, nodes K, L, M, O, and R are keyword nodes containing the keyword volcano, and nodes S and U are keyword nodes for ocean. Fig. 1(b) shows some possible answer trees rooted at node B, C, or D. Under the distinct root-based semantics, only one of the sub-trees TB1, TB2, and TB, which are all rooted at node B, can be returned as an answer to Q. Note that previous approaches consider as candidate answer trees only those sub-trees that include exactly one keyword node for each query keyword, such as TB1, TB2, TC1, and TD1. We, however, also treat sub-trees that have more than one keyword node for a query keyword, such as TB and TC, as possible answers to Q.

The main contributions of our work are as follows:

  • To produce more effective and relevant search results for a given query, we propose an extended structure of answer trees together with a new relevance metric and ranking mechanism for them. Unlike existing approaches, the proposed answer structure is not required to include exactly one keyword node for each keyword in the query. That is, an answer tree may cover only a subset of the query keywords and may have more than one node containing the same keyword. Based on the new relevance measure, more extended and relevant answers can be generated.

  • To find top-k answers in the proposed structure efficiently, we design an inverted list-style index over the keywords and nodes in the graph, which stores the connectivity and relevance of a node to each keyword term. We then present a basic query processing algorithm that uses the pre-constructed inverted lists to aggregate the most relevant keyword nodes for each candidate answer tree with a distinct root and to find the top-k answer trees most relevant to the given query.

  • To improve the efficiency of the basic approach, we extend the above inverted list index so that each entry also stores relevance information about another related entry in the same list. We also introduce a relevance lookup table, which pre-computes and stores the largest relevance value of each node to each keyword term in the graph. We then present an enhanced search algorithm based on the extended inverted list and the relevance lookup table. It estimates worst and best relevance scores for a node that are closer to its actual score, and thus finds top-k answer trees rooted at different nodes more efficiently than the basic approach.
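
As a rough illustration of the inverted list index mentioned in the contributions, the sketch below builds, for each keyword term, a list of entries sorted by descending relevance. Several details are assumptions made only for this example and are not taken from the paper: the entry layout `(relevance, node, keyword_node, distance)`, the hop limit `max_hops`, and the decay formula `weight / (1 + distance)`. A node may have several entries in one list, one per reachable keyword node, which mirrors the idea that an answer may use more than one keyword node per keyword.

```python
from collections import deque


def build_inverted_lists(edges, node_terms, max_hops=2):
    """Build term -> [(relevance, node, keyword_node, distance), ...].

    node_terms maps a keyword node to {term: weight}, where weight stands in
    for a tf-idf style score. Relevance decays with hop distance using a
    hypothetical formula: weight / (1 + distance).
    """
    # Reverse adjacency: for each node, the nodes with an edge into it.
    rev = {}
    for u, v in edges:
        rev.setdefault(v, []).append(u)

    lists = {}
    for kw_node, terms in node_terms.items():
        # Backward BFS finds every node that reaches kw_node within max_hops.
        dist = {kw_node: 0}
        queue = deque([kw_node])
        while queue:
            v = queue.popleft()
            if dist[v] == max_hops:
                continue
            for u in rev.get(v, []):
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        for term, weight in terms.items():
            for node, d in dist.items():
                entry = (weight / (1 + d), node, kw_node, d)
                lists.setdefault(term, []).append(entry)

    # Sort each list by descending relevance so it supports sorted access.
    for entries in lists.values():
        entries.sort(key=lambda e: (-e[0], e[1], e[2]))
    return lists
```

Keeping each list sorted by relevance is what allows a top-k search to read only a prefix of each list and stop early.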

The rest of the paper is organized as follows. Section 2 presents related work and motivation of our study. Section 3 defines a new answer structure and relevance measure for keyword queries over graph databases. In Section 4, we propose an inverted list index for keywords and nodes in the graph and a top-k query processing algorithm using the index. In Section 5, we improve the proposed indexing scheme and present a more efficient search method. We provide experimental results on the effectiveness and performance of the proposed method in Section 6 and draw a conclusion in Section 7.

Section snippets

Related work and motivation

There has been much work on keyword search over relational databases (Agrawal et al., 2002, Baid et al., 2010, Balmin et al., 2004, Bergamaschi et al., 2011, Bhalotia et al., 2002, Ding et al., 2007, Hristidis and Papakonstantinou, 2002, Hristidis et al., 2003, Li et al., 2009, Liu et al., 2006, Luo et al., 2007, Qin, Yu, Chang, 2009, Qin, Yu, Chang, Tao, 2009). Many approaches, however, use underlying schema information to generate candidate expressions on the schema graph and then translate

Answer trees and relevance measure

In this section, we propose a structure of answer trees and a relevance measure for them. Given a data graph G = (V, E), let K be the set of keyword terms extracted from the nodes in V(G). We first define relevance of a node in V(G) to a keyword term in K contained in a specific node in the graph. When a node contains a keyword term, the relevance of the node to the keyword is computed based on the tf-idf weighting scheme (Buttcher, Clarke, & Cormack, 2010) which is popularly used in information

Basic strategy

In this section, we present an indexing scheme and query processing algorithm to find k best answers to a given keyword query based on the answer structure and relevance measure defined in the previous section.
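
The general idea of finding k best answers from relevance-sorted inverted lists can be sketched with a standard no-random-access aggregation in the style of Fagin et al. (2003). This is not the paper's algorithm: it assumes at most one entry per node in each list, non-negative scores, a plain sum as the aggregate, and a non-empty set of lists. The function name `nra_top_k` is hypothetical.

```python
def nra_top_k(lists, k):
    """Top-k by summed score using sorted access only.

    lists maps each keyword to [(score, node), ...] sorted by descending
    score. Returns [(node, worst_score), ...]; the reported score is a lower
    bound and may be below the true sum if scanning stopped early.
    """
    terms = list(lists)
    seen = {}  # node -> {term: score read so far}
    depth = max(len(lists[t]) for t in terms)

    def worst_scores():
        # Lower bound: sum of the scores actually read for each node.
        return {n: sum(s.values()) for n, s in seen.items()}

    for pos in range(depth):
        cur = {}  # upper bound on any entry not yet read from each list
        for t in terms:
            entries = lists[t]
            if pos < len(entries):
                score, node = entries[pos]
                seen.setdefault(node, {})[t] = score
                cur[t] = score
            else:
                cur[t] = 0.0  # exhausted list: nothing left to read
        worst = worst_scores()
        # Upper bound: assume each missing score equals the list's current one.
        best = {
            n: worst[n] + sum(cur[t] for t in terms if t not in s)
            for n, s in seen.items()
        }
        ranked = sorted(seen, key=lambda n: (-worst[n], n))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = worst[top[-1]]
            unseen_best = sum(cur.values())  # bound for never-seen nodes
            if kth >= unseen_best and all(kth >= best[n] for n in rest):
                break

    worst = worst_scores()
    ranked = sorted(seen, key=lambda n: (-worst[n], n))
    return [(n, worst[n]) for n in ranked[:k]]
```

The scan stops as soon as the k-th worst score is at least the best possible score of every other candidate, so the tails of the lists are never read. Tighter best-score estimates allow the stop condition to fire earlier.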

Enhanced approach

In the basic method described in Section 4, the worst score and best score of each node n are estimated assuming that all the unknown relevances of the entries of n unseen from the inverted lists are equal to the largest of the relevances of the entries at the current scan positions of the lists, i.e. maxCurScore. This strategy, however, is too conservative since the actual relevance of an entry of n unseen from a list L(ki) can be much smaller than curScorei, the relevance of the entry at the

Performance evaluation

In this section, we evaluate the effectiveness and efficiency of the proposed approach, including the basic method (BM) and the enhanced method (EM-RL), through experiments on real datasets. We compare their performance with the BLINKS method (He et al., 2007), which adopts distinct root semantics and an inverted list-style index similar to our approach.

In the implementation of the proposed methods in Java, we use a hash table to maintain data associated with each node in the graph whose entries have been read

Conclusion

In this paper, we propose a new ranked keyword search method for graph databases. To find more effective top-k answers to a given query, we define a new measure of relevance of a node to a keyword query and suggest an extended and flexible answer structure which may have multiple keyword nodes for a keyword in the query. For efficient top-k query processing based on the proposed answer structure and relevance measure, we design an inverted list index which stores reachability and relevance

References (31)

  • R. Fagin et al. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences (2003)
  • Agrawal, S., Chaudhuri, S., & Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases....
  • A. Baid et al. Toward scalable keyword search over relational data. Proceedings of the VLDB Endowment (2010)
  • Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. In...
  • Bergamaschi, S., Domnori, E., Guerra, F., Lado, R. T., & Velegrakis, Y. (2011). Keyword search over relational...
  • Best, H., Majumdar, D., Schenkel, R., Theobald, M., & Weikum, G. (2006). IO-Top-k: Index-access optimized top-k query...
  • Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S. (2002). Keyword searching and browsing in...
  • Bruno, N., Gravano, L., & Marian, A. (2002). Evaluating top-k queries over web-accessible databases. In Proc. of IEEE...
  • S. Buttcher et al. Information retrieval: Implementing and evaluating search engines (2010)
  • Chen, Y., Wang, W., Liu, Z., & Lin, X. (2009). Keyword search on structured and semi-structured data. In Proc. of 2009...
  • B.B. Dalvi et al. Keyword search on external memory data graphs. The Proceedings of the VLDB Endowment (2008)
  • Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., & Lin, X. (2007). Finding top-k min-cost connected trees in...
  • Güntzer, U., Balke, W.-T., & Kießling, W. (2001). Towards efficient multi-feature queries in heterogeneous...
  • Golenberg, K., Kimelfeld, B., & Sagiv, Y. (2008). Keyword proximity search in complex data graphs. In Proc. of 2008 ACM...
  • He, H., Wang, H., Yang, J., & Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In Proc. of 2007 ACM SIGMOD...