An effective 3-in-1 keyword search method over heterogeneous data sources
Introduction
Keyword search is a proven and widely used mechanism for querying document systems and the World Wide Web, and it has been applied extensively to extract useful and relevant information from the Internet. The database (DB) research community has also recognized its benefits and has been introducing keyword search capability into relational DBs [1], [2], [3], [4], [5], [6], [7], [8], XML DBs [9], [10], [11], [12], [13], [14], [15], [16] and graph DBs [17], [18], [19]. However, existing web search engines cannot integrate information from multiple interrelated pages to answer keyword queries meaningfully. Next-generation web search engines require link-awareness or, more generally, the capability of integrating correlated information items that are connected through hyperlinks. Meanwhile, the efficiency of keyword search on structured and semi-structured data remains a challenging problem. Traditional approaches process keyword queries with the inverted index, which is effective for unstructured data but inefficient for semi-structured and structured data: the inverted index is inadequate for identifying the “best” answers when they involve complex structural information, which is abundant in XML documents and relational DBs.
To the best of our knowledge, very few existing studies can be applied uniformly to unstructured data (e.g., text documents), semi-structured data (e.g., XML documents), structured data (e.g., relational DBs) and graph data. Therefore, providing both effective and efficient search over such heterogeneous collections within a single search engine remains a major challenge. As it is, the structure of the data, such as the potentially hierarchical embedding in XML documents, is not fully exploited for answering keyword queries, nor is it taken into account for result ranking in most search engines. Consequently, current implementations focus either on IR-style search, which ranks results meaningfully but ignores the rich structural information, or on DB-style search, which discovers answers by identifying structural relationships but employs a very straightforward ranking mechanism.
This less-than-ideal situation calls for a framework for indexing and querying over large collections of unstructured, semi-structured or structured data, and adaptive ranking of the results retrieved over those heterogeneous data. In this paper, we propose EASE, an Efficient and Adaptive keyword SEarch method, as an attempt in that direction. Our work is in line with the current trend of seamlessly integrating DBs and information retrieval (IR) techniques [20], [21]. EASE seamlessly integrates efficient query evaluation and adaptive scoring for ranking results. From the DB point of view, EASE provides an efficient algorithmic basis for scalable top-k-style processing of large amounts of heterogeneous data for the discovery of rich structural relationships. It works by employing an adaptive, efficient and novel index beyond the inverted index. From the IR viewpoint, EASE integrates an effective ranking mechanism to improve search effectiveness.
In our approach, we model unstructured, semi-structured and structured data as graphs, with nodes being documents, elements and tuples, respectively, and edges being hyperlinks, parent–child relationships (or IDREFS) and primary–foreign-key relationships, respectively. We enable efficient keyword queries on these heterogeneous data by summarizing and clustering the graphs and constructing graph indices over them. To facilitate efficient keyword-based query processing, we examine the issues of indexing and ranking to improve search quality. To the best of our knowledge, this is the first attempt to efficiently and adaptively process keyword queries on such heterogeneous data, and also the first work to propose a graph index that is efficient in identifying rich structural relationships.
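The graph model described above can be illustrated with a minimal sketch. The class `DataGraph`, its method names, the node identifiers and the choice of undirected edges are all assumptions made for illustration; they are not taken from EASE itself.

```python
from collections import defaultdict


class DataGraph:
    """Toy graph over heterogeneous data (hypothetical helper, not EASE's API)."""

    def __init__(self):
        self.kind = {}                # node id -> "document" | "element" | "tuple"
        self.text = {}                # node id -> textual content of the node
        self.adj = defaultdict(set)   # node id -> neighbouring node ids

    def add_node(self, node_id, kind, text):
        self.kind[node_id] = kind
        self.text[node_id] = text
        self.adj[node_id]             # create an empty adjacency entry

    def add_edge(self, u, v):
        # Assumption: hyperlinks, parent-child and key relationships are all
        # treated as undirected edges in this sketch.
        self.adj[u].add(v)
        self.adj[v].add(u)

    def content_nodes(self, keyword):
        """Return the sorted ids of nodes whose text contains the keyword."""
        kw = keyword.lower()
        return sorted(n for n, t in self.text.items() if kw in t.lower().split())


g = DataGraph()
# Unstructured data: documents connected by a hyperlink.
g.add_node("page1", "document", "keyword search on the web")
g.add_node("page2", "document", "graph index tutorial")
g.add_edge("page1", "page2")
# Semi-structured data: XML elements connected by a parent-child edge.
g.add_node("paper", "element", "paper")
g.add_node("title", "element", "adaptive keyword search")
g.add_edge("paper", "title")
# Structured data: tuples connected by a primary-foreign-key edge.
g.add_node("author:1", "tuple", "Alice")
g.add_node("writes:7", "tuple", "author 1 wrote the graph paper")
g.add_edge("author:1", "writes:7")

print(g.content_nodes("keyword"))
print(g.content_nodes("graph"))
```

Once all three kinds of data share one node-and-edge representation, the same keyword lookup and the same traversal code apply to each of them, which is the property the unified model relies on.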
Our contributions in this paper are as follows:
- We model unstructured, semi-structured and structured data as graphs and propose an efficient keyword search method, EASE, to adaptively process keyword queries over the heterogeneous data. We devise an effective graph index as opposed to the inverted index, to improve search efficiency and effectiveness.
- We propose a partition-based method to maintain the graph index so as to reduce the graph-index size.
- We present a novel ranking mechanism for effective keyword search by taking into account both the structural compactness of answers from the DB viewpoint and the textual relevancy from the IR point of view.
- We examine the issues of indexing and ranking, and devise a simple and yet efficient indexing mechanism to index the structural relationships between the transformed data. The index is amenable to the deployment of existing top-k ranking methods.
- We have conducted an extensive performance study using real datasets and various queries with different characteristics. The results show that EASE achieves both high search efficiency and accuracy, and outperforms existing state-of-the-art methods.
Unstructured data
Although many prior studies of keyword search over text documents (e.g., HTML documents) have been proposed, they all produce a list of individual pages as results. In the event that there are no pages that contain all the keywords, they will return pages with some of the input keywords ranked by relevancy. Even if two or more interrelated pages contain all the keywords, the existing methods cannot integrate the pages into one relevant and meaningful answer. For example, to search for
EASE: an effective and adaptive search method
The efficiency and advantages of using inverted indices for facilitating the computation of the “best” answers for online keyword queries are well recognized. However, the inverted indices are not effective for discovering the much richer structural relationships existing in DBs with complicated structures [17]. It is therefore important to be able to efficiently and effectively discover these structural relationships, and index them for fast and accurate response. Intuitively, a
Ranking functions
In this section, we first discuss how to meaningfully rank r-radius Steiner graphs and identify the top-k answers based on the existing proposals. Next, we propose a new measure based on the structural compactness between content nodes and the structural relevancy between input keywords with respect to an r-radius Steiner graph.
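The snippet above does not define the term "r-radius". As a rough sketch, assuming it follows the standard graph-theoretic notion (the radius of a connected graph is the smallest eccentricity over its nodes, where the eccentricity of a node is its largest shortest-path distance to any other node), the radius of a small unweighted graph can be computed with breadth-first search. The function names and the example graph are hypothetical.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest BFS distance from source to any node; inf if some node is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        return float("inf")
    return max(dist.values())


def radius(adj):
    """Smallest eccentricity over all nodes of the graph."""
    return min(eccentricity(adj, v) for v in adj)


# A path a - b - c - d - e given as an undirected adjacency map.
path = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

print(radius(path))             # 2, attained at the middle node "c"
print(eccentricity(path, "a"))  # 4, the distance from "a" to "e"
```

Under this assumption, bounding the radius bounds how far any node of a candidate answer can lie from some central node, which is one way to read the structural compactness that the ranking takes into account.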
Indexing
To efficiently identify the top-k answers with the highest scores, we examine the issues of indexing in this section.
Given any two keywords in the graph and an r-radius graph, the scores defined in Eq. (4) and in Eq. (10) share the key feature that they can be pre-computed and materialized off-line. Based on this observation, we can materialize these scores. We devise an extended inverted index (EI-Index) to maintain such scores.
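The idea of materializing scores off-line can be sketched as follows. This is an illustrative assumption about what such an index might look like, not the EI-Index as defined in the paper: the layout (keyword pair mapped to a score-ordered list of graph ids), the function names and the score values are all invented for the example, and the scores merely stand in for the values of Eq. (4) and Eq. (10).

```python
from collections import defaultdict


def build_pair_index(graph_scores):
    """Build {sorted keyword pair: [(score, graph_id), ...]} ordered by descending score.

    graph_scores maps a graph id to {(kw_a, kw_b): precomputed score}.
    """
    index = defaultdict(list)
    for graph_id, pair_scores in graph_scores.items():
        for (kw_a, kw_b), score in pair_scores.items():
            key = tuple(sorted((kw_a, kw_b)))  # the pair is treated as unordered
            index[key].append((score, graph_id))
    for postings in index.values():
        postings.sort(reverse=True)            # best-scoring graphs first
    return dict(index)


def top_k(index, kw_a, kw_b, k):
    """Look up the k best graphs for a keyword pair without rescoring at query time."""
    key = tuple(sorted((kw_a, kw_b)))
    return [(graph_id, score) for score, graph_id in index.get(key, [])[:k]]


# Placeholder scores, assumed to have been computed off-line.
scores = {
    "g1": {("keyword", "xml"): 0.4, ("keyword", "search"): 0.7},
    "g2": {("xml", "keyword"): 0.9},
}
index = build_pair_index(scores)

print(top_k(index, "xml", "keyword", 2))
print(top_k(index, "search", "keyword", 5))
```

Because each posting list is sorted once at build time, a query reads only a prefix of the list, which is what makes a precomputed index of this kind compatible with top-k processing.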
Experimental study
We have designed and performed a comprehensive set of experiments to evaluate the search performance of EASE. We employed the datasets of DBLife, DBLP and IMDB to evaluate EASE on unstructured, semi-structured and structured data, respectively. There were about 10,000 pages in the
Related work
The first area of research related to our work is keyword search over relational DBs by identifying Steiner trees. As opposed to the traditional Steiner tree-based methods, which identify the structural relationships online, EASE identifies and materializes the rather rich structural relationships so as to improve the online processing of keyword queries.
DBXplorer [1], DISCOVER-I [5], DISCOVER-II [4], BANKS-I [2] and BANKS-II [19] are systems built on top of relational DBs. DISCOVER and
Conclusion
In this paper, we have proposed an efficient and adaptive keyword search method, EASE, to answer keyword queries over unstructured, semi-structured and structured data. EASE seamlessly integrates the efficient query evaluation of DB and the adaptive scoring models of IR for the ranking of results. EASE models heterogeneous data as graphs and processes keyword queries on the graphs. To the best of our knowledge, this is the first attempt to efficiently and adaptively process keyword queries on
Acknowledgments
The research of B.C. Ooi was funded in part by NUS Grant R-252-000-338-112. This work was also supported in part by the National Natural Science Foundation of China under Grant No. 60573094, the National High Technology Development 863 Program of China under Grant No. 2007AA01Z152, the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303103, and the 2008 HP Labs Innovation Research Program.
References (40)
- S. Agrawal, S. Chaudhuri, G. Das, DBXplorer: a system for keyword-based search over relational databases, in: ICDE,...
- G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan, Keyword searching and browsing in databases using...
- B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, X. Lin, Finding top-k min-cost connected trees in databases, in: ICDE,...
- V. Hristidis, L. Gravano, Y. Papakonstantinou, Efficient IR-style keyword search over relational databases, in: VLDB,...
- V. Hristidis, Y. Papakonstantinou, DISCOVER: keyword search in relational databases, in: VLDB,...
- F. Liu, C. Yu, W. Meng, A. Chowdhury, Effective keyword search in relational databases, in: SIGMOD,...
- Y. Luo, X. Lin, W. Wang, X. Zhou, SPARK: top-k keyword query in relational databases, in: SIGMOD,...
- A. Markowetz, Y. Yang, D. Papadias, Keyword search on relational data streams, in: SIGMOD,...
- S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv, XSEarch: a semantic search engine for XML, in: VLDB,...
- L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, XRANK: ranked keyword search over XML documents, in: SIGMOD, 2003, pp....