Elsevier

Information Systems

Volume 36, Issue 2, April 2011, Pages 248-266

An effective 3-in-1 keyword search method over heterogeneous data sources

https://doi.org/10.1016/j.is.2008.08.001

Abstract

Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogeneous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.

Introduction

Keyword search is a proven and widely popular mechanism for querying document systems and the World Wide Web. Recently, it has also been extensively applied to extract useful and relevant information from the Internet. Furthermore, the database (DB) research community has recognized the benefits of keyword search and has been introducing keyword search capability into relational DBs [1], [2], [3], [4], [5], [6], [7], [8], XML DBs [9], [10], [11], [12], [13], [14], [15], [16] and graph DBs [17], [18], [19]. However, existing web search engines cannot integrate information from multiple interrelated pages to answer keyword queries meaningfully. Next-generation web search engines require link-awareness, or more generally, the capability of integrating correlated information items that are linked through hyperlinks. Meanwhile, the efficiency of keyword search on structured and semi-structured data remains a challenging problem. Traditional approaches employ the inverted index to process keyword queries, which is effective for unstructured data but inefficient for semi-structured and structured data: the inverted index is inadequate for identifying the “best” answers with complex structural information, which is abundant in XML documents and relational DBs.

To the best of our knowledge, very few existing studies can be universally applied to unstructured data (e.g., text documents), semi-structured data (e.g., XML documents), structured data (e.g., relational DBs) and graph data. Therefore, providing both effective and efficient search over such heterogeneous collections within a single search engine remains a major challenge. At present, the structure of the data, such as the potentially hierarchical embedding in XML documents, is not fully exploited for answering keyword queries, nor is it taken into account for result ranking in most search engines. Consequently, current implementations focus either on IR-style search, which ranks the results meaningfully but ignores the rich structural information, or on DB-style search, which discovers answers by identifying structural relationships but employs a very simple ranking mechanism.

This less-than-ideal situation calls for a framework for indexing and querying over large collections of unstructured, semi-structured or structured data, and adaptive ranking of the results retrieved over those heterogeneous data. In this paper, we propose EASE, an Efficient and Adaptive keyword SEarch method, as an attempt in that direction. Our work is in line with the current trend of seamlessly integrating DBs and information retrieval (IR) techniques [20], [21]. EASE seamlessly integrates efficient query evaluation and adaptive scoring for ranking results. From the DB point of view, EASE provides an efficient algorithmic basis for scalable top-k-style processing of large amounts of heterogeneous data for the discovery of rich structural relationships. It works by employing an adaptive, efficient and novel index beyond the inverted index. From the IR viewpoint, EASE integrates an effective ranking mechanism to improve search effectiveness.

In our approach, we model unstructured, semi-structured and structured data as graphs, with nodes being documents, elements and tuples, respectively, and edges being hyperlinks, parent–child relationships (or IDREFS) and primary–foreign-key relationships, respectively. We enable efficient keyword queries on these heterogeneous data by summarizing and clustering the graphs and constructing graph indices. To facilitate efficient keyword-based query processing, we examine the issues of indexing and ranking to improve search quality. To the best of our knowledge, this is the first attempt to efficiently and adaptively process keyword queries on such heterogeneous data, and also the first work to propose the novel graph index, which is efficient in identifying rich structural relationships.
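The graph model described above can be illustrated with a minimal sketch. It is not the EASE implementation: the class name DataGraph, the node identifiers, the sample texts and the helper nodes_containing are all hypothetical and chosen only to show how the three kinds of data could share one graph representation.

```python
from collections import defaultdict


class DataGraph:
    """Undirected graph; every node has a kind (document/element/tuple) and text."""

    def __init__(self):
        self.nodes = {}               # node id -> (kind, text)
        self.adj = defaultdict(set)   # node id -> set of neighbour ids
        self.edge_labels = {}         # frozenset({u, v}) -> relationship label

    def add_node(self, node_id, kind, text):
        self.nodes[node_id] = (kind, text)

    def add_edge(self, u, v, label):
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.edge_labels[frozenset((u, v))] = label


def nodes_containing(graph, keyword):
    """Return the sorted ids of nodes whose text contains the keyword as a token."""
    return sorted(
        node_id
        for node_id, (_kind, text) in graph.nodes.items()
        if keyword in text.lower().split()
    )


g = DataGraph()

# Unstructured data: documents are nodes, hyperlinks are edges.
g.add_node("page:1", "document", "keyword search tutorial")
g.add_node("page:2", "document", "graph index overview")
g.add_edge("page:1", "page:2", "hyperlink")

# Semi-structured data: XML elements are nodes, parent-child links are edges.
g.add_node("xml:paper", "element", "paper")
g.add_node("xml:paper/title", "element", "adaptive keyword search")
g.add_edge("xml:paper", "xml:paper/title", "parent-child")

# Structured data: tuples are nodes, primary-foreign-key references are edges.
g.add_node("author:7", "tuple", "alice")
g.add_node("writes:7:3", "tuple", "")
g.add_edge("author:7", "writes:7:3", "primary-foreign-key")

print(nodes_containing(g, "keyword"))
```

Once all three sources are expressed in this form, a single traversal or indexing routine can operate on them without knowing which data model a node came from.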

Our contributions in this paper are as follows:

  • We model unstructured, semi-structured and structured data as graphs and propose an efficient keyword search method, EASE, to adaptively process keyword queries over the heterogeneous data. We devise an effective graph index as opposed to the inverted index, to improve search efficiency and effectiveness.

  • We propose a partition-based method to maintain the graph index so as to reduce the graph-index size.

  • We present a novel ranking mechanism for effective keyword search by taking into account both the structural compactness of answers from the DB viewpoint and the textual relevancy from the IR point of view.

  • We examine the issues of indexing and ranking, and devise a simple and yet efficient indexing mechanism to index the structural relationships between the transformed data. The index is amenable to the deployment of existing top-k ranking methods.

  • We have conducted an extensive performance study using real datasets and various queries with different characteristics. The results show that EASE achieves both high search efficiency and accuracy, and outperforms existing state-of-the-art methods.

The rest of this paper is organized as follows. We present the r-radius Steiner graph problem in Section 2. Section 3 introduces a novel graph index. We present a novel scoring function in Section 4. We examine the issues of indexing and ranking, and propose an indexing mechanism in Section 5. Extensive experimental evaluations are provided in Section 6. We review the related work in Section 7 and conclude the paper in Section 8.

Section snippets

Unstructured data

Although many prior studies of keyword search over text documents (e.g., HTML documents) have been proposed, they all produce a list of individual pages as results. In the event that there are no pages that contain all the keywords, they will return pages with some of the input keywords ranked by relevancy. Even if two or more interrelated pages contain all the keywords, the existing methods cannot integrate the pages into one relevant and meaningful answer. For example, to search for

EASE: an effective and adaptive search method

The efficiency and advantages of using inverted indices for facilitating the computation of the “best” answers for online keyword queries are well recognized. However, the inverted indices are not effective for discovering the much richer structural relationships existing in DBs with complicated structures [17]. It is therefore important to be able to efficiently and effectively discover these structural relationships, and index them for fast and accurate response. Intuitively, a

Ranking functions

In this section, we first discuss how to meaningfully rank r-radius Steiner graphs and identify the top-k answers based on the existing proposals. Next, we propose a new measure based on the structural compactness between content nodes and the structural relevancy between input keywords with respect to an r-radius Steiner graph.

Indexing

To efficiently identify the top-k answers with the highest scores, we examine the issues of indexing in this section.

Given any two keywords k_i and k_j in the graph, and an r-radius graph SG, the scores SCORE_IR(k_i, SG) and SCORE_IR(k_j, SG) in Eq. (4) and SIM(k_i, k_j | SG) in Eq. (10) share the key feature that they can be pre-computed and materialized off-line. Based on this observation, we can materialize SCORE(k_i, k_j | SG). We devise an extended inverted index (EI-Index) to maintain such scores.
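The idea of materializing pairwise scores off-line can be sketched as follows. This is an illustration under stated assumptions, not the EI-Index as defined in the paper: Eqs. (4) and (10) are not reproduced in this excerpt, so the combination in pair_score, the summation over keyword pairs in top_k, and the names EIIndex, add and top_k are all hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def pair_score(score_ir_i, score_ir_j, sim_ij):
    # ASSUMPTION: an illustrative combination of the two IR scores and the
    # similarity term; the paper's actual formula is not shown in this excerpt.
    return sim_ij * (score_ir_i + score_ir_j)


class EIIndex:
    """Maps an unordered keyword pair to pre-computed scores per graph."""

    def __init__(self):
        self.postings = defaultdict(dict)  # (ki, kj) -> {graph id: score}

    def add(self, ki, kj, graph_id, score):
        # Off-line step: store the materialized score under a canonical pair key.
        self.postings[tuple(sorted((ki, kj)))][graph_id] = score

    def top_k(self, keywords, k):
        # On-line step: only look up stored scores, never recompute them.
        pairs = list(combinations(sorted(set(keywords)), 2))
        if not pairs:
            return []
        # Keep graphs that have a stored score for every keyword pair.
        candidates = set(self.postings.get(pairs[0], {}))
        for pair in pairs[1:]:
            candidates &= set(self.postings.get(pair, {}))
        # ASSUMPTION: a graph's total score is the sum over all keyword pairs.
        totals = {
            gid: sum(self.postings[pair][gid] for pair in pairs)
            for gid in candidates
        }
        # Highest score first; ties broken by graph id for determinism.
        return heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))


index = EIIndex()
index.add("keyword", "graph", "SG1", pair_score(0.5, 0.5, 0.5))
index.add("keyword", "graph", "SG2", pair_score(1.0, 1.0, 0.5))

print(index.top_k(["keyword", "graph"], 2))
```

The point of the design is that query time is spent on lookups and a small top-k selection, while the cost of computing each score is paid once when the index is built.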

Experimental study

We have designed and performed a comprehensive set of experiments to evaluate the search performance of EASE. We employed the datasets of DBLife, DBLP and IMDB to evaluate EASE on unstructured, semi-structured and structured data, respectively. There were about 10,000 pages in the

Related work

The first area of research related to our work is keyword search over relational DBs by identifying Steiner trees. As opposed to the traditional Steiner tree-based methods, which identify the structural relationships online, EASE identifies and materializes the rather rich structural relationships so as to improve the online processing of keyword queries.

DBXplorer [1], DISCOVER-I [5], DISCOVER-II [4], BANKS-I [2] and BANKS-II [19] are systems built on top of relational DBs. DISCOVER and

Conclusion

In this paper, we have proposed an efficient and adaptive keyword search method, EASE, to answer keyword queries over unstructured, semi-structured and structured data. EASE seamlessly integrates the efficient query evaluation of DB and the adaptive scoring models of IR for the ranking of results. EASE models heterogeneous data as graphs and processes keyword queries on the graphs. To the best of our knowledge, this is the first attempt to efficiently and adaptively process keyword queries on

Acknowledgments

The research of B.C. Ooi was in part funded by NUS Grant R-252-000-338-112. This work was also in part supported by the National Natural Science Foundation of China under Grant No. 60573094, the National High Technology Development 863 Program of China under Grant No. 2007AA01Z152, the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303103, and the 2008 HP Labs Innovation Research Program.

References (40)

  • S. Agrawal, S. Chaudhuri, G. Das, Dbxplorer: a system for keyword-based search over relational databases, in: ICDE,...
  • G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan, Keyword searching and browsing in databases using...
  • B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, X. Lin, Finding top-k min-cost connected trees in databases, in: ICDE,...
  • V. Hristidis, L. Gravano, Y. Papakonstantinou, Efficient IR-style keyword search over relational databases, in: VLDB,...
  • V. Hristidis, Y. Papakonstantinou, Discover: keyword search in relational databases, in: VLDB,...
  • F. Liu, C. Yu, W. Meng, A. Chowdhury, Effective keyword search in relational databases, in: SIGMOD,...
  • Y. Luo, X. Lin, W. Wang, X. Zhou, Spark: top-k keyword query in relational databases, in: SIGMOD,...
  • A. Markowetz, Y. Yang, D. Papadias, Keyword search on relational data streams, in: SIGMOD,...
  • S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv, Xsearch: a semantic search engine for XML, in: VLDB,...
  • L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, Xrank: ranked keyword search over XML documents, in: SIGMOD, 2003, pp....
  • V. Hristidis, N. Koudas, Y. Papakonstantinou, D. Srivastava, Keyword proximity search in XML trees, in: IEEE TKDE, vol....
  • V. Hristidis, Y. Papakonstantinou, A. Balmin, Keyword proximity search on XML graphs, in: ICDE, 2003, pp....
  • G. Li, J. Feng, J. Wang, L. Zhou, Efficient keyword search for valuable LCAs over XML documents, in: CIKM,...
  • Z. Liu, Y. Chen, Identifying return information for XML keyword search, in: SIGMOD,...
  • C. Sun, C.Y. Chan, A.K. Goenka, Multiway SLCA-based keyword search in XML data, in: WWW,...
  • Y. Xu, Y. Papakonstantinou, Efficient keyword search for smallest LCAs in XML databases, in: SIGMOD, 2005, pp....
  • L. Guo, J. Shanmugasundaram, G. Yona, Topology search over biological databases, in: ICDE,...
  • H. He, H. Wang, J. Yang, P. Yu, Blinks: ranked keyword searches on graphs, in: SIGMOD,...
  • V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, H. Karambelkar, Bidirectional expansion for keyword...
  • S. Chaudhuri, R. Ramakrishnan, G. Weikum, Integrating DB and IR technologies: What is the sound of one hand clapping?...