Lightweight integration of IR and DB for scalable hybrid search with integrated ranking support

doi:10.1016/j.websem.2011.08.002

Journal of Web Semantics

Volume 9, Issue 4, December 2011, Pages 490-503

Abstract

The Web contains a large number of documents and an increasing quantity of structured data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries constitute the principal means to retrieve structured data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both textual and structured data can address more complex information needs. However, hybrid search in the large-scale Web environment faces several challenges. First, there is a need for repositories that can store and index a large amount of semantic data as well as textual data in documents, and manage them in an integrated way. Second, methods for hybrid query answering are needed to exploit the data from such an integrated repository. These methods should be fast and scalable, and in particular, they should support flexible ranking schemes to return not all but only the most relevant results. In this paper, we present CE², an integrated solution that leverages mature information retrieval and database technologies to support large-scale hybrid search. For scalable and integrated management of data, CE² integrates off-the-shelf database solutions with inverted indexes. Efficient hybrid query processing is supported through novel data structures and algorithms which allow advanced ranking schemes to be tightly integrated. Furthermore, a concrete ranking scheme is proposed to take features from both textual and structured data into account. Experiments conducted on DBpedia and Wikipedia show that CE² can provide good performance in terms of both effectiveness and efficiency.

Introduction

Recently, we have seen a strong increase in the availability of structured data on the Web, RDF in particular. This structured data might be associated with documents in the form of annotations or might be embedded in documents. The Web as it is today can be considered a large repository of interlinked documents, structured data and annotations. Examples of publicly available datasets on the Web that contain these different types of data are Wikipedia¹ (textual data), DBpedia² (annotations and domain-independent structured data) and DBLP³ (annotations and structured data in the bibliographic domain).

Currently, keyword search is commonly supported by commercial search engines for the retrieval of documents. Beyond document retrieval, structured data on the Web can support other retrieval scenarios. Instead of documents, a search engine might return exact answers to the users as the result of a complex question, represented as a structured query. With the increase of annotations available on the Web, there is a potential to support more expressive document retrieval to address more complex needs. As discussed in [33], annotations can be seen as an additional layer of information on top of the documents, which can be exploited to answer more complex queries. In addition, annotations in the form of RDFa⁴ and Microformats⁵ (data embedded in Web documents) are quite popular and have been used in search engines like Yahoo to improve retrieval accuracy [28].

Keyword queries and structured queries are the two principal means to find resources. More specialized systems such as Digital Library applications go further and employ hybrid queries that combine the advantages of both structured queries and keyword queries. This type of query is suitable for hybrid search [6], a paradigm that allows querying over textual and structured data in an integrated way. With hybrid search, the user can ask for documents or structured data, using both keywords and structural constraints. For example, a user can ask for pieces of data with descriptions containing a given keyword, e.g., “Find Turing Award Winners working at IBM that are associated with documents containing Algorithm”.
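To make the example query concrete, the following is a minimal sketch of how such a hybrid query could be represented and answered over toy in-memory data. The data, the `HybridQuery` class and the `answer` function are hypothetical and introduced only for illustration; they are not the representation used by CE².

```python
from dataclasses import dataclass

# Toy data invented for illustration (not taken from DBpedia or Wikipedia).
TRIPLES = {
    ("alice", "type", "TuringAwardWinner"),
    ("alice", "worksAt", "IBM"),
    ("bob", "type", "TuringAwardWinner"),
    ("bob", "worksAt", "MIT"),
    ("carol", "worksAt", "IBM"),
}
DOCUMENTS = {
    "doc1": "an efficient algorithm for relational query optimization",
    "doc2": "a history of early mainframes",
    "doc3": "notes on algorithm design",
}
# Annotations link a document to an entity it is about.
ANNOTATIONS = {("doc1", "alice"), ("doc2", "alice"), ("doc3", "bob")}


@dataclass(frozen=True)
class HybridQuery:
    structural: tuple  # (predicate, object) pairs the entity must satisfy
    keyword: str       # keyword that an associated document must contain


def answer(query):
    """Return entities that satisfy every structural constraint and are
    annotated on at least one document containing the keyword."""
    entities = {subject for (subject, _, _) in TRIPLES}
    matching = {
        e for e in entities
        if all((e, p, o) in TRIPLES for (p, o) in query.structural)
    }
    keyword = query.keyword.lower()
    docs_with_keyword = {
        d for d, text in DOCUMENTS.items() if keyword in text.lower().split()
    }
    return sorted(
        e for e in matching
        if any((d, e) in ANNOTATIONS for d in docs_with_keyword)
    )


query = HybridQuery(
    structural=(("type", "TuringAwardWinner"), ("worksAt", "IBM")),
    keyword="Algorithm",
)
print(answer(query))
```

Here the structural part restricts candidates to entities matching both triple patterns, and the keyword part is checked against the documents linked to each candidate through annotations.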

Hybrid search in a Web-scale environment, however, brings several challenges. It requires the capacity to store and index a large amount of textual and structured data. In order to answer user queries against this data, it requires scalable solutions for integrated processing of hybrid queries so that results can be returned within a reasonable amount of time. Another crucial aspect is ranking: given the volume and heterogeneity of Web resources, a query is likely to produce a large number of candidates, which may differ in many aspects, including quality and recency. Ranking is thus needed to help users focus on the most relevant results.

In this paper, we elaborate on infrastructure components that are necessary to support large scale hybrid search. This work specifically addresses the above challenges and the main contributions are listed as follows:

• We describe a unified framework to represent and to query documents and graph-structured data in an integrated way.
• We leverage mature information retrieval and database technologies to build a repository, which can scale over a large amount of documents and graph-structured data.
• We propose a novel data structure called Occurrence Probability Table (OPT) and on top of the data structure, a set of algorithms for hybrid query processing that allows a flexible integration of advanced ranking schemes.
• We elaborate on a concrete ranking scheme which propagates and aggregates scores along a data structure called answer tree.

The repository and the hybrid query engine implementing our approach are embedded into an integrated solution called CE². We have conducted experiments with CE² on RDF data contained in DBpedia and on documents from Wikipedia. Results show that CE² supports effective ranking and scales to millions of documents and RDF triples.

The rest of this paper is organized as follows: Section 2 presents a formal model of hybrid search including definitions of resources, queries, answers and ranking. Section 3 introduces the architecture of CE². In Sections 4 and 5, we elaborate on data storage and hybrid query processing in detail. We show our experimental results in Section 6. Related work is presented in Section 7 and conclusions in Section 8.

Section snippets

Hybrid search

In this section, we will present a formal model of hybrid search and elaborate on its components.

Definition 1

A hybrid search model is a quadruple 〈G, Q, F, R(q_i, d_j)〉 where

(1) G is a representation of resources.
(2) Q is a representation of user information needs (queries).
(3) F is a framework that models relationships between resources and queries. Given a query, F defines which resources constitute the answers.
(4) R(q_i, d_j) is a ranking function defined by R(q_i, d_j) ∈ (0, 1] iff d_j is an answer to q_i and R(q_i, d_j) = 0 otherwise.
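The constraint on R in item (4) can be illustrated with a small sketch. It assumes a hypothetical answer test `is_answer` (standing in for F) and a hypothetical non-negative `raw_score`, and normalizes raw scores so that answers fall in (0, 1] and non-answers get exactly 0. The normalization formula is an assumption made for this example, not the ranking scheme proposed in the paper.

```python
def rank_answers(query, resources, is_answer, raw_score):
    """Map each resource to a value R with R in (0, 1] iff the resource
    is an answer to the query, and R == 0 otherwise.

    raw_score is assumed to return a non-negative number.
    """
    raw = {r: raw_score(query, r) for r in resources if is_answer(query, r)}
    top = max(raw.values(), default=0.0)
    # Adding 1 keeps answers with a raw score of 0 strictly above 0;
    # dividing by (top + 1) caps the best answer at exactly 1.
    return {
        r: (raw[r] + 1.0) / (top + 1.0) if r in raw else 0.0
        for r in resources
    }


# Hypothetical toy instantiation: resources are short texts, a resource is
# an answer if it contains the query term, and the raw score is the term count.
def contains_term(query, resource):
    return query in resource.split()


def term_count(query, resource):
    return resource.split().count(query)


resources = ["hybrid search", "hybrid hybrid search", "databases"]
scores = rank_answers("hybrid", resources, contains_term, term_count)
print(scores)
```

Any scoring scheme can be plugged in this way as long as it preserves the two properties: a strictly positive, bounded value for answers and zero for everything else.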

CE² architecture

CE² is built to store, index and perform hybrid search on textual and structured data. Fig. 3 shows the decomposition of CE² into two main components. The first one is the repository: textual data associated with entities and documents as well as annotations are stored in separate inverted indexes. Structured data other than annotations is kept in a database. The second component is a hybrid query engine that is composed of several sub-modules. The Query Planner decomposes a query into several

Data storage and index

An efficient data storage and index scheme is essential to deal with a large amount of resources. Since resources are composed of textual and structured data, it is natural to combine IR and DB technologies. While a database offers efficient storage and advanced query optimization for structured data, inverted indexes have been successfully employed to deal with a large amount of texts. Hence, we use a database to store entity resources, namely triples of the form e(v₁, v₂), where e ∈ L⧹{keyword,

Query evaluation process

This section describes the whole process of query processing. At first, the hybrid query is decomposed into several sub-queries. Each sub-query is then evaluated by the corresponding atomic query executor. The returned answers and their associated scores are propagated and aggregated according to an optimized query plan, which reflects the ranking principles mentioned in Section 2. Finally, the answers are ranked on the basis of the calculated scores.
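The steps above can be sketched as a small pipeline: decompose the query, run each sub-query through an executor for its type, combine the partial scores, and rank. Everything in this sketch is an assumption made for illustration: the toy indexes, the function names, and in particular the aggregation rule (intersecting answers and multiplying scores). It does not model the OPT data structure or the optimized query plan of CE².

```python
# Toy inverted index: term -> {entity: relevance score}. Values are invented.
KEYWORD_INDEX = {"algorithm": {"alice": 0.9, "bob": 0.4, "dave": 0.5}}

# Toy structured store: (predicate, object) -> {entity: score}.
TRIPLE_STORE = {
    ("type", "TuringAwardWinner"): {"alice": 1.0, "bob": 1.0, "dave": 1.0},
    ("worksAt", "IBM"): {"alice": 1.0, "carol": 1.0, "dave": 1.0},
}


def decompose(query):
    """Split a hybrid query into typed atomic sub-queries."""
    subqueries = [("keyword", k.lower()) for k in query.get("keywords", [])]
    subqueries += [("triple", p) for p in query.get("patterns", [])]
    return subqueries


def execute_keyword(term):
    return KEYWORD_INDEX.get(term, {})


def execute_triple(pattern):
    return TRIPLE_STORE.get(pattern, {})


# One executor per sub-query type.
EXECUTORS = {"keyword": execute_keyword, "triple": execute_triple}


def evaluate(query):
    """Evaluate all sub-queries, aggregate scores, and rank the answers."""
    combined = None
    for kind, argument in decompose(query):
        partial = EXECUTORS[kind](argument)
        if combined is None:
            combined = dict(partial)
        else:
            # Assumed aggregation: keep entities present in both result
            # sets and multiply their scores.
            combined = {
                e: combined[e] * partial[e]
                for e in combined.keys() & partial.keys()
            }
    if combined is None:
        return []
    # Highest score first; ties broken by entity name for determinism.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


query = {
    "keywords": ["Algorithm"],
    "patterns": [("type", "TuringAwardWinner"), ("worksAt", "IBM")],
}
print(evaluate(query))
```

Because scores are carried through every step rather than computed after the fact, a different aggregation rule can be swapped in without changing the executors.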

Experiment

We have conducted all experiments on a workstation with four Pentium D 3.2 GHz processors and 4 GB of memory, running Sun JRE 1.5 on Microsoft Windows Server 2003. Resources comprise RDF data from DBpedia [3] and documents from Wikipedia. Together, they represent more than 4.3 million triples and 2.1 million documents, and about 42 million annotations. Textual data and annotations are maintained in inverted indexes implemented using Lucene 2.4.1⁷, and the rest of the RDF

Related work

There exist several categories of related work. We structure our discussion to successively cover the following aspects: (1) hybrid search model, (2) storage of RDF data, (3) IR and DB integration, and (4) ranking.

Hybrid search model – recently, search involving the combination of text, structured data and annotations has attracted much attention. In [18], a logical framework has been proposed to support a wide range of queries over annotations. The use of annotations for document retrieval has

Conclusions and future work

We have elaborated on a model for hybrid search. With respect to this model, we have leveraged database and IR technologies to scale over large amounts of textual and structured data. In particular, we have presented algorithms and a data structure called OPT to support hybrid query processing against these resources. Ranking plays a central role in our hybrid search model and is thus tightly integrated into query processing. We have provided an implementation called CE² for data storage and

References (37)

S. Brin et al., The anatomy of a large-scale hypertextual web search engine, Comput. Netw. (1998)
H. Wang et al., Semplore: a scalable IR approach to search the web of data, J. Web Sem. (2009)
D.J. Abadi et al., Scalable semantic web data management using vertical partitioning
K. Aberer, K.-S. Choi, N.F. Noy, D. Allemang, K.-I. Lee, L.J.B. Nixon, J. Golbeck, P. Mika, D. Maynard, R. Mizoguchi,...
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z.G. Ives, Dbpedia: a nucleus for a web of open data, in:...
B. Bamba, S. Mukherjea, Utilizing resource importance for ranking semantic web query results, in: C. Bussler, V....
H. Bast, I. Weber, The completesearch engine: Interactive, efficient, and towards IR & DB integration, in: CIDR, 2007....
R. Bhagdev et al., Hybrid search: effectively combining keywords and semantic searches
G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, S. Sudarshan, Keyword searching and browsing in databases using...
A. Bhaskar et al., Quark: an efficient xquery full-text implementation
B. Bhattacharjee, S. Padmanabhan, T. Malkemus, M. Huras, Efficient query processing for multi-dimensionally clustered...
C. Botev et al., A texquery-based xml full-text search engine
J. Broekstra, A. Kampman, F. van Harmelen, Sesame: a generic architecture for storing and querying rdf and rdf schema,...
P. Castells et al., An adaptation of the vector-space model for ontology-based information retrieval, IEEE Trans. Knowl. Data Eng. (2007)
G. Cheng et al., Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Semant. Web Inf. Syst. (2009)
M. d’Aquin, M. Sabou, M. Dzbor, C. Baldassarre, L. Gridinoc, S. Angeletou, E. Motta, WATSON: A gateway for the semantic...
L. Ding et al., Swoogle: a search and metadata engine for the semantic web
O. Erling, I. Mikhailov, RDF support in the virtuoso DBMS, in: S. Auer, C. Bizer, C. Müller, A.V. Zhdanova (Eds.),...
