
Information Sciences

Volume 180, Issue 14, 15 July 2010, Pages 2763-2776

Collection-integral source selection for uncooperative distributed information retrieval environments

https://doi.org/10.1016/j.ins.2010.03.020

Abstract

We propose a new integral-based source selection algorithm for uncooperative distributed information retrieval environments. The algorithm models each source as a plot, using the relevance scores and the intra-collection positions of its sampled documents with reference to a centralized sample index. Based on this modeling, the algorithm locates the collections that contain the most relevant documents. A number of transformations are applied to the original plot in order to reward collections that have higher-scoring documents and to dampen the effect of collections returning an excessive number of documents. For each available source, the family of linear interpolant functions passing through the points of the modified plot is computed, and the area they cover in the rank-relevance space is calculated. Information sources are ranked by the area they cover. Using this novel metric for collection relevance, the algorithm is tested on a variety of testbeds in both recall- and precision-oriented settings, and its performance is found to be better than or at least equal to that of previous state-of-the-art approaches, overall constituting a very effective and robust solution.

Introduction

The unprecedented increase in the creation of digital information [28] and the rapid proliferation of the Web have created an environment where users must search multiple online information sources to satisfy their information needs. Conventional search engines, such as Google and Yahoo!, provide a solution to this problem by indexing multiple content-offering sources and providing a single point of search [26], [58], [1], [15], but they face a number of limitations.

The prohibitive size and rate of growth of the web make it impossible to index completely; it is therefore indexed only partially, potentially excluding valuable information. Additionally, significant amounts of up-to-date information are often unavailable because of the irregularity of index updates [27]. More importantly, a large number of web sites, collectively known as the invisible web [7], [41], are either not reachable by search engines or do not allow their content to be indexed, instead offering their own search solutions. Even publicly available, up-to-date and authoritative government information is often not indexable by search engines [33]. Studies have indicated that the invisible web may be 2–50 times the size of the web indexable by search engines [7]. Last but not least, vast amounts of corporate knowledge accumulated in privately owned and managed networks remain outside the reach of search engines [25]. Thus, a user posing a query to a general-purpose search engine may be missing highly relevant, high-quality information [6].

Distributed information retrieval (DIR) [8], [54], also known as federated search [48], offers a prospective solution to the above issues by giving users the capability to simultaneously search multiple remote information sources, which may be general-purpose search engines or invisible web sites, through a single interface. Federated search systems therefore allow users to discover information that would otherwise be unavailable to them through a general-purpose search engine, or that would require a significant amount of manual labor to obtain by querying each information source individually.

The challenge posed by DIR is how to combine the results from multiple, independent, heterogeneous document collections into a single merged result list, in such a way that the effectiveness of the combination approximates or even surpasses the effectiveness of searching the entire set of documents as a single collection, were such a collection possible.

The distributed information retrieval process can be perceived as three separate but often interleaved sub-processes:

  • Source representation [10], [47], in which surrogates of the available remote collections are created. This process takes place before the user submits a query to the DIR system.
  • Source selection [9], [40], [47], [55], in which a subset of the available information sources is chosen to process the query once it has been submitted.
  • Results merging [13], [48], [38], [45], [22], in which the separate result lists are combined into a single merged list that is returned to the user.
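To illustrate how these three sub-processes fit together, the following is a minimal, hypothetical Python sketch of a query's flow through a DIR system; every name, the trivial substring-counting score and the round-robin merge are assumptions made for exposition, not the method of any particular system.

# Hypothetical skeleton of the DIR query flow; all names are illustrative.

def represent(collection_docs, sample_size=300):
    # Source representation (offline): keep a sample of each
    # collection's documents as its surrogate.
    return collection_docs[:sample_size]

def select(query, surrogates, n_select):
    # Source selection (query time): score each collection's surrogate
    # against the query and keep the n_select best collections.
    # A trivial substring-counting score stands in for a real algorithm.
    def score(docs):
        return sum(doc.count(query) for doc in docs)
    ranked = sorted(surrogates, key=lambda cid: score(surrogates[cid]), reverse=True)
    return ranked[:n_select]

def merge(result_lists):
    # Results merging: interleave the per-collection result lists
    # round-robin; real systems normalize scores instead (zip truncates
    # to the shortest list, which is acceptable for a sketch).
    merged = []
    for group in zip(*result_lists):
        merged.extend(group)
    return merged

# Toy end-to-end run.
collections = {
    "news": ["alpha beta", "beta gamma"],
    "sports": ["delta epsilon", "alpha alpha"],
}
surrogates = {cid: represent(docs) for cid, docs in collections.items()}
selected = select("alpha", surrogates, n_select=1)
results = [[d for d in collections[cid] if "alpha" in d] for cid in selected]
print(merge(results))  # -> ['alpha alpha']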

The focus of this paper is the source selection problem. Previous research [35], [51], [12] has demonstrated that the source selection phase is vital to the overall effectiveness of the retrieval process. Delegating the query to all available information sources is very inefficient in both time and bandwidth, and usually results in decreased effectiveness due to the introduction of "noise", in the form of nonrelevant documents, into the final results. Therefore, it is important to select the most appropriate and relevant sources for each information need.

The algorithm presented here is based on a novel, integral-based modeling of collection relevance. It requires a centralized sample index [13], [38] containing all the documents sampled from the remote collections during the source representation phase. When a query is submitted, the algorithm models each available collection as a plot, using the relevance scores of its sampled documents with reference to the centralized sample index and their intra-collection rankings, i.e. the positions of the documents relative to other documents belonging to the same collection. The plot is subsequently modified to emphasize collections that have high-scoring documents and to dampen the effect of collections returning a large number of average-scoring documents. Lastly, for each collection the algorithm computes the family of linear interpolant functions that pass through the points of the modified plot. Information sources are ranked based on the area they cover in the modified rank-relevance space. Through this modeling, the algorithm provides an effective and efficient solution to the source selection problem, particularly in dynamic environments, without the need for any training.
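To make the above steps concrete, the following is a minimal Python sketch of the scoring procedure. The squaring transformation is a simplified placeholder for the transformations the algorithm actually applies (defined in Section 3), and all names are illustrative; this is a sketch of the idea, not the paper's exact implementation.

# Sketch of integral-based source selection against a centralized
# sample index (CSI). csi_results is assumed to be a list of
# (collection_id, relevance_score) pairs for the sampled documents
# retrieved for the query.
from collections import defaultdict

def select_sources(csi_results, n_select):
    # Group the sampled documents' scores by their source collection.
    by_collection = defaultdict(list)
    for collection_id, score in csi_results:
        by_collection[collection_id].append(score)

    areas = {}
    for cid, scores in by_collection.items():
        # Intra-collection ranking: order each collection's sampled
        # documents by decreasing score; rank is the x-axis of the plot.
        scores.sort(reverse=True)
        # Placeholder transformation rewarding high-scoring documents.
        ys = [s ** 2 for s in scores]
        # Area under the piecewise-linear interpolant of the plot
        # (composite trapezoidal rule over unit-spaced ranks); a
        # collection with a single sampled document gets zero area
        # in this simplified version.
        areas[cid] = sum((ys[i] + ys[i + 1]) / 2.0 for i in range(len(ys) - 1))

    # Rank collections by the area they cover in the rank-relevance space.
    ranked = sorted(areas, key=areas.get, reverse=True)
    return ranked[:n_select]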

The rest of the paper is structured as follows. Section 2 reports on prior work. Section 3 describes the new methodology proposed in this paper. Section 4 describes the setup of the experiments conducted. Section 5 reports and discusses the results obtained and Section 6 concludes the paper, summarizing the findings.

Section snippets

Prior work

Source selection has received considerable attention. In this section we present previous research relevant to our approach. We start with an overview of the most prominent approaches to source representation, in order to explain the aims of that process and the methodologies that have been presented in the past. The goal of this overview is to introduce the reader to the environment in which most source selection algorithms have been designed to operate.

Integral-based source selection

The motivation behind the new source selection algorithm is to locate the collections with the most relevant documents for each query, without requiring any training, thus making it applicable in dynamic environments such as the web, using a simple and novel integral-based metric for collection relevance. To achieve this goal, the algorithm moves away from treating remote collections as simple aggregations of documents and focuses on the individual sampled documents.


Experiment setup

IR testbeds offer the capability of conducting experiments in a completely isolated and controlled environment, giving researchers the ability to easily test new theories and approaches and to verify reported results.

An IR testbed comprises three components: a document corpus, a set of topics (i.e. a number of predefined queries) and a set of relevance judgments (the "right answers") for each of the topics. Each component of an IR testbed serves a specific purpose.
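As a minimal illustration, the three components might be held together in code as follows (the field names and types are assumptions for exposition, not tied to any particular TREC distribution):

from dataclasses import dataclass

@dataclass
class Testbed:
    corpus: dict   # doc_id -> document text
    topics: dict   # topic_id -> query text
    qrels: dict    # topic_id -> set of doc_ids judged relevant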

Results

The performance of the algorithms under the recall measure Rk is presented in Figs. 3–8. In the Uniform testbed (Fig. 3), most of the algorithms perform similarly. The relative uniformity of the distribution of relevant documents in this testbed, together with the lack of complete information about the collections, makes the goal of locating the collections that have the most relevant documents a difficult one.
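For reference, the recall measure Rk is conventionally defined in the source selection literature as below, where B_i is the number of relevant documents in the i-th collection of the evaluated ranking and E_i is the number of relevant documents in the i-th collection of the optimal, relevance-based ranking; this is the standard formulation, and the paper's exact definition appears in the full experiment setup section.

\[
R_k = \frac{\sum_{i=1}^{k} B_i}{\sum_{i=1}^{k} E_i}
\]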

Conclusions and future work

A new integral-based source selection algorithm for uncooperative distributed information retrieval environments was presented in this paper. It proposes a novel metric for measuring collection relevance, based on the area that a collection covers in the rank-relevance space of its sampled documents. In the future, we intend to study the attributes of this metric more thoroughly, exploiting various integral-based properties that can be applied in the source selection process.


Acknowledgments

This paper is part of the 03ED404 research project, implemented within the framework of the “Reinforcement Programme of Human Research Manpower” (PENED) and co-financed by National and Community Funds (25% from the Greek Ministry of Development-General Secretariat of Research and Technology and 75% from E.U.-European Social Fund).

References (61)

  • Javed A. Aslam, Mark Montague, Models for metasearch, in: SIGIR '01, ACM, New York, NY, USA, 2001, pp. ...
  • Thi Truong Avrahami et al., The FedLemur project: federated search in the real-world, J. Am. Soc. Inform. Sci. Technol. (2006)
  • M. Baillie, L. Azzopardi, F. Crestani, Adaptive query-based sampling of distributed collections, in: Proc. SPIRE Conf., ...
  • M.M. Sufyan Beg, A subjective measure of web search quality, Inform. Sci. (2005)
  • Michael K. Bergman, The deep web: surfacing hidden value, September ...
  • J. Callan, Advances in Information Retrieval (2000)
  • James P. Callan, Zhihong Lu, W. Bruce Croft, Searching distributed collections with inference networks, in: SIGIR '95: ...
  • Jamie Callan et al., Query-based sampling of text databases, ACM Trans. Inform. Syst. (2001)
  • J.P. Callan, W.B. Croft, S.M. Harding, INQUERY retrieval system, in: DEXA-92, Third International Conference on ...
  • Suleyman Cetintas, Luo Si, Hao Yuan, Learning from past queries for resource selection, in: CIKM '09: Proceeding of the ...
  • Nick Craswell, David Hawking, Paul Thistlewaite, Merging results from isolated search engines, in: Proceedings of the ...
  • Norbert Fuhr, A decision-theoretic approach to database selection in networked IR, ACM Trans. Inform. Syst. (1999)
  • L. Gravano, K. Chang, H. Garcia-Molina, A. Paepcke, STARTS: Stanford protocol proposal for internet retrieval and ...
  • Luis Gravano et al., GlOSS: text-source discovery over the internet, ACM Trans. Database Syst. (1999)
  • D.K. Harman, Overview of the fourth text retrieval conference (TREC-4), in: D.K. Harman (Ed.), The Fourth Text REtrieval ...
  • Donna Harman, Overview of the first TREC conference, in: SIGIR '93: Proceedings of the 16th Annual International ACM ...
  • David Hawking et al., Methods for information server selection, ACM Trans. Inform. Syst. (1999)
  • David Hawking, Paul Thomas, Server selection methods in hybrid portal search, in: SIGIR '05: Proceedings of the 28th ...
  • Nadine Höchstötter et al., What users see – structures in search engine results pages, Inform. Sci. (2009)
  • Carl Lagoze et al., The Open Archives Initiative: building a low-barrier interoperability framework, Joint Conference on Digital Libraries (2001)