Collection-integral source selection for uncooperative distributed information retrieval environments
Introduction
The unprecedented increase in the creation of digital information [28] and the rapid proliferation of the Web have created an environment where users have to search multiple online information sources to satisfy their information needs. Conventional search engines, such as Google and Yahoo!, provide a solution to this problem by indexing multiple content-offering sources and providing a single point of search [26], [58], [1], [15], but they face a number of limitations.
The prohibitive size and rate of growth of the web make it impossible to index completely; it is therefore indexed only partially, potentially excluding valuable information. Additionally, significant amounts of up-to-date information are often unavailable because of the irregularity of index updates [27]. More importantly, a large number of web sites, collectively known as the invisible web [7], [41], are either not reachable by search engines or do not allow their content to be indexed by them, instead offering their own search solutions. Even publicly available, up-to-date and authoritative government information is often not indexable by search engines [33]. Studies have indicated that the size of the invisible web may be 2–50 times the size of the web indexable by search engines [7]. Last but not least, there are vast amounts of corporate knowledge accumulated in privately owned and managed networks that remain outside their reach [25]. Thus, a user posing a query to a general-purpose search engine may be missing high-quality, relevant information [6].
Distributed information retrieval (DIR) [8], [54], also known as federated search [48], offers a prospective solution to the above issues by giving users the capability of simultaneously searching multiple remote information sources, which may be general-purpose search engines or invisible web sites, through a single interface. Federated search systems therefore allow users to discover information that would otherwise be unavailable to them through a general-purpose search engine, or would require a significant amount of manual labor in order to query each individual information source.
The challenge posed by DIR is how to combine the results from multiple, independent, heterogeneous document collections into a single merged result so that the effectiveness of the combination approximates or even surpasses the effectiveness of searching the entire set of documents as a single collection, if one were possible.
The distributed information retrieval process can be perceived as three separate but often interleaved sub-processes: source representation [10], [47], in which surrogates of the available remote collections are created (this takes place before the user submits a query to the DIR system); source selection [9], [40], [47], [55], in which a subset of the available information sources is chosen to process the query once it has been submitted; and results merging [13], [48], [38], [45], [22], in which the separate result lists are combined into a single merged list that is returned to the user.
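The three sub-processes can be sketched as a simple pipeline. The sketch below is illustrative only: the class, the term-overlap scoring and the naive score-based merge are assumptions for the example, not the methods used in the paper.

```python
class Collection:
    """Toy stand-in for a remote, uncooperative search engine."""
    def __init__(self, name, docs):
        self.name, self.docs = name, docs

    def sample(self, n=2):
        # Source representation: a small surrogate of the collection,
        # e.g. obtained through query-based sampling in a real system.
        return self.docs[:n]

    def search(self, query):
        # The remote engine's own ranking (simple term overlap here).
        return sorted(((overlap(d, query), d) for d in self.docs), reverse=True)


def overlap(doc, query):
    # Number of query terms that appear in the document.
    return len(set(doc.split()) & set(query.split()))


def federated_search(query, collections, k_sources=2):
    # Source selection: score each collection's surrogate against the query.
    ranked = sorted(collections,
                    key=lambda c: sum(overlap(d, query) for d in c.sample()),
                    reverse=True)
    # Results merging: naive score-based merge of the selected sources' lists.
    merged = [hit for c in ranked[:k_sources] for hit in c.search(query)]
    return [doc for score, doc in sorted(merged, reverse=True) if score > 0]
```

In a real DIR system the representation step runs offline, while selection and merging run per query against independently administered engines.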
The focus of this paper is the source selection problem. Previous research [35], [51], [12] has demonstrated that the source selection phase is vital to the overall effectiveness of the retrieval process. Delegating the query to all available information sources is very inefficient in both time and bandwidth, and usually results in decreased effectiveness due to the introduction of "noise" into the final results in the form of nonrelevant documents. Therefore, it is important to select the most appropriate and relevant sources for each information need.
The algorithm presented here is based on a novel, integral-based modeling of collection relevance. It requires a centralized sample index [13], [38], containing all the documents sampled from the remote collections during the source representation phase. When a query is submitted, the algorithm models each available collection as a plot, using the relevance scores of its sampled documents against the centralized sample index and their intra-collection rankings, i.e. the positions of the documents relative to the other documents belonging to the same collection. The plot is subsequently modified in order to emphasize collections that have high-scoring documents and to dampen the effect of collections returning a large number of average-scoring documents. Lastly, the algorithm computes the family of linear interpolant functions that pass through the points of the modified plot for each collection, and ranks the information sources by the area they cover in the modified rank-relevance space. Based on this modeling, the algorithm provides an effective and efficient solution to the source selection problem, particularly in dynamic environments, without the need for any training.
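A minimal sketch of this scoring idea follows. The `power` exponent stands in for the plot modification that emphasizes high-scoring documents; the paper's exact modification function is not given in this excerpt, so the exponent and the function name are assumptions for illustration.

```python
def select_sources(sample_scores, n_select, power=2.0):
    """Rank collections by the area under a modified rank-relevance curve.

    `sample_scores` maps each collection to the relevance scores that its
    sampled documents obtained against the centralized sample index for the
    current query.
    """
    areas = {}
    for cid, scores in sample_scores.items():
        # Intra-collection ranking: sampled documents sorted by descending
        # score, then boosted so high scores dominate average ones.
        ys = sorted((s ** power for s in scores), reverse=True)
        # Area under the piecewise-linear interpolant over consecutive ranks
        # (trapezoidal rule with unit spacing between ranks).
        areas[cid] = (sum((ys[i] + ys[i + 1]) / 2.0 for i in range(len(ys) - 1))
                      if len(ys) > 1 else sum(ys))
    return sorted(areas, key=areas.get, reverse=True)[:n_select]
```

Note how a collection with a few high-scoring samples outranks one with many mediocre ones, since squaring shrinks average scores faster than high ones.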
The rest of the paper is structured as follows. Section 2 reports on prior work. Section 3 describes the new methodology proposed in this paper. Section 4 describes the setup of the experiments conducted. Section 5 reports and discusses the results obtained and Section 6 concludes the paper, summarizing the findings.
Section snippets
Prior work
Source selection has received considerable attention. In this section we present previous research that is relevant to our approach. We begin with an overview of the most prominent approaches to source representation, in order to explain the aims of that process and the methodologies that have been presented in the past. The goal of this overview is to introduce the reader to the actual environment in which most source selection algorithms have been designed to
Integral based source selection
The motivation behind the new source selection algorithm is to locate the collections with the most relevant documents for each query without requiring any training, thus making the algorithm applicable in dynamic environments such as the web, using a simple and novel integral-based metric for collection relevance. To achieve this goal, the algorithm moves away from treating remote collections as simple aggregations of documents and focuses on the individual sampled documents.
In order for the
Experiment setup
IR testbeds offer the capability of conducting experiments in a completely isolated and controlled environment, giving researchers the ability to easily test new theories and approaches and to verify reported results.
An IR testbed comprises three components: a document corpus, a set of topics (i.e. a number of predefined queries) and, lastly, a set of relevance judgments (the "right answers") for each of the topics. Each component of an IR testbed serves a specific purpose: The document
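The three components can be captured in a minimal data structure; the field names below are illustrative, not taken from any specific TREC format.

```python
from dataclasses import dataclass


@dataclass
class Testbed:
    """Minimal IR testbed: corpus, topics, and relevance judgments."""
    corpus: dict   # doc_id -> document text
    topics: dict   # topic_id -> predefined query
    qrels: dict    # topic_id -> set of doc_ids judged relevant ("right answers")

    def is_relevant(self, topic_id, doc_id):
        # True when the relevance judgments mark the document as a
        # "right answer" for the given topic.
        return doc_id in self.qrels.get(topic_id, set())
```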
Results
The performance of the algorithms under the recall measure is presented in Fig. 3, Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8. In the Uniform testbed (Fig. 3), most of the algorithms perform similarly. The relative uniformity of the distribution of relevant documents in this testbed, and the lack of complete information about the collections, make the goal of locating the collections that have the
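For reference, the recall of a collection ranking at cutoff k is commonly computed as below in source-selection evaluation: the relevant documents covered by the top-k selected collections, normalized by the best achievable coverage at the same cutoff. The paper's exact variant is not shown in this excerpt, so this definition is an assumption.

```python
def recall_at_k(ranking, rel_counts, k):
    """Recall of a collection ranking at cutoff k.

    `rel_counts` maps each collection to its number of relevant documents
    for the topic, as given by the testbed's relevance judgments.
    """
    # Relevant documents gathered by the evaluated ranking's top-k choices.
    gathered = sum(rel_counts.get(c, 0) for c in ranking[:k])
    # Best possible coverage: the k collections richest in relevant documents.
    ideal = sum(sorted(rel_counts.values(), reverse=True)[:k])
    return gathered / ideal if ideal else 0.0
```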
Conclusions and future work
A new integral-based source selection algorithm for uncooperative distributed information retrieval environments was presented in this paper. It proposes a novel metric for measuring collection relevance, based on the area that a collection covers in the rank-relevance space of its sampled documents. In the future, we intend to study the attributes of this metric more thoroughly, exploiting various integral-based properties that can be applied in the source selection process.
An important aspect
Acknowledgments
This paper is part of the 03ED404 research project, implemented within the framework of the “Reinforcement Programme of Human Research Manpower” (PENED) and co-financed by National and Community Funds (25% from the Greek Ministry of Development-General Secretariat of Research and Technology and 75% from E.U.-European Social Fund).
References (61)
- et al., An efficient algorithm for full text retrieval for multiple keywords, Inform. Sci. (1998)
- et al., Engineering a multi-purpose test collection for web retrieval experiments, Inform. Process. Manage. (2003)
- et al., Combining information from multiple search engines – preliminary comparison, Inform. Sci. (2010)
- et al., Real life, real users, and real needs: a study and analysis of user queries on the web, Inform. Process. Manage. (2000)
- A knowledge engineering approach to knowledge management, Inform. Sci. (2007)
- et al., Large-scale information retrieval with latent semantic indexing, Inform. Sci. (1997)
- et al., Grid-enabling data mining applications with DataMiningGrid: an architectural perspective, Future Generat. Comput. Syst. (2008)
- et al., Improving the performance of federated digital library services, Future Generat. Comput. Syst. (2008)
- et al., A grid-based architecture for personalized federation of digital libraries, Library Collect. Acquisit. Tech. Ser. (2006)
- et al., On the quality of resources on the web: an information retrieval perspective, Inform. Sci. (2007)