A link-based collection fusion strategy

https://doi.org/10.1016/S0306-4573(99)00019-9

Abstract

This paper presents a method for solving the collection fusion problem in hypermedia digital libraries. The proposition explored and evaluated is that across-document links between hypermedia documents residing in distributed hypermedia collections can supply enough useful information to allow effective collection fusion. In contrast to other collection fusion strategies, the link-based fusion strategy does not require a learning phase before it can be utilised, and it does not use any information from remote collections other than the returned list of documents. Because of these characteristics, the proposed fusion strategy is suitable for very large and extremely dynamic environments in which other collection fusion strategies (e.g. learning collection fusion strategies) may be inapplicable. Evaluation of the link-based fusion strategy demonstrates that it is more effective and efficient than the uniform strategy, which can be applied under the same conditions.

Introduction

The term collection fusion problem describes the problem posed for distributed information retrieval (IR) by the need to select sources likely to provide relevant information and to produce a single integrated set of search results in a form that information seekers can examine effectively. Voorhees et al. (1994) concisely characterise collection fusion as the data fusion problem in which the results of query runs in different, autonomous and distributed document collections must be merged to produce a single, effective result.
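
To make the merging half of the problem concrete, the following is a minimal sketch of the simplest possible merge: interleaving the ranked lists returned by each collection, one rank at a time. It is an illustration only and not the link-based strategy of this paper; the function name `round_robin_merge`, the collection names and the document identifiers are all hypothetical.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists, limit=None):
    """Interleave ranked lists rank by rank, skipping duplicate documents.

    Only the ranked lists themselves are used: no relevance scores and
    no collection statistics (assumption: document ids are strings).
    """
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are not cut off.
    for tier in zip_longest(*ranked_lists):
        for doc in tier:
            if doc is None or doc in seen:
                continue
            seen.add(doc)
            merged.append(doc)
    return merged if limit is None else merged[:limit]


# Hypothetical results returned by three autonomous collections.
results = {
    "collection_a": ["a1", "a2", "a3"],
    "collection_b": ["b1", "b2"],
    "collection_c": ["c1"],
}
merged = round_robin_merge(list(results.values()))
print(merged)
```

A merge of this kind treats every collection as equally useful for the query, which is exactly the assumption that more informed fusion strategies try to improve upon.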

In conventional information seeking environments the searching process is confined to a single collection, so the source selection problem is trivial. Likewise, preparing the search results for user examination is less complicated, since all the candidate documents come from a single collection. Nowadays, however, computer networks are proliferating and an ever-increasing number of diverse distributed information sources are becoming available. The term digital libraries (DLs) may be used to describe these highly dynamic, interactive and distributed information seeking environments (Fox et al., 1995). Hypermedia digital libraries (HDLs) are digital libraries based on a hypermedia paradigm (Balasubramanian, 1995). Because HDLs comprise multiple collections, their users face the collection fusion problem.

Existing collection fusion strategies can be classified using two criteria. The first criterion is whether a learning phase is needed before a collection fusion strategy can be utilised. Voorhees et al. (1995) report two isolated fusion strategies which require training queries. The first strategy is called modelling relevant document distribution (MRDD) and uses a set of training queries to build an explicit model of the distribution of relevant documents in the individual collections. The other is called the query clustering (QC) fusion strategy and learns a measure of the quality of search for a particular topic area of the collections.

The second criterion classifies fusion strategies into two categories (Voorhees, 1996): those using only the returned ranked list of documents to produce the single result (isolated strategies) and those using additional data from remote collections to merge the results (integrated strategies). Since integrated strategies have access to additional information, they can be expected to be more effective than isolated strategies. Their shortcoming is that they demand more network resources, since additional information must be exchanged, and they may involve more steps, which potentially makes them less efficient. The MRDD and QC fusion strategies are isolated methods. Callan et al. (1995) have presented an integrated fusion strategy using inference networks. This approach is used to rank document collections instead of documents in a single collection. Collection-wide statistics such as icf (measuring the proportion of collections containing the term) are used in this collection fusion strategy. A similar approach was presented in the GLOSS system (Gravano et al., 1994), which produces an estimate of the relevant documents in a collection by using some collection-wide statistics.
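
As a rough illustration of the kind of collection-wide statistic an integrated strategy depends on, the sketch below computes an icf-style weight by analogy with inverse document frequency: the logarithm of the number of collections divided by the number of collections containing the term. This simple form is an assumption made for illustration; it is not claimed to be the exact formula used by Callan et al. (1995), and the function name and vocabularies are hypothetical.

```python
import math


def inverse_collection_frequency(term, collection_vocabularies):
    """Return log(N / cf) for a term, or 0.0 if no collection contains it.

    N  = number of collections
    cf = number of collections whose vocabulary contains the term
    (Illustrative idf-style form, assumed here rather than taken from the source.)
    """
    n = len(collection_vocabularies)
    cf = sum(1 for vocab in collection_vocabularies if term in vocab)
    if cf == 0:
        return 0.0
    return math.log(n / cf)


# Hypothetical vocabularies of four remote collections.
vocabularies = [
    {"fusion", "link"},
    {"fusion"},
    {"fusion", "hypermedia"},
    {"retrieval"},
]

# A term found in few collections discriminates between collections better
# than a term found in most of them, so it receives a higher weight.
icf_link = inverse_collection_frequency("link", vocabularies)
icf_fusion = inverse_collection_frequency("fusion", vocabularies)
print(icf_link > icf_fusion)
```

Note that computing even this simple statistic requires each collection to expose its vocabulary, which is precisely the extra exchange of information that isolated strategies avoid.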

This paper discusses the collection fusion problem in the specific context of highly dynamic and distributed environments such as digital libraries or the World Wide Web. An important observation within this context is that collection fusion strategies requiring a learning phase before they can be utilised are clearly less convenient for large and very dynamic electronic environments. Learning may involve computation over large amounts of data and therefore be time consuming. It will also usually require some sort of training data. Furthermore, if the learning phase produces data which directly or indirectly relate to the content of documents, any change to the content (e.g. adding new documents or changing existing ones) will gradually invalidate the data produced by the learning phase. A new learning phase will therefore frequently be needed to regenerate a (representative) set of learning data.

Another concern which needs careful consideration is the use of integrated strategies. Digital libraries are heterogeneous environments with no central authority. Therefore, it is not realistic to presuppose (as the integrated fusion strategies do) that additional information beyond the ranked list of documents can be provided. For example, some of the integrated collection fusion strategies require individual results to be supplied together with numerical relevance scores. However, in heterogeneous environments the retrieval systems conducting the actual searching may not be able to produce, and therefore to provide, relevance scores (e.g. simple Boolean-based IR). Also, exchange of additional information from remote collections beyond the ranked list of documents may be inefficient. Furthermore, although experiments (Viles & French, 1995) have shown that complete exchange of collection-wide statistics is not necessary to achieve a sufficient level of effectiveness, the need for partial exchange of such information in large electronic environments will potentially lead to large communication and time costs.

Finally, some of the fusion strategies reviewed earlier do not focus on source selection (e.g. fusion with inference networks). Instead, they retrieve documents from all available distributed collections and concentrate on the effective merging of the results (i.e. placing the most relevant documents appropriately high in the ranked list). This clearly has a negative effect on efficiency. For distributed environments such as digital libraries, it is probably more desirable to strike an appropriate balance between increasing effectiveness and minimising the number of libraries involved (i.e. efficiency). In fact, when this approach has been tested (i.e. an effort to minimise the libraries involved), the effectiveness results were little affected (Callan et al., 1995).

The aim of the research which is presented in this paper was to design, implement and evaluate a collection fusion strategy which can be applied in very large and dynamic information seeking environments. To attain this goal the proposed collection fusion strategy does not require any learning phase to be undertaken before it can be utilised nor does it use additional information from remote collections other than the returned list of documents.

The proposed method utilises links between documents in hypermedia collections to provide a solution to the collection fusion problem. The method presupposes the presence of hypermedia links and can be utilised in any hypermedia environment comprising different distributed hypermedia collections. In that sense, it differs from the collection fusion strategies mentioned earlier, because these methods can be applied in any distributed environment (i.e. without links between documents). However, our method is very easily applicable in extremely dynamic and highly distributed electronic environments, where learning or integrated fusion strategies may not be easily applicable.

A more extensive review of the background of this work may be found in Salampasis (1997).

A link-based collection fusion strategy

From the early days of computerised IR research, relationships between documents based on bibliographic links have been utilised for a variety of reasons (e.g. Salton, 1971). More recently, links have been utilised in different settings in order to increase the effectiveness of information retrieval. Savoy (1996) developed an extended vector processing scheme using additional information expressed by bibliographic links, in order to increase the effectiveness of retrieval in hypermedia

Aims

The aim of the experiment is to evaluate the effectiveness and the efficiency of the link-based collection fusion strategy and to compare it with other collection fusion strategies. The uniform fusion strategy is used as the baseline against which our link-based fusion method is evaluated. This decision is made because it is generally accepted that in order to prove its usefulness, a fusion strategy must be compared with the uniform strategy (Voorhees et al., 1995). The uniform method assumes

Conclusions

We presented, evaluated and discussed a new link-based collection fusion strategy. The fusion strategy uses a new set of algorithms and methods to solve the problem of merging results from multiple parallel searches of different collections. It is novel because for the first time it introduces the use of links to solve the collection fusion problem. In the past, several research efforts have shown that retrieval effectiveness can be increased if links are incorporated into classical IR

References (26)

  • Frakes, M. & Yates, B. (Eds.) (1992). Information retrieval: data structures and algorithms. New York: Prentice...
  • Frei, P. et al. Making use of hypertext links when retrieving information.
  • Frisse, M. (1988). Searching for information in a hypertext medical book. Communications of the ACM.