Applying statistical principles to data fusion in information retrieval

https://doi.org/10.1016/j.eswa.2008.01.019

Abstract

Data fusion in information retrieval has been investigated by many researchers and quite a few data fusion methods have been proposed. However, their effect on effectiveness has not been well understood. In this paper, we apply statistical principles to data fusion and obtain some useful conclusions, which can be used as a guideline for data fusion methods. Based on that, CombSum, the linear combination methods, and the correlation methods can be justified in certain conditions. We also investigate how to improve the effectiveness of some existing data fusion methods such as CombSum and the linear combination method. Experimental results with TREC data are reported to support the conclusions.

Introduction

In information retrieval, a variety of representation techniques for queries and documents have been proposed, and many retrieval techniques have been developed to obtain higher retrieval effectiveness. These techniques are comparable in performance and there is no all-time winner. In such a situation, the primary idea of data fusion is to search the same document collection for the same information need using several independent information retrieval systems, or a single retrieval system with several different query representations or parameter settings, and then to merge the results from these different information retrieval mechanisms for better retrieval effectiveness. Previous research (e.g., Lee, 1997, Montague and Aslam, 2002, Vogt and Cottrell, 1999, Wu and McClean, 2006b, among others) demonstrates that data fusion is an effective technique for achieving better retrieval results.

Data fusion (also known as meta-search) has been investigated by many researchers and quite a few data fusion algorithms such as CombSum (Fox et al., 1993, Fox and Shaw, 1994), CombMNZ (Fox et al., 1993, Fox and Shaw, 1994), the linear combination methods (Vogt and Cottrell, 1998, Vogt and Cottrell, 1999), Borda fusion (Aslam & Montague, 2001), Bayesian fusion (Aslam & Montague, 2001), Condorcet fusion (Montague & Aslam, 2002), and the correlation methods (Wu and McClean, 2005, Wu and McClean, 2006b) have been proposed, and extensive experimentation has been conducted to evaluate these algorithms. In all these algorithms except Condorcet fusion, relevance scores are estimated for the documents in all component results, and then different functions are used to combine these scores in the merging process.

Condorcet fusion is very different from the others. It is borrowed from political science, where it is used for majority voting. It considers all possible head-to-head ranking competitions among all possible document pairs; all the documents can then be ranked according to the number of competitions they have won. In this paper, we do not investigate data fusion methods like Condorcet fusion. We focus our attention on data fusion methods which use a function to calculate scores for all the documents, and then rank them according to the calculated scores.

A key question about data fusion is: why does data fusion quite often bring improvement in effectiveness? One answer to this question is "the multiple evidence principle". This principle is generally true, because the more component results a document appears in, the more likely that document is relevant to the information need. However, it is not very precise: each of the proposed data fusion methods can be regarded as applying the multiple evidence principle in a particular way, but the principle itself does not tell us which of them is the best.

In this paper, we investigate this issue based on statistical principles and sampling theory. We shall specify the best way of calculating relevance probability scores and the three conditions that need to be satisfied. Furthermore, we shall discuss how the component results can be treated so that these conditions are better satisfied, and how to compensate for them when the conditions are not well satisfied. Some experimental results will also be reported to corroborate the conclusions.

The rest of this paper is organised as follows: in Section 2 we review some related work on data fusion. Section 3 discusses how to apply statistical principles and sampling theory to the data fusion problem. In particular, we discuss why some existing data fusion methods such as CombSum and the linear combination method are good methods and how we can use them for better effectiveness. Section 4 presents some further empirical investigation results. Finally, Section 5 concludes the paper.

Section snippets

Previous work

Fox and colleagues (Fox et al., 1993, Fox and Shaw, 1994) introduced a group of data fusion methods such as CombSum and CombMNZ. CombSum sets the score of each document in the combination to the sum of the scores obtained by the individual information retrieval mechanisms, while in CombMNZ the score of each document is determined by multiplying this sum by the number of mechanisms which provide non-zero scores. More formally, suppose we have a group of documents D = {d1, d2, …, dn} and m information
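To make the two combination rules above concrete, the following is a minimal sketch in Python, assuming each component result is represented as a dictionary mapping document identifiers to scores already normalised to a common range; the function names and the toy results r1, r2 are illustrative and not taken from the paper.

```python
from collections import defaultdict

def combsum(results):
    """CombSum: the fused score of each document is the sum of its scores
    across all component results (documents missing from a result contribute 0)."""
    fused = defaultdict(float)
    for result in results:
        for doc, score in result.items():
            fused[doc] += score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

def combmnz(results):
    """CombMNZ: the summed score is multiplied by the number of component
    results that assign the document a non-zero score."""
    total = defaultdict(float)
    nonzero = defaultdict(int)
    for result in results:
        for doc, score in result.items():
            total[doc] += score
            if score > 0:
                nonzero[doc] += 1
    fused = {doc: total[doc] * nonzero[doc] for doc in total}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Illustrative example: two component results with scores in [0, 1]
r1 = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
r2 = {"d1": 0.7, "d3": 0.5}
print(combsum([r1, r2]))
print(combmnz([r1, r2]))
```

In this toy example d1 is ranked first by both rules, since it receives high scores from both mechanisms; CombMNZ additionally rewards d3 over d2 because d3 is retrieved by both component results.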

Applying statistical principles to data fusion

In this section we describe the best way to calculate relevance scores according to statistical principles, and the conditions that need to be satisfied accordingly.

Definition 1

(Valid result) For a group of documents D = {d1, d2, …, dn} and a given query q, a retrieval result R = {r1, r2, …, rn} is a valid result if, for each i, 0 ≤ ri ≤ 1. Here ri denotes the estimated probability that document di is relevant to query q.

Geometrically, each valid result is a point in an n-dimensional space, where n is the number of
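Raw scores produced by a retrieval system are not necessarily in [0, 1], so a result has to be normalised before it can be treated as valid in the sense of Definition 1. The sketch below shows one common option, min-max normalisation; this particular mapping is only an illustration and not necessarily the treatment used in the paper.

```python
def min_max_normalise(raw_scores):
    """Map raw retrieval scores onto [0, 1] so that the result is 'valid'
    in the sense of Definition 1 (each value usable as a relevance
    probability estimate). raw_scores is a dict mapping doc -> raw score."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    if hi == lo:                      # degenerate case: all scores identical
        return {doc: 1.0 for doc in raw_scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in raw_scores.items()}
```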

Empirical investigation

In Section 3 we have analysed the data fusion problem based on statistical principles. In this section we carry out further empirical investigation into several issues: score normalisation, eliminating the effect of divergent effectiveness of component results by using performance weights, and eliminating the effect of unevenly distributed component results by using strata weights.
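As a complement to the description above, the following sketch shows how performance weights can be folded into the fusion: each component result is first normalised, then its scores are weighted by a factor reflecting that system's measured effectiveness (for example, its average precision on training topics). The helper name and the way the weights are obtained are assumptions for illustration, not the exact scheme evaluated in the experiments.

```python
from collections import defaultdict

def linear_combination(results, weights):
    """Weighted linear combination: the fused score of a document is the
    weighted sum of its (normalised) scores over all component results.
    results is a list of dicts doc -> score in [0, 1]; weights holds one
    non-negative performance weight per component result."""
    fused = defaultdict(float)
    for result, w in zip(results, weights):
        for doc, score in result.items():
            fused[doc] += w * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)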

Conclusions

In this paper we have analysed the data fusion problem based on statistical principles and sampling theory. We conclude that, when three conditions are met, namely effective component results, scores that are comparable within every component result and across different component results, and evenly distributed samples in the sample space, CombSum is the appropriate method for calculating relevance probability scores. Score normalization and performance weights are two good measures that can make

References (16)

  • Wu, S., et al. (2006). Performance prediction of data fusion for information retrieval. Information Processing & Management.
  • Aslam, J. A., & Montague, M. (2001). Models for metasearch. In Proceedings of the 24th annual international ACM SIGIR...
  • Beitzel, S., et al. (2004). On fusion of effective retrieval strategies in the same information retrieval system. Journal of the American Society of Information Science and Technology.
  • Cochran, W. G. (1963). Sampling techniques.
  • Fox, E. A., Koushik, M. P., Shaw, J., Modlin, R., & Rao, D. (1993). Combining evidence from multiple searches. In The...
  • Fox, E. A., & Shaw, J. (1994). Combination of multiple searches. In The second text retrieval conference (TREC-2) (pp....
  • Lee, J. H. (1997). Analysis of multiple evidence combination. In Proceedings of the 20th annual international ACM SIGIR...
  • Lillis, D., Toolan, F., Collier, R., & Dunnion, J. (2006). ProbFuse: a probabilistic approach to data fusion. In...
There are more references available in the full text version of this article.


A short version of this paper was published in the Proceedings of the 2007 IEEE International Conference on Systems, Man, and Cybernetics, October, 2007, Montreal, Canada, pp. 313–319.
