1 Introduction

Attaining high precision at the very top ranks of a document list returned in response to a query is an important challenge that search engines have to address. To that end, researchers have proposed, among other approaches, a re-ranking paradigm: automatically re-ordering the documents in an initially retrieved list so as to improve precision at top ranks (e.g., Willett 1985; Kleinberg 1997; Liu and Croft 2004; Diaz 2005; Kurland and Lee 2005). The motivation is that the ratio of relevant to non-relevant documents in the initial list is often much larger than that in the entire corpus.

We present a novel approach to re-ranking an initially retrieved list. Our approach utilizes information induced from a second list that is retrieved in response to the query. The second list could be produced by using, for example, a different retrieval method and/or different query/document representations (Croft 2000b). Specifically, we exploit similarities between documents in the two lists to re-rank the initial list.

Indeed, it has long been acknowledged that fusing retrieved lists—i.e., conceptually combining “experts’ recommendations”—is quite effective for retrieval (Thompson 1990; Fox and Shaw 1994; Lee 1997; Vogt and Cottrell 1999; Croft 2000b; Aslam and Montague 2001; Dwork et al. 2001; Montague and Aslam 2002; Beitzel et al. 2004; Lillis et al. 2006; Shokouhi 2007). Many fusion methods “reward” documents that are highly ranked in many of the lists. The effectiveness of this principle is often attributed to the fact that there is high overlap between the relevant documents in the lists and low overlap between the non-relevant documents (Lee 1997). However, it turns out that on many occasions the lists to be fused contain different relevant documents (Das-Gupta and Katzer 1983; Griffiths et al. 1986; Soboroff et al. 2001; Beitzel et al. 2003; Beitzel et al. 2004), especially if highly effective retrieval strategies are used to produce the lists (Soboroff et al. 2001; Beitzel et al. 2003; Beitzel et al. 2004). This fact can have a positive effect on the recall of the resultant fused list, specifically when the relevant documents are highly ranked in the lists to be fused (Beitzel et al. 2003; Beitzel et al. 2004). However, the different relevant documents in the lists do not provide relevance-status support to each other so as to boost the chances that they will end up at the highest ranks of the final fused list.

Our models address this potential relevant-set “mismatch” by letting similar—but not necessarily identical—documents provide relevance-status support to each other; after all, similar documents can potentially be viewed as discussing the same topics. Specifically, if relevant documents are assumed to be similar following the cluster hypothesis (van Rijsbergen 1979), then they can “support” each other via inter-document similarities. Thus, the basic principle underlying our methods—inspired by work on re-ranking a list using inter-document similarities within that list (Diaz 2005; Kurland and Lee 2005)—is as follows. We reward documents in the initial list that are highly ranked, and that are similar to documents that are highly ranked in the second list.

Our models are shown to be effective in re-ranking TREC runs (Voorhees and Harman 2005). For example, we demonstrate the effectiveness of our models in re-ranking the best-performing run in a track using the second-best performing run; and, in re-ranking a randomly selected run using a second randomly selected run. The performance of our models also compares favorably with that of a highly effective fusion method used to integrate the two runs, namely, CombMNZ (Fox and Shaw 1994; Lee 1997).

Our models are also effective in addressing the long-standing challenge of integrating the results produced by standard document-based retrieval with those produced by cluster-based retrieval (Jardine and van Rijsbergen 1971; Croft 1980). It was observed that using the two retrieval paradigms yields result lists with different sets of relevant documents—a scenario motivating the development of our models (Griffiths et al. 1986). Indeed, using our methods to re-rank the result list of document-based retrieval utilizing that of cluster-based retrieval yields performance that is superior to that of using each alone. Furthermore, the performance is also superior to that of using effective fusion methods to integrate the two lists; and, to that of a state-of-the-art re-ranking method that operates only on the document-based retrieved list.

2 Retrieval framework

We use q and d to denote a query and a document, respectively. Our goal is to re-rank an initial list of documents, \({{\mathcal{L}}_{\rm init}}\), which was retrieved in response to q by some search algorithm that was run over a given corpus, so as to improve precision at top ranks. To that end, we assume that a second list, \({{\mathcal{L}}_{\rm help}}\), was also retrieved in response to q using, for example, a different search algorithm or/and query representation.

To indicate that d is a member of list \(\mathcal{L}\), we write \(d \in \mathcal{L}\). We use \({Score_{\mathcal{L}}(d)}\) to denote d’s non-negative retrieval score in \(\mathcal{L}\); for \(d \not \in \mathcal{L},\,Score_{\mathcal L}(d)\, {\mathop{=}\limits^{\rm def}}\, 0\). The methods we present use a measure sim(x, y) of the similarity between texts x and y; we describe the measure in Sect. 5. We will also make use of Kronecker’s delta function: for argument \(s,\,{\delta\left[s\right]} =1\) if s holds, and 0 otherwise.

2.1 Similarity-based re-ranking

One potential way to re-rank \({{\mathcal{L}}_{\rm init}}\) using \({{\mathcal{L}}_{\rm help}}\) is by rewarding documents that are highly ranked in \({{\mathcal{L}}_{\rm init}}\), and that are also positioned at high ranks of \({{\mathcal{L}}_{\rm help}}\). Indeed, such a ranking principle underlies most standard approaches for fusion of retrieved lists (Fox and Shaw 1994; Lee 1997; Croft 2000b). However, there is evidence that different retrieved lists might contain different relevant documents (Das-Gupta and Katzer 1983; Griffiths et al. 1986; Soboroff et al. 2001). This observation holds whether the lists are retrieved in response to different query representations (Das-Gupta and Katzer 1983), or produced by different retrieval algorithms using the same query representation (Griffiths et al. 1986). Hence, standard fusion methods can fall short in these cases.

To address this challenge, we can exploit a potentially rich source of information not utilized by current fusion methods, namely, inter-document similarities. For example, if the top-ranked document in \({{\mathcal{L}}_{\rm init}}\) is not the same document as the top-ranked one in \({{\mathcal{L}}_{\rm help}}\), but their content overlaps to a large extent, then the latter can potentially provide relevance-status support to the former. More generally, documents that are highly ranked in \({{\mathcal{L}}_{\rm help}}\), and hence, are presumed to be relevant by the ranking method that created \({{\mathcal{L}}_{\rm help}}\), can provide relevance-status support to those documents in \({{\mathcal{L}}_{\rm init}}\) that they are similar to.

Thus, the following (re-)ranking principle for documents in \({{\mathcal{L}}_{\rm init}}\) emerges. A document in \({{\mathcal{L}}_{\rm init}}\) should be ranked high if it is (1) initially highly ranked in \({{\mathcal{L}}_{\rm init}}\), and (2) similar to (many) documents that are highly ranked in \({{\mathcal{L}}_{\rm help}}\).

The qualitative relevance-scoring principle just described conceptually generalizes the one underlying standard fusion approaches. That is, if we deem two documents “similar” if and only if they are the same document, then a document is presumed to be relevant if it is highly ranked in both retrieved lists. Furthermore, this relevance-scoring principle is conceptually a generalization of recently proposed approaches for re-ranking a list based on inter-document similarities within the list (Baliński and Daniłowicz 2005; Diaz 2005; Kurland and Lee 2005). Such methods reward documents in the list that are both highly ranked and highly similar to many other highly ranked documents. Thus, if \({{\mathcal{L}}_{\rm help}}\) is set to be \({{\mathcal{L}}_{\rm init}}\), our relevance-scoring principle echoes these approaches.

2.2 Algorithms

Following some previous work on re-ranking search results (Diaz 2005; Kurland and Lee 2005), we let document \(d_h\) in \({{\mathcal{L}}_{\rm help}}\) provide relevance-status support only to the documents in \({{\mathcal{L}}_{\rm init}}\) that are most similar to it—i.e., its α nearest neighbors in the similarity space, defined as follows:

Definition 1

Let \(d_h\) be a document in \({{\mathcal{L}}_{\rm help}}\). Neighbors(\(d_h\); α) is the set of α documents d in \({{\mathcal{L}}_{\rm init}}\) that yield the highest sim(\(d_h\), d); α is a free parameter. Ties are broken by document ID.

Thus, document d in \({{\mathcal{L}}_{\rm init}}\) gets relevance-status support from its set of supporters: the documents in \({{\mathcal{L}}_{\rm help}}\) that it is among the nearest neighbors of; formally,

Definition 2

\( Supporters(d)\, {\mathop{=}\limits^{\rm def}}\, \{d_h \in {\mathcal L}_{\rm help}: d \in Neighbors(d_h; \alpha)\} \).

We can now quantify the relevance-scoring principle stated above. That is, we reward document d in \({{\mathcal{L}}_{\rm init}}\) if it is initially highly ranked, and if its supporters are highly ranked in \({{\mathcal{L}}_{\rm help}}\) and provide d with a large extent of support; we use the inter-document similarity estimate to measure the level of support. The resultant algorithm, SimRank, then scores d by:

$$ Score_{\rm SimRank}(d)\, {\mathop{=}\limits^{\rm def}}\,Score_{{\mathcal L}_{\rm init}}(d) \sum_{d_h\in Supporters(d)} Score_{{\mathcal L}_{\rm help}}(d_h) sim(d_h,d). $$
(1)

Note that documents in \(\mathcal{L}_{\rm init}\) that appear in \(\mathcal{L}_{\rm help}\) receive self-support by virtue of self-similarity. We might want, however, to further reward these documents, as work on fusion of retrieved lists has demonstrated the merits of doing so (Fox and Shaw 1994; Lee 1997). Therefore, inspired by the CombMNZ fusion method (Fox and Shaw 1994; Lee 1997), we also consider the SimMNZRank algorithm, which doubles the score of d if it appears in \({{\mathcal{L}}_{\rm help}}\):

$$ Score_{\rm SimMNZRank}(d)\, {\mathop{=}\limits^{\rm def}}\, (\delta\left[d \in {\mathcal L}_{\rm init}\right] + \delta\left[d \in {\mathcal L}_{\rm help}\right]) Score_{\rm SimRank}(d). $$
(2)
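To make the scoring concrete, the following minimal Python sketch implements Definitions 1 and 2 and Eqs. (1) and (2). It is illustrative rather than the implementation used in our experiments: documents are represented simply by their IDs, sim is assumed to be any inter-document similarity function over them (the measure we actually use is described in Sect. 5), and score_init and score_help hold the (normalized) retrieval scores of the two lists.

```python
from typing import Callable, Dict, List

def neighbors(d_h: str, l_init: List[str],
              sim: Callable[[str, str], float], alpha: int) -> List[str]:
    """Definition 1: the alpha documents in L_init most similar to d_h;
    ties are broken by document ID."""
    return sorted(l_init, key=lambda d: (-sim(d_h, d), d))[:alpha]

def sim_rank_scores(l_init: List[str], l_help: List[str],
                    score_init: Dict[str, float], score_help: Dict[str, float],
                    sim: Callable[[str, str], float], alpha: int,
                    mnz: bool = False) -> Dict[str, float]:
    """Score documents in L_init by Eq. (1); with mnz=True, by Eq. (2)."""
    # Definition 2: the supporters of d are the documents in L_help that
    # have d among their alpha nearest neighbors in L_init.
    supporters: Dict[str, List[str]] = {d: [] for d in l_init}
    for d_h in l_help:
        for d in neighbors(d_h, l_init, sim, alpha):
            supporters[d].append(d_h)

    help_set = set(l_help)
    scores: Dict[str, float] = {}
    for d in l_init:
        support = sum(score_help.get(d_h, 0.0) * sim(d_h, d)
                      for d_h in supporters[d])
        s = score_init[d] * support
        if mnz:
            # SimMNZRank: double the score of documents that also appear
            # in L_help (delta[d in L_init] + delta[d in L_help]).
            s *= 1 + (1 if d in help_set else 0)
        scores[d] = s
    return scores
```

Re-ranking then amounts to sorting \({{\mathcal{L}}_{\rm init}}\) by the returned scores in descending order.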

3 Applications

The proposed methods can be used with any two lists that are retrieved in response to q. Indeed, in Sect. 5.1 we demonstrate their effectiveness when employed over TREC runs (Voorhees and Harman 2005).

In addition, we can use our methods to tackle the long-standing challenge of integrating the results of cluster-based retrieval with those of document-based retrieval (Griffiths et al. 1986), as we elaborate next.

3.1 Integrating document-based and cluster-based retrieval

One of the most common cluster-based retrieval paradigms follows van Rijsbergen’s cluster hypothesis (van Rijsbergen 1979), which states that “closely-associated documents tend to be relevant to the same requests”. The idea is to cluster the corpus into clusters of similar documents; then, at retrieval time, the constituent documents of the clusters most similar to the query are presented as results. The hypothesis is that these clusters contain a high percentage of relevant documents (Jardine and van Rijsbergen 1971; Croft 1980; Voorhees 1985; Griffiths et al. 1986). Griffiths et al. (1986) observed that the list of documents retrieved in this fashion yields the same retrieval effectiveness as that resulting from comparison of the query with documents—i.e., document-based retrieval. However, the overlap between the relevant document sets in the two lists was somewhat small.

This mismatch between sets of relevant documents was an important motivating factor in deriving our approach. Hence, in Sect. 5.2 we study the effectiveness of our methods in re-ranking the results of document-based retrieval (\({{\mathcal{L}}_{\rm init}}\)) using those of cluster-based retrieval (\({{\mathcal{L}}_{\rm help}}\)).

4 Related work

If we consider documents to be similar if and only if they are the same document, then our methods reduce to using the principle underlying most fusion methods; that is, rewarding documents that are highly ranked in many of the fused lists (Fox and Shaw 1994; Lee 1997; Vogt and Cottrell 1999; Croft 2000b; Dwork et al. 2001; Aslam and Montague 2001; Montague and Aslam 2002; Beitzel et al. 2004; Lillis et al. 2006; Shokouhi 2007). Yet, we note that our methods re-rank \(\mathcal{L}_{\rm init}\), while fusion methods produce result lists that can contain documents that are in \(\mathcal{L}_{\rm help}\) but not in \({{\mathcal{L}}_{\rm init}}\). In Sect. 5 we demonstrate the merits of our approach with respect to a highly effective fusion method, namely, CombMNZ (Fox and Shaw 1994; Lee 1997).

Traditional fusion methods use either the ranks of documents or their retrieval scores, but not the documents’ content (Fox and Shaw 1994; Vogt and Cottrell 1999; Croft 2000b; Dwork et al. 2001; Montague and Aslam 2002; Beitzel et al. 2004; Lillis et al. 2006). One reason for doing so is lack of (quick) access to the document content. We hasten to point out that our re-ranking methods need not necessarily compute inter-document similarities based on the entire document content. For example, snippets (i.e., document summaries) can potentially be used for computing inter-document similarities, as is the case, for instance, in some work on clustering the results returned by Web search engines (Zamir and Etzioni 1998). Snippets, and other document features, were also utilized in some fusion models (Craswell et al. 1999; Beitzel et al. 2005; Selvadurai 2007); in contrast to our methods, however, inter-document similarities were not exploited.

Recent work (Kozorovitzky and Kurland 2009) utilizes inter-document similarities for fusing several retrieved lists. Documents that are ranked high in many of the lists, and that are similar to many other documents that are ranked high in many of the lists, end up at high ranks of the final list. There are two important differences between this work and ours. The first is that while the proposed method in Kozorovitzky and Kurland (2009) is a symmetric fusion method, in that all lists to be fused are of the same importance, our methods could be viewed as asymmetric fusion techniques as they re-rank one list based on inter-document similarities with a second list. In fact, due to the nature of the approaches, it is not clear how to naturally transform the method in Kozorovitzky and Kurland (2009) into an asymmetric technique for re-ranking, or alternatively, transform our methods into symmetric fusion approaches.

The second important difference between the work in Kozorovitzky and Kurland (2009) and our methods lies in the actual techniques used for exploiting inter-document similarities. One of the strengths and key underlying ideas of the fusion method in Kozorovitzky and Kurland (2009) is the use of PageRank (Brin and Page 1998) so as to utilize a recursive definition of which documents are central to the entire set of fused lists. Our re-ranking methods, on the other hand, could be viewed as looking for central documents within \({{\mathcal{L}}_{\rm init}}\) by using a simple weighted in-degree criterion employed over a one-way bipartite graph wherein documents in \({{\mathcal{L}}_{\rm init}}\) are on one side, and documents in \({{\mathcal{L}}_{\rm help}}\) are on the other side; edges point from the latter list to the former. Hence, as in some previous work on re-ranking that uses cluster-based information (Kurland and Lee 2006), we could potentially use the Hubs and Authorities algorithm (Kleinberg 1997) over such a bipartite graph, having documents in \({{\mathcal{L}}_{\rm init}}\) serve as authorities and documents in \({{\mathcal{L}}_{\rm help}}\) serve as hubs. However, we leave this implementation for future work so as to focus on the basic question of whether \({{\mathcal{L}}_{\rm help}}\) can help to re-rank \({{\mathcal{L}}_{\rm init}}\) using simple techniques for exploiting inter-document similarities.

There is a large body of work on re-ranking a retrieved list using inter-document similarities within the list (e.g., Willett 1985; Liu and Croft 2004; Baliński and Daniłowicz 2005; Diaz 2005; Kurland and Lee 2005; Zhang et al. 2005; Kurland 2006; Kurland and Lee 2006; Liu and Croft 2006a; Yang et al. 2006; Diaz 2008; Liu and Croft 2008). As described in Sect. 2, our models could conceptually be viewed as a generalization of some of these approaches (e.g., Baliński and Daniłowicz 2005; Diaz 2005; Kurland and Lee 2005) to the two lists case.

Furthermore, our SimRank algorithm is reminiscent of a cluster-based re-ranking model wherein similarities between documents in the list and clusters of documents in the list are utilized (Kurland and Lee 2006). We demonstrate the merits of our methods with respect to a state-of-the-art cluster-based (one list) re-ranking method (Kurland 2006) in Sect. 5.2. We also note that inter-document, and more generally, inter-item, similarities were utilized in other applications; for example, cross-lingual retrieval (Diaz 2008), prediction of retrieval effectiveness (Diaz 2007a), and text summarization and clustering (Erkan and Radev 2004; Mihalcea and Tarau 2004; Erkan 2006).

We posed our re-ranking methods as a potential means for integrating document-based and cluster-based retrieval results. Some previous work (Liu and Croft 2004; Kurland and Lee 2004) has shown the effectiveness of cluster-based smoothing of document language models as a way of integrating document-based and cluster-based information. We demonstrate in Sect. 5.2 the merits of using our methods with respect to one such state-of-the-art approach (Kurland and Lee 2004).

5 Evaluation

We next explore the effectiveness (or lack thereof) of our re-ranking methods when employed to re-rank TREC runs; and, when used to integrate document-based and cluster-based retrieval.

Following some previous work on re-ranking (Kurland and Lee 2005; Kurland and Lee 2006), we use a language-model-based approach to estimate similarities between texts. Specifically, let \({p}_{z}^{Dir[\mu]}(\cdot)\) denote the unigram, Dirichlet-smoothed, language model induced from text z, where μ is the smoothing parameter (Zhai and Lafferty 2001). Unless otherwise specified, we set μ = 1000 following previous recommendations (Zhai and Lafferty 2001). We define the inter-text similarity estimate for texts x and y as follows:

$$ sim(x,y) \,{\mathop{=}\limits^{\rm def}}\, \exp\left(-D\left({p}^{Dir[0]}_{x}(\cdot) \,\middle\|\, {p}^{Dir[\mu]}_{y}(\cdot)\right)\right); $$

D is the KL divergence. The effectiveness of this estimate was demonstrated in previous work on utilizing inter-document similarities for re-ranking (Kurland and Lee 2005; Kurland and Lee 2006).
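For concreteness, the following Python sketch computes this estimate from term sequences. It assumes a background (collection) language model is available as a word-to-probability dictionary; the small floor on unseen collection probabilities is a convenience of the sketch, not part of the model, and the function names are ours.

```python
import math
from collections import Counter
from typing import Dict, List

def dirichlet_prob(w: str, tf: Counter, length: int,
                   collection_lm: Dict[str, float], mu: float) -> float:
    """p^{Dir[mu]}_z(w) for a text z with term counts tf and length |z|."""
    return (tf.get(w, 0) + mu * collection_lm.get(w, 1e-9)) / (length + mu)

def sim(x_terms: List[str], y_terms: List[str],
        collection_lm: Dict[str, float], mu: float = 1000.0) -> float:
    """sim(x, y) = exp(-D(p^{Dir[0]}_x || p^{Dir[mu]}_y))."""
    tf_x, tf_y = Counter(x_terms), Counter(y_terms)
    kl = 0.0
    for w, count in tf_x.items():
        p_x = count / len(x_terms)  # unsmoothed (maximum likelihood) model of x
        p_y = dirichlet_prob(w, tf_y, len(y_terms), collection_lm, mu)
        kl += p_x * math.log(p_x / p_y)
    return math.exp(-kl)
```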

We apply tokenization, Porter stemming, and stopword removal (using the INQUERY list) to all data using the Lemur toolkit, which is also used for language-model-based retrieval.

We posed our re-ranking methods as a means of improving precision at top ranks. Therefore, we use the precision of the top 5 and 10 documents (p@5, p@10), and the mean reciprocal rank of the first relevant document (MRR) as evaluation measures (Shah and Croft 2004; Voorhees and Harman 2005). We use the two-tailed paired t-test at a 95% confidence level to determine statistically-significant differences in performance between two retrieval methods (Sanderson and Zobel 2005; Smucker et al. 2007).

The main goal of the evaluation to follow is to study the potential merits of our novel similarity-based re-ranking paradigm in various setups. That is, we aim to study whether utilizing similarities between the list to be re-ranked and the second list can indeed yield improved performance. To that end, we start by neutralizing the effect of free-parameter values. Specifically, in Sects. 5.1 and 5.2 we set α, the ancestry parameter used by our methods (i.e., the number of nearest neighbors in Definition 1), to a value in {5, 10, 20, 30, 40, 50} that yields optimized (average) p@5 performance over the entire set of queries for a given setup and corpus. Accordingly, we set the free parameters of all the methods that we consider as reference comparisons to values that optimize p@5. Then, in Sect. 5.3 we present the performance numbers of our models, and those of the reference comparisons, when the values of the free parameters are learned across queries.

Efficiency considerations. Naturally, running a second retrieval so as to re-rank the originally retrieved list incurs computational overhead. However, for retrieval methods that depend on the occurrence of query terms in documents (e.g., the vector space model, or the query likelihood model in the language modeling approach (Song and Croft 1999)), the two retrievals could potentially be combined efficiently. Furthermore, retrieval systems that already depend on fusion of different retrieved lists (Croft 2000b) can benefit from employing our methods, as the resultant effectiveness is often improved, as we show next.

Our re-ranking methods require computing inter-document similarities. We first note that these are computed for documents at top ranks of the lists. In the experimental setups to follow, each list contains only a few dozen documents. Similar efficiency considerations were echoed in work on re-ranking a single list using information induced from inter-document similarities within the list (Baliński and Daniłowicz 2005; Diaz 2005; Kurland and Lee 2005), specifically, using clusters of documents in the list (Willett 1985; Liu and Croft 2004; Kurland and Lee 2006; Liu and Croft 2006a, b; Yang et al. 2006; Kurland 2008; Liu and Croft 2008). Furthermore, one can potentially compute inter-document similarities based on summaries (snippets) of documents rather than on the entire document content so as to alleviate the computational cost. Indeed, such practice was employed in work on clustering the results of Web search engines (Zamir and Etzioni 1998).

5.1 Re-ranking TREC runs

Our first order of business is to study the general effectiveness of our re-ranking approach. To that end, we re-rank TREC runs (Voorhees and Harman 2005)—i.e., document lists submitted by TREC participants in response to a query.

For experiments we use the ad hoc track of trec3, the Web tracks of trec9 and trec10, and the robust track of trec12. Some of these tracks were used in work on fusion (e.g., Aslam and Montague 2001; Montague and Aslam 2002). Details of the tracks are given in Table 1.

Table 1 TREC data used for experiments

We employ our methods with three different approaches for selecting the run to be re-ranked (\({{\mathcal{L}}_{\rm init}}\)), and the second run used for re-ranking (\({{\mathcal{L}}_{\rm help}}\)). Namely, we use (1) the two best-performing runs in a track (Sect. 5.1.1), (2) two runs in a track with median-level performance (Sect. 5.1.2), and (3) two randomly selected runs from a track (Sect. 5.1.3). The runs are chosen from all submitted runs in a track—i.e., those available on the TREC website, which include both manual and automatic runs. In all cases, \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) contain the top-50 ranked documents of the selected runs, following previous findings that utilizing inter-document similarities for re-ranking is most effective when relatively short lists are used (Diaz 2005; Kurland and Lee 2005).

To study the importance of utilizing inter-document similarities between documents in \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\), we use two fusion approaches as reference comparisons to our re-ranking methods. The first is CombMNZ (Fox and Shaw 1994; Lee 1997), which is considered a highly effective fusion method. CombMNZ scores document d in \({{\mathcal{L}}_{\rm init} \cup {\mathcal{L}}_{\rm help}}\) with \({({\delta \left[d \in {\mathcal{L}}_{\rm init}\right]} + {\delta \left[{d \in {\mathcal{L}}_{\rm help}}\right]}) ({Score_{{\mathcal{L}}_{\rm init}}(d)}+{Score_{{\mathcal{L}}_{\rm help}}({d})})}\). The second reference comparison, CombMult, could conceptually be regarded as a special case of SimRank wherein two documents are deemed similar if and only if they are the same document. Specifically, CombMult scores d in \({{\mathcal{L}}_{\rm init} \cup {\mathcal{L}}_{\rm help}}\) by \({{Score_{{\mathcal{L}}_{\rm init}}({d})} \cdot {Score_{{\mathcal{L}}_{\rm help}}(d)}}\); to avoid zero-score multiplication, we define (only for CombMult) \( Score_{\mathcal L}(d)\, {\mathop{=}\limits^{\rm def}} \min\nolimits_{d^{\prime} \in {\mathcal L}}\, Score_{\mathcal L}(d^{\prime}) \) where \(\mathcal{L}\) is one of the two lists \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\), and d appears in the other list but not in \(\mathcal{L}\). It is also important to note that CombMNZ and CombMult can use documents from both \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) in the final resultant list, while our re-ranking methods can only use documents from \({{\mathcal{L}}_{\rm init}}\).

We normalize retrieval scores for compatibility. While this is not required by our methods, except for cases of negative retrieval scores, it is crucial for the CombMNZ method that we use as a reference comparison. Specifically, let \({{s_{\mathcal{L}}(d)}}\) be d’s original retrieval score in the run that was used to create \(\mathcal{L}\); we set \({Score_{\mathcal{L}}(d)}\), the retrieval score used by our methods and the reference comparisons, to \({{\frac{{s_{\mathcal{L}}(d)}- {\rm min}_{d' \in \mathcal{L}}{s_{\mathcal{L}}(d')}}{{\max}_{d' \in \mathcal{L}}{s_{\mathcal{L}}(d')} - {\rm min}_{d' \in \mathcal{L}}{s_{{\mathcal{L}}}(d')}}}}\)—i.e., we use min-max normalization (Lee 1997).
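The following sketch spells out the normalization and the two fusion baselines. It assumes each list's scores are given as a dictionary from document IDs to retrieval scores (already min-max normalized per list for the fusion functions); the function names are ours.

```python
from typing import Dict

def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalization of a list's retrieval scores."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a constant-score list
    return {d: (s - lo) / span for d, s in scores.items()}

def comb_mnz(s_init: Dict[str, float], s_help: Dict[str, float]) -> Dict[str, float]:
    """CombMNZ: (# lists containing d) * (sum of d's normalized scores)."""
    return {d: ((d in s_init) + (d in s_help)) *
               (s_init.get(d, 0.0) + s_help.get(d, 0.0))
            for d in set(s_init) | set(s_help)}

def comb_mult(s_init: Dict[str, float], s_help: Dict[str, float]) -> Dict[str, float]:
    """CombMult: product of the two scores; a document absent from a list
    takes that list's minimal score so as to avoid multiplying by zero."""
    floor_init, floor_help = min(s_init.values()), min(s_help.values())
    return {d: s_init.get(d, floor_init) * s_help.get(d, floor_help)
            for d in set(s_init) | set(s_help)}
```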

5.1.1 Best runs

In the first TREC-based experimental setup we set \({{\mathcal{L}}_{\rm init}}\), the list upon which re-ranking is performed, to the top-k documents in the run that yields the best p@5—the evaluation measure for which we optimize performance—among all submitted runs. The list \({{\mathcal{L}}_{\rm help}}\), which is used to re-rank \({{\mathcal{L}}_{\rm init}}\), is set to the k highest ranked documents in the run that yields the second-best p@5; k = 50 as noted above.

It is important to note that the experimental setting just described does not reflect a real-life retrieval scenario, as no relevance judgments are usually available, and hence, one cannot (usually) determine, but rather only potentially estimate (Carmel and Yom-Tov 2010), a priori which retrievals would be of higher quality than others. However, this setting enables us to study the potential of our approach, as we aim to improve performance over that of a highly effective retrieval. Furthermore, since previous work has shown that highly effective retrievals often produce lists containing different relevant documents (Soboroff et al. 2001)—a scenario motivating the development of our methods—we have chosen the second-best run to serve as \({{\mathcal{L}}_{\rm help}}\).

The performance numbers of our methods are presented in Table 2. Our first observation is that both SimRank and SimMNZRank are in general quite effective in re-ranking \({{\mathcal{L}}_{\rm init}}\). Specifically, in all relevant comparisons (track × evaluation measure), except for those of trec9, their performance is better—in many cases to a substantial extent and to a statistically significant degree—than that of the best run used to create \({{\mathcal{L}}_{\rm init}}\). For trec9, the performance of both methods is lower than that of the best run, although not to a statistically significant degree. The substantial, and statistically significant, performance differences between the best run and the second-best run for trec9 attest to the challenge of improving on the performance of the former.

Table 2 Re-ranking (using SimRank and SimMNZRank) the best-p@5 run in a track by utilizing the second-best-p@5 run

Another observation that we make based on Table 2 is that the performance of SimRank and SimMNZRank is superior in most relevant comparisons, and often, to a statistically significant degree, to that of the second-best run that was used to create \({{\mathcal{L}}_{\rm help}}\).

In comparing SimRank with SimMNZRank we see that the latter posts performance that is superior in most relevant comparisons to that of the former. Thus, rewarding documents in \({{\mathcal{L}}_{\rm init}}\) that also appear in \({{\mathcal{L}}_{\rm help}}\), as is done by SimMNZRank as opposed to SimRank, has a positive impact on re-ranking performance.

We can also see in Table 2 that SimRank and SimMNZRank post performance that is in most relevant comparisons (track × evaluation measure) superior to that of the CombMult fusion method. Furthermore, their performance also transcends that of CombMNZ in most comparisons for trec3, trec9 and trec10; for trec12, CombMNZ posts better performance. The performance differences between our methods and CombMNZ, however, are not statistically significant. No less important is the fact that while both fusion methods (CombMult and CombMNZ) rarely post statistically significant performance improvements over the best run, each of our re-ranking methods does so in about half of the relevant comparisons (track × evaluation measure). Thus, we conclude that there is merit in utilizing inter-document similarities between \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\).

Flipping roles. The performance results above are for re-ranking the best-p@5 run using the second-best-p@5 run. We now explore the effect of flipping the roles of the two runs, that is, re-ranking the second-best-p@5 run (\({{\mathcal{L}}_{\rm init}}\)) using the best-p@5 run (\({{\mathcal{L}}_{\rm help}}\)). The performance numbers are presented in Table 3.

Table 3 Re-ranking (using SimRank and SimMNZRank) the second-best-p@5 run in a track by utilizing the best-p@5 run

We can see in Table 3 that re-ranking the second-best-p@5 run using our methods always results in performance that is much better than the original performance of the run; the performance improvements are also statistically significant in a vast majority of the cases. Furthermore, except for trec9, the performance of our methods often transcends that of the best-p@5 run that was used for re-ranking.

In addition, we see in Table 3 that our methods tend to outperform the CombMult fusion method. However, except for trec3, our methods are outperformed by CombMNZ. We attribute this finding to the fact that our methods re-rank the second-best-p@5 run, which is often substantially less effective than the best-p@5 run, while CombMNZ uses documents from both runs. A case in point: for trec9—for which the performance differences between the best and second-best runs are huge—the performance differences between CombMNZ and our methods are quite substantial. Additional experiments—the results of which are omitted to avoid cluttering the presentation—show that for trec10 and trec12, our methods post performance comparable to that of a variant of CombMNZ that uses only documents from the second-best-p@5 run in the final result list. We return to the trec9 case in Sect. 5.1.3.

5.1.2 Median runs

Our focus above was on using the two best-performing runs in a track. We now turn to study the effectiveness of our methods when using runs that have median-level performance. We let our methods re-rank the run with the minimal p@5 performance that is above the median p@5 performance for a track (\({{\mathcal{L}}_{\rm init}}\)) using the run with the maximal p@5 performance that is below the median (\({{\mathcal{L}}_{\rm help}}\)). The performance results, along with those of the fusion methods, are presented in Table 4.

Table 4 Re-ranking the run with the minimal p@5 performance that is above the median for a track (\({{\mathcal{L}}_{\rm init}}\)) using the run with the maximal p@5 performance that is below the median (\({{\mathcal{L}}_{\rm help}}\))

We can clearly see in Table 4 that re-ranking \({{\mathcal{L}}_{\rm init}}\) using \({{\mathcal{L}}_{\rm help}}\) yields performance that is consistently (and considerably) better than that of each of the two lists. Furthermore, our SimMNZRank method posts performance that is superior to that of the fusion methods (CombMNZ and CombMult) in a majority of the relevant comparisons; specifically, the improvements for trec10 and trec12 are substantial and often statistically significant. Moreover, SimMNZRank posts more statistically significant improvements over \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) than the fusion methods do. These findings further attest to the benefit of utilizing inter-document similarities between the two lists.

5.1.3 Randomly selected runs

The selection of runs above was based on their performance. As already stated, in reality there is rarely information about the quality of a run, as no relevance judgments are available. While in such cases one can potentially employ performance-prediction methods (Carmel and Yom-Tov 2010), this is out of the scope of this paper. We therefore turn to examine a realistic retrieval scenario wherein no a priori knowledge (or prediction) of the quality of the runs to be used is available. Specifically, the runs used to create the lists \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) are both randomly selected from all those available for a track. In Table 5 we present the average performance numbers of our methods, and those of the reference comparisons, when randomly sampling 30 pairs of runs. Statistically significant differences (or lack thereof) between two methods are determined with respect to the average (over 30 samples) performance per query.

Table 5 Re-ranking a randomly selected run (\({{\mathcal{L}}_{\rm init}}\)) using a second randomly selected run (\({{\mathcal{L}}_{\rm help}}\))

We can see in Table 5 that both SimRank and SimMNZRank post performance that is better to a substantial extent, and to a statistically significant degree, in all reference comparisons than that of the list to be re-ranked (\({{\mathcal{L}}_{\rm init}}\)), and that of the list used for re-ranking (\({{\mathcal{L}}_{\rm help}}\)).

Another observation that we make based on Table 5 is that our models, SimRank and SimMNZRank, outperform both fusion methods (CombMult and CombMNZ) in almost all reference comparisons; furthermore, the vast majority of these improvements are also statistically significant. These findings further support the merits of utilizing inter-document similarities between the two lists.

A note on trec9. It can be seen in Table 5 that the (average) performance difference between \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) is much larger for trec9 than for the other tracks. As it turns out, some of the runs submitted for trec9 are of very poor quality. Furthermore, some of these poor-performing runs were selected—as part of the 30 sampled pairs—to serve as the second run in a pair (\({{\mathcal{L}}_{\rm help}}\)), while the first run in the pair (\({{\mathcal{L}}_{\rm init}}\)) could be of much higher quality. For example, there are pairs of runs for which the p@5 performance of (\({{\mathcal{L}}_{\rm init},\,{\mathcal{L}}_{\rm help}}\)) is: (35.2,6.0), (40.8,2.4), and (43.6,7.6). Since the opposite case, in which the second run is of much higher quality than the first, occurred in very few of the 30 sampled pairs, we also considered the following experimental setup. We took each of the 30 sampled pairs, and flipped the roles of \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\). Then, we employed both our methods and the fusion approaches. The resultant performance is presented in Table 6.

Table 6 Flipping \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) in the “random-selection of TREC runs” setup from Table 5

We can see in Table 6 that our methods outperform the ranking of the list upon which re-ranking is performed (\({{\mathcal{L}}_{\rm init}}\)) in a statistically significant manner. We can also see in Table 6 that the performance of the fusion methods is somewhat better than that of our methods. However, these relative improvements—especially those for p@5, the metric for which performance was optimized—are much smaller than those posted by our methods over the fusion approaches in Table 5.

To further explore the trec9 case, we sampled 70 additional pairs of runs. We use these 70 pairs, along with the 30 pairs for which the performance was presented in Table 6, for evaluation. The resultant performance numbers are presented in Table 7. We can see that our methods not only outperform the ranking of the list upon which re-ranking is performed in a substantial and statistically significant manner, but they also outperform the fusion methods in most relevant comparisons; in many cases, these improvements are also statistically significant.

Table 7 Using 100 randomly selected pairs of runs for evaluation over trec9 (The 30 pairs used in Table 6 plus 70 additional pairs)

All in all, we showed that our methods are highly effective in re-ranking a randomly selected TREC run using a second randomly selected run. Specifically, the methods are much more effective on average than the fusion methods. We also showed that when the list to be re-ranked is of very low quality and the list used for re-ranking is of reasonably high quality, our methods underperform the fusion methods, but are still effective for re-ranking.

5.1.4 Summary of the performance results for TREC runs

We have used our methods with four different TREC settings, that is, re-ranking (1) the best-performing run using the second-best performing run, (2) the second-best performing run using the best-performing run, (3) a median-level performing run using another median-level performing run, and (4) a randomly selected run using another randomly selected run; specifically, we used 30 randomly selected pairs. For p@5 (the metric for which performance was optimized), there are 16 cases of relevant comparisons (4 settings × 4 tracks per setting) of methods in Tables 2, 3, 4, and 5. Using these cases to contrast the p@5 performance of SimMNZRank (our better performing model) with that of \({{\mathcal{L}}_{\rm init}}\) (the list upon which re-ranking is performed) and that of CombMNZ and CombMult (the fusion methods), we can see that:

  • SimMNZRank posts performance that is better than that of \({{\mathcal{L}}_{\rm init}}\) in 94% of the cases; 80% of these improvements are also statistically significant. The average relative improvement over \({{\mathcal{L}}_{\rm init}}\)’s ranking (including performance degradations) is 14.1%. While CombMNZ also improves on \({{\mathcal{L}}_{\rm init}}\)’s ranking in 94% of the cases, only 53% of these improvements are statistically significant. Furthermore, the average performance improvement of CombMNZ over \({{\mathcal{L}}_{\rm init}}\) is only 10.3%. The performance of CombMult is better than that of \({{\mathcal{L}}_{\rm init}}\)’s ranking in 88% of the cases; 57% of these improvements are statistically significant. The average relative performance improvement of CombMult over \({{\mathcal{L}}_{\rm init}}\) is only 7.8%.

  • SimMNZRank outperforms CombMNZ in 75% of the cases; 58% of these improvements are also statistically significant. On the other hand, CombMNZ never outperforms SimMNZRank in a statistically significant manner with respect to p@5.

  • SimMNZRank outperforms CombMult in 84% of the cases; 40% of these improvements are also statistically significant. CombMult never outperforms SimMNZRank in a statistically significant manner with respect to p@5.

All in all, these findings attest to the effectiveness of our approach. Specifically, the performance comparison with the fusion methods attests to the merits of utilizing inter-document similarities between the lists. We also showed that our methods are relatively less effective—yet, still highly effective for re-ranking—when the list to be re-ranked (\({{\mathcal{L}}_{\rm init}}\)) is of much worse quality than the list used for re-ranking (\({{\mathcal{L}}_{\rm help}}\)). (Refer back to Sects. 5.1.1 and 5.1.3.) Thus, applying performance-prediction methods (Carmel and Yom-Tov 2010) to potentially identify which of two given lists is of higher quality, so as to have this list serve as \({{\mathcal{L}}_{\rm init}}\) (the list to be re-ranked), is an interesting future avenue we intend to explore.

5.2 Integrating document-based and cluster-based retrieval

We next study the effectiveness of our methods in re-ranking a list retrieved by a standard document-based approach (\({{\mathcal{L}}_{\rm init}}\)) using a list retrieved by a cluster-based method (\({{\mathcal{L}}_{\rm help}}\)).

To create \({{\mathcal{L}}_{\rm init}}\), we use the standard KL-retrieval approach (Lafferty and Zhai 2001), denoted DocRet. We set \({{\mathcal{L}}_{\rm init}}\) to the 50 documents d in the corpus that yield the highest sim(q, d). The document language model smoothing parameter, μ, is set to 1000 as mentioned above. Naturally, then, we use the same retrieval method with μ optimized for p@5—the measure for which we optimize the performance of our re-ranking methods—as a reference comparison, denoted DocRetOpt.

To create \({{\mathcal{L}}_{\rm help}}\), we first cluster the corpus into static (offline) nearest-neighbor overlapping clusters, which were shown to be highly effective for cluster-based retrieval (Griffiths et al. 1986; Kurland and Lee 2004). Specifically, for each document d we define a cluster that contains d and its α − 1 nearest neighbors d′ (\(d' \not = d\)), which are determined by sim(d, d′); we set α = 10, as such small nearest-neighbor clusters are known to yield effective retrieval performance (Griffiths et al. 1986; Kurland and Lee 2004; Tao et al. 2006). As is common in work on cluster-based retrieval in the language modeling framework (Liu and Croft 2004; Kurland and Lee 2004; Kurland and Lee 2006), we represent cluster c by the big document that results from concatenating its constituent documents. The order of concatenation has no effect since we only use unigram language models that assume term independence.
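As an illustration, the following sketch constructs such static overlapping nearest-neighbor clusters. The names doc_texts and cluster_size are ours, sim is the inter-text similarity measure from Sect. 5, and breaking ties by document ID is an arbitrary choice of the sketch.

```python
from typing import Callable, Dict, List, Set

def nn_clusters(doc_texts: Dict[str, str],
                sim: Callable[[str, str], float],
                cluster_size: int = 10) -> List[Set[str]]:
    """One overlapping cluster per document: the document plus its
    cluster_size - 1 nearest neighbors under sim (ties by document ID)."""
    clusters: List[Set[str]] = []
    for d, d_text in doc_texts.items():
        others = [d2 for d2 in doc_texts if d2 != d]
        nearest = sorted(others,
                         key=lambda d2: (-sim(d_text, doc_texts[d2]), d2))
        clusters.append({d, *nearest[:cluster_size - 1]})
    return clusters
```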

To exploit the overlap of clusters, we use the effective Bag-Select retrieval algorithm (Kurland and Lee 2004), referred to here as ClustRet. Specifically, let TopRetClust(q) be the set of 50 clusters c that yield the highest sim(q, c). The score of document d is \( Score_{\rm ClustRet}(d)\, {\mathop{=}\limits^{\rm def}}\, sim(q,d) \cdot \#\{c \in TopRetClust(q): d \in c\} \). Thus, d is ranked high if it is a member of many top-retrieved clusters and if it exhibits high query similarity. Finally, we set \({{\mathcal{L}}_{\rm help}}\) to the list of documents that are members of the clusters in TopRetClust(q); the documents are ordered by their ClustRet-assigned scores.
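A sketch of this Bag-Select scoring follows; cluster_texts holds the concatenated ("big document") representation of each cluster, doc_texts maps document IDs to their texts, and all names are illustrative.

```python
from typing import Callable, Dict, List, Set

def clust_ret_scores(query: str,
                     clusters: List[Set[str]],
                     cluster_texts: List[str],
                     doc_texts: Dict[str, str],
                     sim: Callable[[str, str], float],
                     n_top_clusters: int = 50) -> Dict[str, float]:
    """Bag-Select (ClustRet): sim(q, d) times the number of top-retrieved
    clusters containing d; only members of those clusters are scored."""
    # TopRetClust(q): the clusters whose big-document representation is
    # most similar to the query.
    top = sorted(range(len(clusters)),
                 key=lambda i: sim(query, cluster_texts[i]),
                 reverse=True)[:n_top_clusters]
    counts: Dict[str, int] = {}
    for i in top:
        for d in clusters[i]:
            counts[d] = counts.get(d, 0) + 1
    return {d: sim(query, doc_texts[d]) * c for d, c in counts.items()}
```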

As above, we use the CombMNZ fusion method—employed over the lists \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\) using normalized retrieval scores—as a reference comparison.

We also consider two additional reference comparisons that utilize a state-of-the-art cluster-based retrieval approach, namely, the interpolation algorithm (Kurland and Lee 2004). Specifically, given a set of clusters S, a document d that belongs to at least one cluster in S is assigned the score \(\lambda\, sim(q,d) + (1 - \lambda) \sum_{c \in S} sim(q,c)\, sim(c,d)\); λ is a free parameter. The interpolation algorithm was shown to be highly effective for re-ranking the cluster-based retrieved list \({{\mathcal{L}}_{\rm help}}\) (i.e., setting \( S\,{\mathop{=}\limits^{\rm def}}\, TopRetClust(q) \)) (Kurland and Lee 2004); we use Interp(stat) to denote this implementation. The interpolation algorithm was also shown to yield state-of-the-art performance in re-ranking a list retrieved by document-based retrieval (\({{\mathcal{L}}_{\rm init}}\)) (Kurland 2006). Specifically, this implementation, denoted Interp(dyn), uses 50 dynamic, query-specific nearest-neighbor clusters (of 10 documents each) that are created from documents in \({{\mathcal{L}}_{\rm init}}\) (Kurland 2006). For both Interp(stat) and Interp(dyn), we set λ to a value in \(\{{0,0.1,\ldots,1}\}\) so as to optimize p@5.
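The interpolation score can be sketched as follows; cluster_texts and cluster_members describe the cluster set S, and the names are ours.

```python
from typing import Callable, Dict, List, Set

def interp_scores(query: str,
                  cluster_texts: List[str],
                  cluster_members: List[Set[str]],
                  doc_texts: Dict[str, str],
                  sim: Callable[[str, str], float],
                  lam: float) -> Dict[str, float]:
    """Score every document belonging to at least one cluster in S with
    lam * sim(q, d) + (1 - lam) * sum_{c in S} sim(q, c) * sim(c, d)."""
    docs = set().union(*cluster_members) if cluster_members else set()
    q_c = [sim(query, c_text) for c_text in cluster_texts]  # cache sim(q, c)
    scores: Dict[str, float] = {}
    for d in docs:
        d_text = doc_texts[d]
        cluster_part = sum(q_c[i] * sim(c_text, d_text)
                           for i, c_text in enumerate(cluster_texts))
        scores[d] = lam * sim(query, d_text) + (1 - lam) * cluster_part
    return scores
```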

To further study the importance of using the cluster-based retrieved list for re-ranking the document-based retrieved list, we also experiment with an implementation of SimRank, denoted SimRankNoHelp, wherein \( {\mathcal L}_{\rm help}\, {\mathop{=}\limits^{\rm def}}\, {\mathcal L}_{\rm init} \)—i.e., inter-document similarities within \({{\mathcal{L}}_{\rm init}}\) are used for its re-ranking. The value of α is chosen, as above, to optimize p@5.

Note that while the reference comparisons Interp(dyn) and SimRankNoHelp re-rank \({{\mathcal{L}}_{\rm init}}\) based on inter-document similarity information induced only from \({{\mathcal{L}}_{\rm init}}\), and Interp(stat) re-ranks \({{\mathcal{L}}_{\rm help}}\) based on inter-document-similarities induced only from \({{\mathcal{L}}_{\rm help}}\), our methods (SimRank and SimMNZRank) re-rank \({{\mathcal{L}}_{\rm init}}\) using inter-document-similarities between \({{\mathcal{L}}_{\rm init}}\) and \({{\mathcal{L}}_{\rm help}}\).

Since clustering the Web corpora from trec9 and trec10 is computationally demanding, we used for the experiments here the four TREC corpora specified in Table 8. These corpora were used in some previous work on re-ranking (Liu and Croft 2004; Kurland and Lee 2005; Kurland and Lee 2006). For queries we used the titles of TREC topics. Pre-processing of all data was performed as described above for the TREC runs.

Table 8 TREC data used for experiments with integrating cluster-based and document-based retrieval

The performance numbers of our re-ranking methods and those of the reference comparisons are presented in Table 9. Our first observation is that the performance of SimRank and SimMNZRank is better in most relevant comparisons—often to a statistically significant degree—than that of the document-based retrieval (DocRet) that was used to create the list \({{\mathcal{L}}_{\rm init}}\) upon which re-ranking is performed. Furthermore, the performance of each of our methods is also consistently better than that of the optimized document-based retrieval method (DocRetOpt). In addition, both re-ranking methods post performance that is better in almost all relevant comparisons than that of the cluster-based retrieval method (ClustRet) that was used to create \({{\mathcal{L}}_{\rm help}}\).

Table 9 Performance numbers of SimRank and SimMNZRank when re-ranking a list (\({{\mathcal{L}}_{\rm init}}\)) retrieved by a document-based approach (DocRet) using a list (\({{\mathcal{L}}_{\rm help}}\)) retrieved by a cluster-based approach (ClustRet)

We can also see in Table 9 that the re-ranking methods (especially SimMNZRank) post performance that is superior in a majority of the relevant comparisons (sometimes to a statistically significant degree) to that of the CombMNZ fusion method. Hence, inter-document similarities induced between the lists seem to be a helpful source of information for re-ranking as was the case for the TREC runs.

Another observation that we make based on Table 9 is that SimMNZRank is superior in most relevant comparisons to the interpolation algorithms. While only a few of these performance differences are statistically significant (see the ROBUST case), SimMNZRank posts more statistically significant performance improvements over DocRet and DocRetOpt than the interpolation algorithms do. The performance improvements posted by SimMNZRank with respect to Interp(dyn) attest to the benefit in using the cluster-based retrieved list (\({{\mathcal{L}}_{\rm help}}\)) to re-rank \({{\mathcal{L}}_{\rm init}}\)—recall that Interp(dyn) is a re-ranking method that uses information only within \({{\mathcal{L}}_{\rm init}}\). This finding is further supported by the fact that in a majority of the relevant comparisons, both SimRank and SimMNZRank post better performance than that of SimRankNoHelp that uses inter-document-similarities only within \({{\mathcal{L}}_{\rm init}}\).

5.3 Setting free-parameter values

Heretofore, the free parameter that our methods incorporate (α), and the free parameters of the various reference comparisons, were set to values that optimize the average p@5 over the entire set of given queries per corpus (track). This practice enabled us to study the potential effectiveness of our approach—specifically, the merits of using inter-document similarities between the lists—with respect to that of the various reference comparisons while neutralizing the effect of free-parameter values. Now, we turn to study the question of whether effective values of the free parameter α, and those of the free parameters employed by the reference comparisons, generalize across queries.

We use a cross-validation train/test procedure performed over queries so as to set the values of free parameters. We randomly split the queries in each corpus (track) into two sets of equal size. One set (the “train set”) is used to determine the optimal value of free parameters using exhaustive search and relevance judgments (p@5 is the measure for which performance is optimized); the second query set (the “test set”) is used for testing the performance with the determined values of the free parameters. We then flip the roles of the two query sets. Thus, the performance for each query in a corpus (track) is based on free-parameter values that were determined using other queries. Since using a single random split might result in performance that is not representative due to query variability issues, especially in light of the fact that there is a relatively small number of queries in each corpus (track), we use five random splits and report the average resultant performance.
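The following sketch illustrates this procedure for a single free parameter; mean_p5(queries, value) is an assumed evaluation routine that returns the average p@5 over the given queries when the parameter is set to value, and the function name is ours.

```python
import random
from typing import Callable, List, Sequence

def two_fold_cv_p5(queries: List[str],
                   param_grid: Sequence,
                   mean_p5: Callable[[Sequence[str], object], float],
                   n_splits: int = 5,
                   seed: int = 0) -> float:
    """Average test-set p@5 over n_splits random two-fold splits; on each
    fold, the parameter value is the one maximizing p@5 on the other fold."""
    rng = random.Random(seed)
    split_means = []
    for _ in range(n_splits):
        shuffled = queries[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        fold_a, fold_b = shuffled[:half], shuffled[half:]
        fold_scores = []
        for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
            best = max(param_grid, key=lambda v: mean_p5(train, v))
            fold_scores.append(mean_p5(test, best))
        split_means.append(sum(fold_scores) / len(fold_scores))
    return sum(split_means) / n_splits
```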

5.3.1 Re-ranking TREC runs

We next study the performance of our methods when re-ranking TREC runs, using cross validation (as described above) to set the value of the free parameter α. We use the setup of random selection of TREC runs that was described in Sect. 5.1.3. As the other TREC setups used above are based on knowing in advance the relative performance of runs, this setup is the most “practical” among all those considered. Furthermore, setting the free-parameter value using cross validation results in a realistic retrieval scenario. Recall that the reported performance is an average over 30 random selections of pairs of runs. Statistical significance of performance differences is determined with respect to the average performance per query. The performance numbers are presented in Table 10.

Table 10 Re-ranking a randomly selected run (\({{\mathcal{L}}_{\rm init}}\)) using a second randomly selected run (\({{\mathcal{L}}_{\rm help}}\)) when the value of the free parameter of our methods (α) is set using \({\bf \underline{cross\;validation}}\)

We can see in Table 10 that our methods substantially outperform the initial ranking of the list \({{\mathcal{L}}_{\rm init}}\) upon which re-ranking is performed in all relevant comparisons; these improvements are also statistically significant. Furthermore, the performance of our methods is better in all relevant comparisons—to a statistically significant degree—than that of the second list (\({{\mathcal{L}}_{\rm help}}\)) that was used for re-ranking.

Another observation that we make based on Table 10 is that our methods consistently outperform the fusion methods (CombMNZ and CombMult); most of the performance improvements are also statistically significant. (In fact, all improvements for p@5—the metric on which optimization in the learning phase is based—are statistically significant.)

Thus, we see that our methods are highly effective in re-ranking when the value of the free parameter (α) is set using a held-out query set. Furthermore, the improvements over the fusion methods attest, again, to the merit of utilizing inter-document similarities between the two lists.

5.3.2 Integrating document-based and cluster-based retrieval

Table 11 presents the performance numbers of our methods when re-ranking the results of document-based retrieval using those of cluster-based retrieval. The experimental setup details are those described in Sect. 5.2, except for using the cross-validation procedure described above for setting free-parameter values: (1) the free parameter α, which is incorporated by our methods and by the reference comparison SimRankNoHelp, (2) the free parameter λ on which the reference comparisons Interp(stat) and Interp(dyn) depend, and (3) the document language model smoothing parameter (μ) used by DocRetOpt—the optimized document-based retrieval. As noted above, the performance numbers of the methods that incorporate these free parameters are averages over 5 random train/test splits.

Table 11 Performance numbers of SimRank and SimMNZRank when re-ranking a list (\(\mathcal{L}_{\rm init}\)) retrieved by a document-based approach (DocRet) using a list (\(\mathcal{L}_{\rm help}\)) retrieved by a cluster-based approach (ClustRet)

We can see in Table 11 that both our methods (SimRank and SimMNZRank) are more effective in most relevant comparisons—often, to a substantial extent and to a statistically significant degree—than the document-based retrieval used to create the list to be re-ranked (DocRet). Our methods are also more effective than the cluster-based retrieval used for re-ranking (ClustRet) in most relevant comparisons.

We can also see in Table 11 that our methods outperform CombMNZ in a vast majority of the relevant comparisons. The improvements posted by SimMNZRank over CombMNZ are often statistically significant. These findings further demonstrate the merits of utilizing inter-document similarities between the two lists.

Another observation that we make based on Table 11 is that our better performing model, SimMNZRank, outperforms Interp(dyn) and SimRankNoHelp—both of which utilize inter-document similarities only within the list to be re-ranked—in a majority of the relevant comparisons, although not to a statistically significant degree. Furthermore, SimMNZRank posts more statistically significant improvements over the document-based ranking (DocRet), the cluster-based ranking (ClustRet), and the optimal document-based ranking (DocRetOpt) than Interp(dyn) and SimRankNoHelp do. Recall that Interp(dyn) is a state-of-the-art re-ranking method. Hence, these findings further attest to the merits of utilizing inter-document-similarities with a second retrieved list.

All in all, the findings above show that the value of the free parameter (α) incorporated by our methods can be effectively set using a held-out query set, resulting in highly effective re-ranking approaches.

6 Conclusions and future work

We have addressed the challenge of attaining high precision at top ranks in the ad hoc retrieval task. Our approach is based on re-ranking an initially retrieved list using information induced from a second list that was retrieved in response to the same query. The second list could be produced by using, for example, a different query representation and/or retrieval model. While traditional methods for fusing retrieved lists are based on retrieval scores and ranks of documents in the lists, our re-ranking models utilize an additional source of information, namely, inter-document-similarities. More specifically, documents in the list to be re-ranked that are similar to documents highly ranked in the second list are rewarded.

The first task in which we employed our methods was re-ranking a TREC run using a second run. We have shown that our methods are very effective with several ways of selecting the pair of runs: (1) re-ranking the best performing run using the second-best performing run (and vice versa), (2) re-ranking a median-level performing run using a second median-level performing run, and (3) re-ranking a randomly selected run using a second randomly selected run. Our methods also often outperform fusion methods that utilize only retrieval scores. This finding further attests to the merits of utilizing inter-document-similarities between the lists.

The second task in which we employed our re-ranking methods was that of integrating document-based and cluster-based retrieval results—a long-standing challenge. We demonstrated the merits of re-ranking the results of document-based retrieval using those of cluster-based retrieval. Furthermore, we showed that our approach yields better performance than that of a state-of-the-art cluster-based re-ranking method that utilizes information only from the list to be re-ranked. This finding further supported the merits of using information induced from a second retrieved list.

Our re-ranking approach requires a second retrieval on top of the initial retrieval, and hence, there is some computational overhead incurred. On the other hand, we showed that the resultant effectiveness improvements over the initial retrieval, and over the second retrieval, are quite substantial and are often also statistically significant. Furthermore, if a retrieval system already depends on fusion, then our approach often improves on standard fusion.

For future work we plan to explore more sophisticated methods for utilizing inter-document-similarities between the lists—e.g., graph-based methods such as PageRank (Brin and Page 1998) and HITS (hubs and authorities) (Kleinberg 1997) when employed over links induced by inter-document similarities (Kurland and Lee 2005; Kurland and Lee 2006).

Another future avenue that emerged in the evaluation presented above is list selection. That is, given two retrieved lists of significantly different quality (effectiveness), our methods are relatively more effective when re-ranking the more effective list using the second list than vice versa. Indeed, when re-ranking a low-quality list using a list of much higher quality, the performance of our approach is often inferior to that of fusion methods; yet, our methods are still effective for re-ranking. Thus, adapting query-performance predictors (Carmel and Yom-Tov 2010) for estimating which of two retrieved lists is of higher quality—on a per-query basis—so as to have it serve as the list upon which re-ranking is performed, is an avenue we intend to explore. Among the challenges involved in pursuing this goal is the fact that we might not have knowledge of the underlying retrieval methods used to produce the lists, and most query-performance predictors are based on certain assumptions with regard to the retrieval method (Carmel and Yom-Tov 2010). Furthermore, retrieval scores might not be available, and therefore, query-performance predictors that rely on retrieval scores might not be applicable (Carmel and Yom-Tov 2010). Yet, our methods, like fusion methods, can potentially use rank information rather than retrieval scores. More generally, most query-performance predictors are designed to predict the performance of a single retrieval system across different queries, rather than the relative performance of different retrieval systems for a single query, although there is some recent work in this direction (Aslam and Pavlu 2007; Diaz 2007b).