1 Introduction

A standard paradigm for addressing the ad hoc (query-based) retrieval task is to devise document and query representations and use their similarity to induce a ranking. In the vector space model, for example, a vector representing the query and a vector representing a document can be compared using the cosine similarity measure (Salton et al. 1975). In the language modeling framework, the KL divergence between the query and document language models often serves for ranking (Lafferty and Zhai 2001).

There has been much work on devising query representations, document representations, and similarity measures. For instance, various approaches for automatic query expansion have been developed (Buckley et al. 1994; Xu and Croft 1996; Lavrenko and Croft 2001; Zhai and Lafferty 2001a). Furthermore, there is a large body of work on integrating representations and similarity measures (Croft 2000b). Our focus in this paper is on the document side, that is, (specific) document representations and their integration.

The document representation task has attracted quite a lot of research attention throughout the history of information retrieval. The effectiveness of indexing documents with manually versus automatically selected terms was studied early on (Salton and Lesk 1968). Using specific index terms versus using the entire document, or its abstract or title, was also explored (Fisher and Elchesen 1972; McGill et al. 1979; Katzer et al. 1982). In many cases, the conclusion was that integrating different document representations can yield retrieval performance that is better than that of using each representation alone (Katzer et al. 1982). Cognition-based arguments, for example, were proposed to support the merits of such integration (Ingwersen 1994). Another form of document representation that was explored is based on automatic summarization, performed in a query-independent (Radev et al. 2002) or query-dependent (Tombros and Sanderson 1998) manner. Such representations can help the user, for example, to effectively examine search results.

Document representation can also be based on information that is not part of the document itself, e.g., so as to cope with the vocabulary mismatch between relevant documents and the query. For example, a document can be “expanded” using bibliographic information (Salton 1963; Kwok 1975), or a thesaurus (Joyce and Needham 1958). Alternatively, an expanded document form can be derived using similar documents in the corpus (Singhal and Pereira 1999; Kurland and Lee 2004; Liu and Croft 2004; Tao et al. 2006), or by utilizing topic-based information that is induced from the corpus (Deerwester et al. 1990; Hofmann 1999; Wei and Croft 2006; Yi and Allan 2009). On the Web, hyperlink (and hypertext) information can be used to enrich the document representation (McBryan 1994; Craswell et al. 1999; Kraaij et al. 2002; Ogilvie and Callan 2003; Metzler et al. 2009). Recently, temporal versions of the document have been used to form a representation (Elsas and Dumais 2010).

In that respect, the work on cluster-based retrieval could be viewed as representing a conceptual approach that treats a document as part of its corpus context, rather than in isolation. Examples include enriching a document model using information induced from clusters of similar documents (Singhal and Pereira 1999; Kurland and Lee 2004; Liu and Croft 2004, 2006b; Tao et al. 2006); and more generally, using document-cluster associations to identify documents pertaining to the query (Jardine and van Rijsbergen 1971; Croft 1980; Voorhees 1985; Willett 1985; Kurland and Lee 2004; Liu and Croft 2004, 2006b, 2008; Kurland and Lee 2006; Yang et al. 2006).

A conceptually opposite approach to expanding a document representation is manifested in passage-based document ranking models. The goal of such methods is to address the fact that a long and/or topically heterogeneous relevant document might contain only a small part (passage) with information pertaining to the query. A common retrieval method is ranking a document by the highest query-similarity exhibited by any of its passages (Salton et al. 1993; Callan 1994; Wilkinson 1994; Liu and Croft 2002).

Thus, the cluster-based and passage-based document ranking paradigms could be viewed as two extremes of the spectrum of approaches utilizing different document representations, that is, expansion versus contraction. Furthermore, these paradigms essentially address different, yet potentially complementary, goals: exploiting corpus context versus handling long/heterogeneous documents. Naturally, then, the following research questions arise. Can cluster-based and passage-based information be effectively integrated, along with whole-document-based information, so as to improve upon using each alone? Are there cases wherein using cluster-based information is clearly more effective than using passage-based information, and vice versa? We note that although both were demonstrated to be effective for document retrieval, cluster-based and passage-based information have so far been utilized separately in different retrieval methods.

To address the research questions just stated, we perform the following study. We devise two retrieval methods that integrate whole-document-based, cluster-based, and passage-based information. The first is a language-model-based (LM) method that integrates language models induced from documents, clusters, and passages. The method generalizes some previously proposed ranking methods that utilize either passage-based or cluster-based information, but not both. As such, the LM method enables us to thoroughly study the relative performance contributions of each of the information types it leverages. The second method that we present is based on a discriminative approach. Specifically, we use a learning-to-rank algorithm (Joachims 2002; Liu 2009) that utilizes information induced from documents, their passages, and clusters.

We use the proposed methods for the re-ranking task, which has attracted quite a lot of research attention lately (Liu and Croft 2004, 2006a, b, 2008; Diaz 2005; Kurland and Lee 2005, 2006; Yang et al. 2006; Kurland 2009); that is, re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. Extensive empirical evaluation performed using six TREC corpora shows that both the LM method and the learning-to-(re-)rank approach are highly effective in re-ranking. Specifically, the performance transcends that of a state-of-the-art cluster-based re-ranking method and that of a commonly used passage-based document ranking approach. Furthermore, the performance is often better than that of a state-of-the-art pseudo-feedback-based query expansion approach. The latter comparison could conceptually be viewed as contrasting two paradigms: enriching the query representation versus enriching (and/or contracting) the document representation.

The findings that emerge from our study with respect to the questions posed above are as follows. Using passage-based information is much more effective than using cluster-based information for corpora containing very long and topically heterogeneous documents, e.g., TREC’s FR corpus. Yet, even for such corpora, integrating the two types of information can yield performance that substantially transcends that of using each alone. For the rest of the corpora we examine, using cluster-based information is much more effective than using passage-based information, but, here too, their integration can yield improved performance. More generally, we show that integrating whole-document-, cluster-, and passage-based information can yield clear merits over using any subset of these three information types. Finally, we show that while simple learning, performed across queries, of the relative impact of these information types yields highly effective re-ranking performance, there is large room for improvement that can potentially be attained by devising methods for setting this balance on a per-query basis.

All in all, we note that our contributions are twofold. First, we study the relative merits of using whole-document-, cluster-, and passage-based information, and their integration, in the ad hoc retrieval setting. Second, we present re-ranking methods that integrate these three information types and yield high precision at top ranks. Naturally, users would like to see the documents pertaining to their information needs at the highest ranks of the retrieved lists. Furthermore, for applications such as question answering that rely on search as an intermediate step, high precision at top ranks is important (Voorhees 2002). Finally, we note that while the focus of this paper is on the “document side”, further performance improvements can potentially be attained by integration with techniques relying on the “query side”; e.g., query expansion.

2 Related work

There is a large body of work on re-ranking an initially retrieved list using information induced from clusters of documents in the list (Willett 1985; Liu and Croft 2004, 2006a, b, 2008; Kurland and Lee 2006; Yang et al. 2006; Kurland 2009). As will be shown in Sect. 3.1.1, our main (re-)ranking method generalizes a state-of-the-art cluster-based re-ranking model (Kurland 2009), which does not utilize passages. The relative performance merits of our model, which utilizes passage-based information on top of cluster-based information, are demonstrated in Sect. 4.3.

Graph-based re-ranking methods utilizing inter-item similarities have become quite common (e.g., Baliński and Daniłowicz 2005; Diaz 2005; Kurland and Lee 2005; Kurland and Lee 2006; Yang et al. 2006; Krikon et al. 2009). Specifically, document centrality (Kurland and Lee 2005), cluster centrality (Kurland and Lee 2006), and passage centrality (Krikon et al. 2009) induced over such graphs were shown to be effective for re-ranking documents. Extending our model by using such centrality measures is an avenue we aim to explore in future work. Indeed, the merits of such practice were demonstrated in work that also uses passages as proxies for documents (Krikon et al. 2009); however, cluster-based information, which is highly effective for re-ranking as we show in Sect. 4.3, was not utilized in that work. Re-ranking an initially retrieved list using inter-document similarities was also employed for searching over digital libraries (Van and Beigbeder 2008), cross-lingual retrieval (Diaz 2008), and fusion of retrieved lists (Meister et al. 2010).

A common passage-based document retrieval method is ranking a document by the highest query-similarity that any of its passages exhibits (Callan 1994; Wilkinson 1994; Kaszkiel and Zobel 2001; Liu and Croft 2002; Bendersky and Kurland 2008); another common method interpolates this similarity score with a document-query similarity score (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004; Bendersky and Kurland 2008). Our re-ranking model generalizes these methods, as will be shown in Sect. 3.1.1. Furthermore, we show in Sect. 4.3 that the model posts much better performance than these methods.

There is some work on discriminative models for passage-based document retrieval (Wang and Si 2008). In contrast to the learning-to-re-rank approach that we present, that work does not utilize cluster-based information, which is highly effective for re-ranking as we show in Sect. 4.3.

Utilizing information induced from passages could be viewed as a means for exploiting relationships between terms that are somewhat close to each other in the text. Using Markov random fields (Metzler and Croft 2005), positional language models (Lv and Zhai 2009; Zhao and Yun 2009), and approaches that utilize the document structure (e.g., for XML documents) (Beigbeder et al. 2009) has also been suggested for exploiting information induced from inter-term and term-(document) position proximities. In Sect. 4.3 we use unigram language models in our re-ranking approaches so as to facilitate the comparison with previous work in the language modeling framework on (i) passage-based (Liu and Croft 2002), (ii) cluster-based (Kurland and Lee 2004; Kurland 2009), and (iii) relevance-model-based (Lavrenko and Croft 2001; Abdul-Jaleel et al. 2004) retrieval that used these unigram models. However, we hasten to point out that the Markov random field approach, and/or positional language models, can be used in our methods for estimating the document-query and passage-query “match” so as to potentially improve performance—an avenue that we leave for future work.

Previous work on passage-based retrieval has focused on identifying and utilizing different types of passages. For example, (i) discourse passages are inferred from document markup (e.g., sentences or SGML tags) (Salton and Buckley 1991; Callan 1994; Wilkinson 1994; Cai et al. 2004; Hussain 2004), (ii) semantic passages are induced based on presumed topic shifts in a document (Hearst and Plaunt 1993; Mittendorf and Schäuble 1994; Ponte and Croft 1997; Denoyer et al. 2001; Jiang and Zhai 2004), and (iii) fixed (or variable) length passages are simply windows of consecutive terms in the document (Callan 1994; Kaszkiel and Zobel 1997; Liu and Croft 2002; Wade and Allan 2005; Na et al. 2008; Wang and Si 2008). While we focus on the latter in the evaluation presented in Sect. 4, as those were shown to be highly effective for document retrieval (see Sect. 4.2 for further details), we note that our re-ranking methods are not committed to any specific type of passages.

Furthermore, there is a large body of work on devising passage-based (Liu and Croft 2002; Abdul-Jaleel et al. 2004; Murdock and Croft 2005; Wade and Allan 2005; Bendersky and Kurland 2008) and cluster-based (Liu and Croft 2006b, 2008; Tao et al. 2006) language models. These language models could be used by our models, which are not committed to a specific language-model induction technique, so as to potentially improve their performance.

3 Re-ranking search results

Notational conventions

We use q, d, and \({\mathcal{D}}\) to denote a query, a document, and a corpus of documents, respectively. Our goal is to re-rank an initial list, \({\mathcal{D}}_{\rm init}\) (\(\subset {\mathcal{D}}\)), which was retrieved by some search algorithm in response to q, so as to improve precision at top ranks. To that end, a set of clusters of similar documents, \(Cl({\mathcal{D}}_{\rm init})\), created from documents in \({\mathcal{D}}_{\rm init}\) by some clustering algorithm, is utilized; c is used to denote a cluster. Our re-ranking methods also exploit information induced from passages in documents. We use g to denote a passage, and write \(g \in d\) if g is part of d. The methods we present are not committed to a specific clustering algorithm, nor to a specific technique of segmenting documents into passages.

3.1 A language-model-based approach

We rank the documents in \({\mathcal{D}}_{\rm init}\) using a probabilistic approach. Specifically, we aim to estimate \({p}(d\vert q)\)—the probability that d is relevant to the information need expressed by q. Assuming a uniform prior distribution over documents, the following rank equivalence holds

$$ p(d\vert q)\mathop{=}\limits^{\text{rank}}p(q \vert d). $$
(1)

In the language-modeling framework (Ponte and Croft 1998; Croft and Lafferty 2003), for example, \(p(q\vert d)\) is regarded as the probability of generating the terms in q by a language model induced from d. However, we hasten to point out that the derivation to follow is not committed to any specific paradigm of estimating probabilities, although we will use language-model-based estimates for implementation.

Clusters in \(Cl({\mathcal{D}}_{\rm init})\) could potentially be thought of as representing query-related “aspects”, by virtue of the way they are created, that is, from documents retrieved in response to the query (Liu and Croft 2004; Kurland and Lee 2006). We therefore use clusters as proxies for d (Kurland and Lee 2004):

$$ p(q \vert d)= \sum_{c\in Cl({\mathcal{D}}_{\rm init})} p(q \vert d,c) p(c \vert d). $$
(2)

To estimate \(p({q}\vert{d,c})\), we use a simple mixture governed by a free parameter \(\lambda_{clust}\): \((1-\lambda_{clust})p({q}\vert{d})+\lambda_{clust} p({q}\vert{c})\). As \(p({c}\vert{d})\) is a probability distribution over \(Cl({\mathcal{D}}_{\rm init})\), the universe of clusters that we consider, we can use some probability algebra to derive a previously proposed cluster-based retrieval algorithm (Kurland and Lee 2004; Kurland 2009):

$$ Score_{clust}(d|q)\mathop{=}\limits^{\text{def}} (1-\lambda_{clust})p({q}\vert{d})+\lambda_{clust}\sum_{c\in Cl({\mathcal{D}}_{\rm init})}p({q}\vert{c})p({c}\vert{d}). $$
(3)

Consequently, document d is highly ranked if it exhibits a good “match” to the query, as measured by \(p(q\vert d)\), and if it is strongly associated with clusters of documents in \({\mathcal{D}}_{\rm init}\) (as measured by \(p({c}\vert{d}))\) that are a good “match” to the query \((p({q}\vert{c}))\).

A potential shortcoming of the ranking function in (3) is that d is treated as a whole unit. Indeed, it could be the case that only a small part (passage) of d contains information pertaining to q, and d is still deemed relevant—e.g., by TREC’s relevance-judgment regime (Voorhees and Harman 2005). More generally, since passages could be considered as more coherent units than documents, they can potentially serve as proxies in estimating the document-query match—\(p({q}\vert{d})\) in our case. For example, some previous work (Bendersky and Kurland 2008) has demonstrated the merits in using

$$ Score_{psg}({d|q})\mathop{=}\limits^{\text{def}} (1-\lambda_{psg}) p({q}\vert{d}) +\lambda_{psg} \max_{g_i \in d} p({q}\vert{g_i}) $$
(4)

as an estimate for \(p({q}\vert{d})\); \(\lambda_{psg}\) is a free parameter. Such an approach can help to address the above-mentioned scenario of having a single passage in a document that contains query-pertaining information.

To integrate information induced from both passages and clusters, we can use the estimate from (4) for \(p({q}\vert{d})\) in (3) so as to get:

$$\begin{aligned}Score({d|q})\mathop{=}\limits^{\text{def}}&(1-\lambda_{clust})(1-\lambda_{psg})p({q}\vert{d}) \\ & +(1-\lambda_{clust})\lambda_{psg}\max_{g_i\in d}p({q}\vert{g_i})+\lambda_{clust}\sum_{c\in Cl({\mathcal{D}}_{\rm init})} p({q}\vert{c})p({c}\vert{d}).\end{aligned} $$
(5)

Algorithm

The probabilities in (5) can be estimated in various ways. Here, we follow common practice in the language-modeling framework (Ponte and Croft 1998; Croft and Lafferty 2003). Specifically, we use a language-model-based estimate, \(p_{y}(x)\), for \(p({x}\vert{y})\); \(p_{y}(x)\) is based on the probability of generating the text x by a language model induced from text y. (Specific language-model induction details are described in Sect. 4.1.) Thus, we arrive at our cluster-document-passage language-model-based re-ranking algorithm, henceforth referred to as CDPlm:

$$ \begin{aligned} Score_{CDPlm}({d|q})\mathop{=}\limits^{\text{def}} & (1-\lambda_{clust})(1-\lambda_{psg})p_{d}(q) \\ & +(1-\lambda_{clust})\lambda_{psg} \max_{g_i\in d} p_{g_i}(q) + \lambda_{clust} \sum_{c\in Cl({\mathcal{D}}_{\rm init})} p_{c}(q)p_{d}(c). \end{aligned} $$
(6)

CDPlm is a three-component mixture model. The first component is based on the direct “match” between d and q. The second component uses d’s passage that exhibits the best “match” to q as a proxy in estimating d’s “match” to q. The third component uses clusters as proxies for d.
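To make the computation concrete, the following is a minimal Python sketch of the scoring function in (6), assuming the estimates \(p_{d}(q)\), \(p_{g_i}(q)\), \(p_{c}(q)\), and \(p_{d}(c)\) have already been computed (e.g., as described in Sect. 4.1); the function and argument names are illustrative and not part of any existing implementation.

```python
# A minimal sketch of the CDPlm score of Eq. (6); the estimates are assumed
# to be precomputed and passed in as plain numbers/containers.

def cdplm_score(p_d_q, p_g_q, p_c_q, p_d_c, lambda_clust, lambda_psg):
    """Score document d against query q.

    p_d_q  -- p_d(q): whole-document estimate
    p_g_q  -- list of p_{g_i}(q), one value per passage g_i of d
    p_c_q  -- dict mapping cluster id c -> p_c(q)
    p_d_c  -- dict mapping cluster id c -> p_d(c), same keys as p_c_q
    """
    doc_part = (1 - lambda_clust) * (1 - lambda_psg) * p_d_q
    psg_part = (1 - lambda_clust) * lambda_psg * max(p_g_q)
    clust_part = lambda_clust * sum(p_c_q[c] * p_d_c[c] for c in p_c_q)
    return doc_part + psg_part + clust_part
```

Re-ranking then amounts to sorting the documents in \({\mathcal{D}}_{\rm init}\) by this score in descending order.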

3.1.1 Generalizing previous models

The CDPlm method, and more generally, the ranking criterion in (5) on which it is based, generalize various previously proposed document ranking methods. For example, setting \(\lambda_{clust} = \lambda_{psg} = 0\) in (6)—i.e., using no passage-based and cluster-based information—yields the standard language model approach (Ponte and Croft 1998). Alternatively, setting only \(\lambda_{psg} = 0\), hence using no passage-based information, we get, as mentioned above, a previously proposed cluster-based ranking model (Kurland and Lee 2004), with which we empirically compare CDPlm in Sect. 4.3.

Setting \(\lambda_{clust} = 0\), that is, ignoring cluster-based information, yields a commonly used passage-based document ranking approach (Buckley et al. 1994; Callan 1994; Cai et al. 2004; Wilkinson 1994) with which we empirically compare CDPlm in Sect. 4.3; further setting \(\lambda_{psg} = 1\) yields another commonly used passage-based document ranking principle (Callan 1994; Kaszkiel and Zobel 2001; Wilkinson 1994; Liu and Croft 2002; Bendersky and Kurland 2008).

3.2 Learning to re-rank

The CDPlm method is based on estimating the probability of document relevance using language-model estimates. We now turn to devise an alternative re-ranking method that is based on a discriminative approach, but which also uses language-model-based estimates. Specifically, we employ a commonly used learning to rank method, SVMrank (Joachims 2006), which uses support vector machines. The learner is presented with examples of queries and rankings of the initial document lists for these queries; the rankings are determined using relevance judgments. The learned ranking function is then used to re-rank an initial list for a new query.

Each document d in the initial list is represented by a vector of features that presumably indicate its relevance to the query. A weight vector for the features is learned so as to discriminate, roughly speaking, relevant from non-relevant documents for as many such document pairs as possible in the training set (Joachims 2002). We use a linear-kernel SVM; hence, the learned function is linear in the features. Now, recall that our CDPlm method is a linear mixture of three information types (whole-document-based, passage-based, and cluster-based). Hence, we use these three as features representing a document with respect to a query, so as to study whether the balance between them can be learned using a discriminative approach as that employed by SVMrank:

  1. The document-based feature:

     $$ {\bf DocFeature}(d) \mathop{=}\limits^{\text{def}} p_{d}(q). $$

  2. The cluster-based feature:

     $$ {\bf ClustFeature}(d) \mathop{=}\limits^{\text{def}} \sum_{c \in Cl({\mathcal{D}}_{\rm init})}p_{c}(q)p_{d}(c). $$

  3. The passage-based feature:

     $$ {\bf PsgFeature}(d)\mathop{=}\limits^{\text{def}} \max_{g_i\in d} p_{g_i}(q). $$

The resultant (re-)ranking model is denoted CDPsvm.
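To illustrate, the following sketch (under the same assumptions and naming conventions as the CDPlm sketch in Sect. 3.1) assembles the three features for a document and writes a training/test example in the SVMlight-style input format read by SVMrank; the helper names are illustrative.

```python
# A sketch of building CDPsvm feature vectors and emitting them in the
# SVMlight-style format used by SVMrank ("<target> qid:<qid> 1:<v1> ...").

def cdpsvm_features(p_d_q, p_g_q, p_c_q, p_d_c):
    doc_feature = p_d_q                                      # DocFeature(d)
    clust_feature = sum(p_c_q[c] * p_d_c[c] for c in p_c_q)  # ClustFeature(d)
    psg_feature = max(p_g_q)                                 # PsgFeature(d)
    return [doc_feature, clust_feature, psg_feature]

def write_svmrank_example(out, target, qid, features):
    # target is the relevance grade of the document for query qid
    pairs = " ".join(f"{i + 1}:{v}" for i, v in enumerate(features))
    out.write(f"{target} qid:{qid} {pairs}\n")
```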

We note that using binary features that indicate whether document d is among the top-ranked documents with respect to a specific feature value, as originally proposed (Joachims 2002), showed no merit. Furthermore, adding features that utilize types of passage-based and document-based information other than those utilized by CDPlm yielded no performance gains. For example, using, in addition to the features described above, passage centrality and document centrality induced over similarity-based graphs—which were shown to be effective for re-ranking (Kurland and Lee 2005; Krikon et al. 2009)—did not yield performance improvements.

4 Evaluation

In what follows we present an evaluation of the performance of the CDPlm and CDPsvm methods. The rest of this section is organized as follows. In Sect. 4.1 we describe the language model estimate used for implementation. Section 4.2 provides details with respect to the experimental setup. Section 4.3 presents the results of our experiments.

4.1 Language-model induction

In this section, we refer to documents, passages, and queries as term sequences. A cluster is represented by the long document that results from concatenating its constituent documents (Kurland and Lee 2004; Liu and Croft 2004). The order of concatenation has no effect since we use unigram language models that assume term independence.

Let \(p_{z}^{Dir{[\mu]}}(\cdot)\) be the Dirichlet-smoothed unigram language model induced from text z (a query, document, cluster, or passage) with smoothing parameter μ (Zhai and Lafferty 2001b). We use a previously-proposed estimate based on the KL divergence (Lafferty and Zhai 2001; Kurland and Lee 2004, 2005):

$$ p_{y}(x) \mathop{=}\limits^{\text{def}} \exp\left(-D \left( p_{x}^{Dir[0]}(\cdot)\left\vert\left\vert p_{y}^{Dir[\mu]}(\cdot)\right.\right.\right)\right). $$

The estimate was shown to be effective in work on cluster-based retrieval (Kurland and Lee 2004; Kurland 2009), with which we compare our methods, and on passage-based retrieval (Krikon et al. 2009). For example, the estimate addresses underflow and length-based issues that result from assigning language-model probabilities to long sequences of text (Lafferty and Zhai 2001; Lavrenko et al. 2002; Kurland and Lee 2005), e.g., \(p_{d}(c)\). While the estimate does not constitute a probability distribution—as is the case for unigram language models—normalizing it to this end yields no performance merits, as was the case in some previous work (Krikon et al. 2009; Kurland 2009).
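For concreteness, the following is a minimal sketch of this estimate, assuming texts are available as token lists and that a collection (background) language model is given as a term-to-probability dictionary; terms with zero probability under the smoothed model are simply skipped, which is a simplification, and all names are illustrative.

```python
import math
from collections import Counter

# A minimal sketch of p_y(x) = exp(-KL(p_x^{Dir[0]} || p_y^{Dir[mu]})).
# Texts are token lists; collection_lm maps a term to its corpus probability.

def dirichlet_lm(tokens, collection_lm, mu):
    """Dirichlet-smoothed unigram model induced from tokens (mu=0 gives the MLE)."""
    counts, n = Counter(tokens), len(tokens)
    return lambda term: (counts[term] + mu * collection_lm.get(term, 0.0)) / (n + mu)

def p_y_of_x(x_tokens, y_tokens, collection_lm, mu=2000):
    p_x = dirichlet_lm(x_tokens, collection_lm, mu=0)
    p_y = dirichlet_lm(y_tokens, collection_lm, mu=mu)
    kl = 0.0
    for term in set(x_tokens):
        px, py = p_x(term), p_y(term)
        if px > 0 and py > 0:        # simplification: skip zero-probability terms
            kl += px * math.log(px / py)
    return math.exp(-kl)
```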

4.2 Experimental setup

We conducted experiments using the TREC corpora specified in Table 1. For each corpus we report the average document length, and the average similarity between passages within a document in the initial list to be re-ranked, \({\mathcal{D}}_{\rm init}\) (further details below). The latter is computed by \(\frac{1}{|{\mathcal{D}}_{\rm init}|}\sum_{d \in {\mathcal{D}}_{\rm init}} \frac{\sum_{g_i \in d,g_j \in d} p_{g_i}(g_j)}{m(d)^2}\), where m(d) is the number of passages in d and \(\vert {\mathcal{D}}_{\rm init} \vert\) is the number of documents in \({\mathcal{D}}_{\rm init}\). The motivation for using these corpora is threefold: the different types of documents that they contain (news articles, federal register records, and Web pages); the varying average document length and presumed document “homogeneity” (as measured by inter-passage similarities), which can affect the relative effectiveness of document-based, passage-based, and cluster-based retrieval; and compliance with previous work on cluster-based re-ranking (Kurland 2009) and passage-based retrieval (Callan 1994; Liu and Croft 2002) that used some of these corpora and with which we compare our models.
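The following sketch computes this quantity, reusing the \(p_{y}(x)\) estimate sketched in Sect. 4.1 (assumed available as the function p_y_of_x) and assuming each document in \({\mathcal{D}}_{\rm init}\) is given as the list of its passages (token lists); names are illustrative.

```python
# A sketch of the average within-document inter-passage similarity of Table 1;
# docs_passages holds, for each document in D_init, the list of its passages.

def avg_inter_passage_similarity(docs_passages, collection_lm, mu=2000):
    total = 0.0
    for passages in docs_passages:
        m = len(passages)
        sims = sum(p_y_of_x(g_j, g_i, collection_lm, mu)   # p_{g_i}(g_j)
                   for g_i in passages for g_j in passages)
        total += sims / (m ** 2)
    return total / len(docs_passages)
```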

Table 1 TREC corpora used for experiments

Specifically, AP, SJMN and WSJ are news corpora. TREC8, which is considered a hard benchmark (Voorhees 2005), is mainly composed of news documents, but also contains federal register records. FR is composed of only federal register records. Furthermore, passage-based document ranking methods are known to be more effective than whole-document-based approaches for FR (Callan 1994; Liu and Croft 2002; Bendersky and Kurland 2008; Wang and Si 2008). This finding is often attributed to the fact that the FR documents are very long and “heterogeneous”. Indeed, the average document length for FR is much higher than that for other corpora; and, the average within-document inter-passage similarity is quite low with respect to that for the news corpora. We come back to these points later on. WT10G is a Web corpus that contains quite long (on average) documents. Furthermore, the Web documents are quite “heterogeneous” as measured by the within-document inter-passage similarities.

We used titles of TREC topics for queries. Tokenization and Porter stemming were applied using the Lemur toolkit (http://www.lemurproject.org). Stop words were not removed. The Lemur and Zettair (http://www.seg.rmit.edu.au/zettair) toolkits were used for experiments.

We use the experimental setup proposed in some previous work on re-ranking (Kurland and Lee 2005, 2006; Kurland 2009; Krikon et al. 2009). The list \({\mathcal{D}}_{\rm init}\), upon which re-ranking is performed, is set to the 50 documents in the corpus that yield the highest \(p_{d}(q)\)—i.e., a standard language-model-based approach. We note that re-ranking methods that utilize inter-document similarities—in our case, using information induced from document clusters—are known to be most effective when employed over relatively short retrieved lists (Diaz 2005; Kurland 2006). The document language-model smoothing parameter, μ, is set to optimize MAP (at 1000) so as to have an initial list of reasonable quality. In Sect. 4.3 we show that when employed over such a reasonable ranking, our re-ranking methods can yield performance that is better than that of state-of-the-art retrieval methods, whether used to rank the entire corpus or only re-rank the initial list.

The goal of re-ranking methods is to improve precision at the very top ranks. Therefore, we focus on the precision of the top 5 and 10 documents (p@5 and p@10, respectively) as evaluation measures. Statistically significant performance differences are determined using the two-tailed paired t test at a confidence level of 95% (Sanderson and Zobel 2005; Smucker et al. 2007).

As mentioned above, our methods are not committed to a specific type of passages and clusters. We use half-overlapping windows of 150 terms for passages, as these were shown to be effective (e.g., in comparison to other types of passages and in comparison to windows of 50 and 25 terms) in work on passage-based document retrieval (Callan 1994; Wang and Si 2008), specifically, in the language modeling framework (Liu and Croft 2002; Bendersky and Kurland 2008; Krikon et al. 2009).
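A minimal sketch of this segmentation, assuming a document is given as a list of terms after tokenization and stemming, could look as follows (names are illustrative).

```python
# A sketch of splitting a document into half-overlapping windows of terms
# (150-term windows with a 75-term step); tokenization is assumed done.

def half_overlapping_passages(terms, window=150):
    step = window // 2
    return [terms[start:start + window]
            for start in range(0, max(len(terms) - step, 1), step)]
```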

To cluster \({\mathcal{D}}_{\rm init}\), we employ a commonly used nearest-neighbor-based approach that yields overlapping clusters (Griffiths et al. 1986; Kurland and Lee 2006; Liu and Croft 2006a; Kurland 2009). For each d (\(\in {\mathcal{D}}_{\rm init}\)) we define a cluster that contains d and the k − 1 documents \(d_i\) in \(\mathcal{D}_{\rm init}\) (\(d_i \neq d\)) that yield the highest \(p_{d_i}(d)\). We use clusters of k = 10 documents, as such small clusters were shown to be effective in work on cluster-based retrieval, specifically, for the re-ranking task (Kurland and Lee 2006; Liu and Croft 2006a; Kurland 2009).
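The clustering step can be sketched as follows, assuming the inter-document estimates \(p_{d_i}(d)\) are precomputed (e.g., with the estimate of Sect. 4.1) and stored in a dictionary keyed by document-id pairs; names are illustrative.

```python
# A sketch of the nearest-neighbor clustering of D_init: for each document d,
# form a cluster containing d and the k-1 documents d_i with the highest
# p_{d_i}(d); p_di_d[(i, j)] is assumed to hold p_{d_i}(d_j) for i != j.

def nearest_neighbor_clusters(doc_ids, p_di_d, k=10):
    clusters = []
    for d in doc_ids:
        neighbors = sorted((di for di in doc_ids if di != d),
                           key=lambda di: p_di_d[(di, d)], reverse=True)
        clusters.append([d] + neighbors[:k - 1])
    return clusters
```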

Parameters

The smoothing parameter, μ, is set to 2000 (Zhai and Lafferty 2001b) in all methods, except for estimating \(p_{d}(q)\), where we use the value chosen for creating \({\mathcal{D}}_{\rm init}\) so as to maintain consistency with the initial ranking.

The CDPlm method incorporates two free parameters, \(\lambda_{clust}\) and \(\lambda_{psg}\), which control the relative impact of cluster-based and passage-based information, respectively. To thoroughly study the relative merits of using these information types, and the overall resultant effectiveness of CDPlm, we use the following experimental settings.

In Sect. 4.3.1 we study the optimal performance that can be attained by CDPlm and the components it is composed of. To that end, we set \(\lambda_{clust}\) and \(\lambda_{psg}\) to values that yield optimized performance on a per-query basis. This practice enables us to compare the relative effectiveness of whole-document-, passage-, and cluster-based information while completely neutralizing free-parameter-value effects. Then, in Sect. 4.3.2 we set \(\lambda_{clust}\) and \(\lambda_{psg}\) to values that result in optimized average performance over the set of queries per corpus. Doing so helps to shed light on the potential performance of CDPlm when using the same (effective) parameter values for all queries. Finally, in Sect. 4.3.3 we present performance numbers when learning the values of the free parameters of CDPlm, and those of the reference comparisons, using a leave-one-out cross-validation procedure performed over queries.

The evaluation metric for which performance is optimized in all cases is p@5. The values of \(\lambda_{clust}\) and \(\lambda_{psg}\) are chosen from \(\{0,0.1,\ldots,1\}\). For compatibility, in Sect. 4.3.4 we also use a leave-one-out cross-validation procedure to train/test the learning-to-re-rank method, CDPsvm.

Efficiency considerations

We segment documents into passages prior to retrieval time. Hence, the main computational overhead incurred by our methods on top of the initial retrieval is clustering the initially retrieved list; specifically, computing inter-document similarities. However, the initial list is quite short—composed of only 50 documents—and therefore, this overhead is not substantial. Furthermore, inter-document similarities could be computed based on snippets of documents, rather than using whole-document content, as was done, for example, in work on clustering the results of Web search engines (Zamir and Etzioni 1998). Similar efficiency considerations were echoed in previous work on using query-specific clusters—i.e., clusters of top-retrieved documents—for re-ranking (Willett 1985; Liu and Croft 2004, 2006a; Kurland and Lee 2005, 2006; Yang et al. 2006) and in work on graph-based re-ranking methods that utilize inter-document similarities among top-retrieved documents (Diaz 2005; Kurland and Lee 2005; Krikon et al. 2009).

4.3 Experimental results

In what follows we present and analyze the performance of CDPlm and its components (Sects. 4.3.1–4.3.3), and that of CDPsvm (Sect. 4.3.4), when re-ranking an initial list that was retrieved using a language-model-based approach as described above. In Sect. 4.3.5 we study the effectiveness of CDPlm in re-ranking an initial list that was retrieved using Okapi-BM25 (Robertson et al. 1994).

4.3.1 Optimal-performance analysis

Our first order of business is studying the relative effectiveness of using whole-document-based, cluster-based, and passage-based information. To that end, we use free-parameter settings that yield specific instances of CDPlm [refer back to (6)]. Furthermore, we neutralize the effect of free parameters that are not fixed, by using values that yield optimal p@5 on a per-query basis, as explained above. Such practice enables a fair comparison of the optimal performance that can be attained by CDPlm and its components. The parameter settings are:

  • Doc (\(\lambda_{clust} = \lambda_{psg} = 0\)): the initial ranking that is based solely on whole-document information;

  • Clust (\(\lambda_{clust} = 1\)): uses only cluster-based information; this is a previously proposed cluster-based (re-)ranking method (Kurland and Lee 2004; Kurland 2009);

  • Psg (\(\lambda_{clust} = 0\), \(\lambda_{psg} = 1\)): a commonly used method that utilizes only passage-based information (Callan 1994; Liu and Croft 2002);

  • DocClust (\(\lambda_{psg} = 0\)): uses document-based and cluster-based information, and was shown to yield state-of-the-art re-ranking performance (Kurland 2009);

  • DocPsg (\(\lambda_{clust} = 0\)): uses document-based and passage-based information; this is also a commonly used passage-based ranking approach (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004; Bendersky and Kurland 2008); and,

  • ClustPsg (\(\lambda_{psg} = 1\)): uses cluster-based and passage-based information.

Table 2 presents the performance numbers. The numbers in the first row represent the upper bound on performance; that is, the performance attained by positioning all relevant documents in the initial list, \({\mathcal{D}}_{\rm init}\), at the highest ranks.

Table 2 Optimal-performance analysis of the information types utilized by CDPlm; free-parameter values are set to optimize per-query performance

Our first observation based on Table 2 is that when used alone, cluster-based information (Clust) is in most cases more effective than whole-document-based (Doc) and passage-based (Psg) information. The notable exception is the FR corpus, for which Clust posts poor performance in comparison to that of Doc and Psg. This finding can be explained by the statistics presented in Table 1: FR contains long and heterogeneous documents, as manifested in the within-document inter-passage similarities. As clustering is based on inter-document similarities, and those could be dominated by many non-query-related aspects when the documents are highly heterogeneous, clusters then convey less effective information for re-ranking than in cases wherein documents are relatively “homogeneous”. Indeed, WT10G, which also contains heterogeneous documents (refer to Table 1), is the second corpus, in addition to FR, for which Clust underperforms Doc (in terms of p@5); for the news-based corpora, which contain relatively short and homogeneous documents, this does not happen. Nevertheless, we can see that integrating cluster-based information with whole-document-based (DocClust) or passage-based information (ClustPsg) yields effective re-ranking performance even for the FR and WT10G corpora.

More generally, the integration of any two types of information yields performance that is substantially better than that of using each alone; furthermore, the resultant performance is much better than that of the initial ranking. Specifically, the ClustPsg method outperforms both Clust and Psg by a considerable margin. As the performance of Psg is often below that of the initial document-based ranking, while that of Clust is often above it, we conclude that passage-based and cluster-based information are complementary, and there are clear merits in integrating them.

We can also see in Table 2 that the performance of CDPlm, which integrates whole-document-, cluster-, and passage-based information, is better to a substantial (and often statistically significant) degree than that of its specific instances that utilize only one or two of the three information types. Thus, the overall picture arising from Table 2 is that the integration of whole-document-, cluster-, and passage-based information has a clear potential. In other words, if we were able to automatically set, for each query, the \(\lambda_{clust}\) and \(\lambda_{psg}\) parameters, which control the relative impact of the information types, to highly effective values, then the integration of these information types would be of clear merit. Still, there is much room for improvement, as the “upper bound” numbers attest, which can be addressed by using some of the approaches discussed in Sect. 2 in addition to CDPlm.

4.3.2 Performance analysis when using the same effective free-parameter values for all queries

The analysis presented above focused on the optimal potential performance of CDPlm and its components. The optimal performance was attained by setting free-parameter values to optimize performance for each query. We now turn to analyze the potential effectiveness of CDPlm when using the same (effective) free-parameter values for all queries per corpus; specifically, \(\lambda_{clust}\) and \(\lambda_{psg}\) are set to values that optimize average (over queries) p@5. Naturally, finding such effective parameter values is a task in its own right, which we address in the next section using cross validation. Yet, such practice enables us to study, using a setup more practical than the one above, the relative benefits of cluster-based and passage-based information. Furthermore, we can contrast the performance of CDPlm with that of reference comparisons when (partially) ameliorating the effects of free-parameter values, yet avoiding per-query fitting of parameter values.

We first study the general effectiveness of CDPlm as a re-ranking method. To that end, we compare its performance with that of the initial ranking upon which re-ranking is performed. Recall that the initial ranking was created by a standard language-model approach (\(p_{d}(q)\)) wherein the smoothing parameter, μ, was optimized for MAP. Hence, we also compare CDPlm with optimized baselines that use \(p_{d}(q)\) to rank all documents in the corpus, with μ optimized for p@5 and p@10, independently. We can see in Table 3 that CDPlm consistently and substantially outperforms both the initial ranking and the optimized baselines, often to a statistically significant degree.

Table 3 Comparison with the initial ranking and optimized baselines when using the same (optimized) free-parameter values for all queries

To further study the impact of cluster-based and passage-based information, we present in Fig. 1 the effect of varying the values of \(\lambda_{clust}\) and \(\lambda_{psg}\) on the p@5 performance of CDPlm; when setting one of the two parameters to some value, the value of the second parameter is set so as to maximize average (over queries) p@5. It is important to note that while \(\lambda_{clust}\) solely determines the impact of cluster-based information, both \(\lambda_{psg}\) and \(\lambda_{clust}\) determine that of passage-based information. [Refer back to (6).]

Fig. 1 Effect of varying \(\lambda_{clust}\) (first and second rows) and \(\lambda_{psg}\) (third and fourth rows) on the p@5 performance of CDPlm. The performance of the initial ranking, depicted with horizontal lines, is presented for reference. Note: figures are not to the same scale

Putting aside the case of the FR corpus, we can see in Fig. 1 that the performance of CDPlm is much superior to that of the initial ranking for a vast majority of the values of \(\lambda_{clust}\) (≠ 0), and for all values of \(\lambda_{psg}\). These findings attest to the merits of the way CDPlm utilizes and integrates passage-based and cluster-based information. Furthermore, we can see that using \(\lambda_{clust} \in \{0.1,0.2\}\) and \(\lambda_{psg} \in \{0.2,0.3\}\) often yields near-optimal performance.

For the FR corpus, we see, as shown above, that using cluster-based information results in ineffective re-ranking performance. Furthermore, only relatively large values of \(\lambda_{psg}\)—i.e., putting a lot of emphasis on passage-based information—yield performance that is (much) better than that of the initial ranking. (For \(\lambda_{psg} = 1\), no document-based information is used on top of passage-based information, and hence, there is a relative decrease in performance.) Similarly, we see that putting too much emphasis on cluster-based information is not effective for WT10G, which, like FR, contains long and heterogeneous documents.

Comparison with pseudo-feedback-based query expansion

The CDPlm method uses information from the initial list, \({\mathcal{D}}_{\rm init}\), to re-rank it. Pseudo-feedback-based query expansion methods, on the other hand, use information from \({\mathcal{D}}_{\rm init}\) to construct a query model that is then used to rank the entire corpus. Furthermore, CDPlm, as noted above, can conceptually be viewed as integrating different approaches for representing a document, while query expansion methods focus on the query representation. Thus, we turn to compare the performance of CDPlm with that of a state-of-the-art pseudo-feedback-based query expansion approach, namely, relevance model number 3 (RM3) (Lavrenko and Croft 2001; Abdul-Jaleel et al. 2004). For completeness of comparison, we also study a variant, RM3(re), which uses the constructed relevance model to re-rank \({\mathcal{D}}_{\rm init}\), rather than to rank the entire corpus. Ranking with a relevance model is based on its cross entropy with the document language model (Lavrenko 2004).

The values of the free parameters of RM3 and RM3(re) are set to optimize average p@5 over the set of queries per corpus, as is the case for CDPlm. Specifically, the (Jelinek–Mercer) smoothing parameter used for relevance-model construction is chosen from \(\{0,0.1,0.3,\ldots 0.9\}\); the number of terms used by the models is chosen from {25, 50, 75, 100, 500, 1000, 5000, ALL}, where “ALL” stands for using all terms in the vocabulary; and, the interpolation parameter that controls the reliance on the original query is set to a value in \(\{0,0.1,\ldots,0.9\}\). The (Dirichlet) document language model smoothing parameter (μ) used for ranking with a relevance model is set to 2000 as in all other methods. Table 4 presents the performance comparison.

Table 4 Comparison with a relevance model used to either rank all corpus (RM3) or to re-rank the initial list (RM3(re))

We can see that the performance of CDPlm is superior to that of the relevance models in most relevant comparisons (corpus × evaluation measure). Specifically, CDPlm posts p@5 performance—the metric for which performance was optimized—that is substantially better than that of the relevance models over AP and TREC8; the improvement over RM3 for AP is also statistically significant. We can also see that in the few cases that CDPlm is outperformed by the relevance models the performance differences are not statistically significant.

4.3.3 Learning free-parameter values

Heretofore, we evaluated the potential performance of CDPlm, and that of the reference comparisons, by ameliorating issues that arise from free-parameter values. Now, we turn to study whether effective parameter values generalize from one query to another. We note that this study is different from that presented in Fig. 1, wherein we analyzed the robustness of the average (over queries) performance of CDPlm with respect to free-parameter values.

We take a leave-one-out cross-validation approach. The free-parameter values of a method per query are set to those optimizing p@5 performance over all other queries for the same corpus. We present the resultant performance of CDPlm and the reference comparisons in Table 5.
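The following is a minimal sketch of this procedure, assuming a function that runs CDPlm for a query under a given \((\lambda_{clust}, \lambda_{psg})\) setting and returns the query's p@5; all names are illustrative.

```python
import itertools

# A sketch of leave-one-out selection of (lambda_clust, lambda_psg): for each
# held-out query, the pair maximizing mean p@5 over all other queries is used.
# p_at_5(query, lambdas) is assumed to run CDPlm and return the query's p@5.

def leave_one_out(queries, p_at_5, values=None):
    values = values or [i / 10 for i in range(11)]            # {0, 0.1, ..., 1}
    grid = list(itertools.product(values, values))
    per_query = {}
    for held_out in queries:
        train = [q for q in queries if q != held_out]
        best = max(grid, key=lambda lam: sum(p_at_5(q, lam) for q in train))
        per_query[held_out] = p_at_5(held_out, best)
    return per_query
```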

Table 5 Performance numbers when learning free parameter values using a leave-one-out cross validation procedure

Our first observation based on Table 5 is that CDPlm outperforms the initial ranking in almost all reference comparisons; often, the improvements are substantial and statistically significant. This finding further attests to the effectiveness of CDPlm in re-ranking.

Another observation we make based on Table 5 is that CDPlm outperforms its specific instantiations, DocPsg (a standard passage-based ranking method (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004; Bendersky and Kurland 2008)) and DocClust (a state-of-the-art cluster-based re-ranking approach (Kurland and Lee 2004; Kurland 2009)), in most relevant comparisons; in several cases (e.g., refer to AP and SJMN), the performance differences are also statistically significant. Furthermore, DocPsg and DocClust never outperform CDPlm in a statistically significant manner. These findings show that the relative importance of whole-document-, passage-, and cluster-based information, as determined by CDPlm’s free-parameter values, can be learned relatively effectively across queries. Naturally, however, the performance numbers (both for CDPlm and for DocPsg and DocClust) are much lower than those presented in Table 2, which were attained by setting parameter values so as to optimize per-query performance. Hence, there is much room for improvement that can potentially be obtained by devising methods for automatically setting the relative importance of whole-document-, passage-, and cluster-based information on a per-query basis.

We can also see in Table 5 that CDPlm outperforms RM3, which ranks the entire corpus, and RM3(re), which re-ranks the initial list, in most relevant comparisons; some of these performance differences are also statistically significant. We also note that while RM3 outperforms CDPlm over WSJ—although not to a statistically significant degree—the statistically significant improvements posted by CDPlm over RM3 for WT10G are quite striking. As is the case for CDPlm, the relevance model approach can benefit much from devising methods for automatically setting free-parameter values on a per-query basis. As a case in point, compare the performance numbers of the relevance-model implementations presented in Tables 4 and 5—the former, which in many cases are much better than the latter, are based on free-parameter values that result in optimized average performance for a corpus, while the latter are based on using cross validation to set free-parameter values.

All in all, we see that in general, when learning free parameter values using cross validation, CDPlm is the most effective method among those presented in Table 5. (Note that the p@5—the metric based on which learning of free parameter values was performed—posted by CDPlm is the best for four out of six corpora; furthermore, CDPlm is the only method in Table 5 that is never outperformed in a statistically significant manner by any other method.)

4.3.4 Learning to re-rank

The learning-to-re-rank method, CDPsvm, uses SVMrank (Joachims 2006). We use the default values for all SVMrank parameters, except for that of c, which controls the bias-variance trade-off. As it turns out, c has considerable impact on the resultant re-ranking performance. Thus, we present performance numbers for two settings of CDPsvm.

The first setting, CDPsvm(B), is based on using a leave-one-out cross validation for training/testing SVMrank over all queries for each value of c. Then, the value of c that yields the best (average over queries) p@5 performance is selected, and the resultant performance is reported. In the second setting, CDPsvm(L), the value of c is learned for each query as follows. We perform a leave-one-out cross-validation over the rest of the queries to find the value of c that optimizes p@5. Using this value, we then learn a model using these queries and apply it to the query at hand. The values of c are chosen from \(\{10^{-5},5*10^{-5},\ldots,0.1,0.5,5,50,500,1000,5000,10000\}\).

For comparison purposes, we present the performance of CDPlm when its two free parameters, \(\lambda_{clust}\) and \(\lambda_{psg}\), are optimized for average-over-queries performance (CDPlm(B)), as was the case in Table 3; and, its performance when using leave-one-out cross validation to learn the values of these parameters (CDPlm(L)), as was the case in Table 5.

We can see in Table 6 that in most reference comparisons, the implementations of our methods improve over the initial ranking, often to a substantial and statistically significant degree. This finding further supports the merits of integrating cluster-, document-, and passage-based information for re-ranking, whether using a probabilistic model (CDPlm) or a learning-to-rank approach (CDPsvm).

Table 6 Comparison of CDPsvm with CDPlm

Evidently, the potential performance of CDPlm is somewhat better than that of CDPsvm as manifested in the best-parameter-values setups (‘B’) for most relevant comparisons. Now, recall that both CDPlm and CDPsvm use a linear interpolation of the same language-model-based estimates. Hence, these performance differences—although not statistically significant—may imply that learning a “good” balance between the three information types (cluster-, document-, and passage-based) in a discriminative manner by SVMrank can fall short, possibly due to query-variability issues (Peng et al. 2010).

The comparison between CDPlm and CDPsvm when learning free parameter values (‘L’) reveals that the performance of the former is in most relevant comparisons somewhat superior to that of the latter; for WT10G, the difference is quite substantial and also statistically significant.

We can also see in Table 6 that in some cases the performance of CDPlm and CDPsvm decreases quite a bit when moving from the best (‘B’) to the learned (‘L’) parameter settings. Thus, while both CDPlm and CDPsvm are very effective in most reference comparisons when learning free-parameter values, there is still room for improvement with respect to setting these values on a per-query basis—a challenging task, as mentioned above, that we leave for future work.

4.3.5 Re-ranking an Okapi-BM25-based initially retrieved list

Thus far, the initial list, \({\mathcal{D}}_{\rm init}\), upon which re-ranking was performed, was set to the 50 documents that were ranked highest by a language-model-based approach. We now turn to study whether CDPlm is effective in re-ranking an initial list that is retrieved by a different retrieval method; specifically, we use Okapi-BM25 (Robertson et al. 1994). As was the case for the initial language-model-based ranking, we set Okapi’s free parameters to values that optimize MAP (@1000) so as to create an initial list of reasonable quality. Following previous recommendations (Robertson et al. 2000, 2004), we use the following free-parameter value ranges: \(k_1 \in \{0.1,0.25,0.5,0.75,0.9,1,1.2,2,2.5,3\}\); \(k_3 \in \{0.1,0.2,0.5,0.8,1,2,5,7,10,15,20\}\); and \(b \in \{0.1,0.2,0.3,0.5,0.75,0.85,0.95, 1,1.5,2.5,3\}\). The 50 highest ranked documents are re-ranked using CDPlm, which uses language-model-based estimates as described above; CDPlm’s free-parameter values are set to optimize average p@5 performance per corpus, as was the case in Sect. 4.3.2. The performance numbers are presented in Table 7.

Table 7 Performance of CDPlm when re-ranking an initial list of documents that was retrieved using Okapi-BM25

As we can see in Table 7, the performance of CDPlm is superior to that of the Okapi-based initial ranking in almost all reference comparisons; furthermore, in quite a few cases the improvements are statistically significant. These findings further demonstrate the effectiveness of CDPlm in re-ranking.

Note

For the WSJ corpus, the Okapi-BM25 initial ranking can be improved quite a bit if stopwords are removed from queries. (We used Zettair’s stopword list; recall that in our experimental setup above stopwords were not removed from queries and documents.) For the other corpora, however, the improvements are smaller, there are no improvements, or there is even performance degradation. For WSJ, removing stopwords from queries results in an initial Okapi-BM25 ranking with p@5 and p@10 of 57.2 and 49.4, respectively. Employing CDPlm upon this initial ranking yields p@5 and p@10 of 61.2 and 54.2, respectively; the p@10 improvement is also statistically significant. Thus, we see that even when improving the effectiveness of the initial ranking (by using a different pre-processing regime here), CDPlm still posts quite substantial performance improvements over this ranking.

5 Conclusions and future work

Cluster-based and passage-based document ranking approaches could be viewed as employing two opposite approaches for document representation. Cluster-based document retrieval is often based on expanding the document representation with corpus context manifested in the clustering structure. Passage-based document retrieval is based on focusing on a specific part of the document.

We presented a study of the relative merits of each of these approaches, and of the potential of integrating them. To perform the study, we devised two retrieval methods that integrate whole-document-, cluster-, and passage-based information. The first is a probabilistic approach that uses document-based, passage-based and cluster-based language models. The second is a discriminative, learning-to-rank, approach that uses language-model-based estimates.

We evaluated and studied the proposed methods when applied for the re-ranking task—re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. We showed that the methods consistently and substantially outperform the initial ranking. The resultant performance of the probabilistic approach also transcends that of document ranking methods that use either cluster-based or passage-based information, but not both. Hence, the empirical findings support the complementary nature of these two information types, and the potential in integrating them. Furthermore, we showed that the integration can yield performance that often transcends that of a state-of-the-art pseudo-feedback-based query expansion method—i.e., an approach that focuses on query representation, rather than on document representation, which is the focus of this paper.

In addition, the study showed that using cluster-based information is much more effective than using passage-based information for document ranking, except for corpora containing very long (and heterogeneous) documents, for which the reverse holds. Nevertheless, integrating cluster-based and passage-based information can yield performance that substantially transcends that of using each alone. More generally, we showed that integrating these two types of information with whole-document-based information can yield performance that is substantially better than that of using any subset of the three information types.

A future direction that emerged from the study is devising an automatic way of balancing the use of whole-document-, cluster-, and passage-based information on a per-query basis. While there is some work on controlling the use of whole-document-based versus passage-based information (Bendersky and Kurland 2008) on a per-document basis, an open challenge is how to balance those with respect to using cluster-based information on a per-query basis.

As noted above, the study presented in this paper addresses one component of a search system; that is, (a part of) the document representation task is addressed from an effectiveness perspective. As already stated, our approach does not incur a considerable computational overhead over the initial ranking that is based on document-query similarities. Hence, from an efficiency point of view, the approach is applicable in practical retrieval settings. Yet, a natural question, which arises with regard to Cranfield-style evaluations (Hersh et al. 2000; Turpin and Hersh 2001; Smucker and Jethani 2010) such as the one we presented here, is whether the presented effectiveness improvements can be translated to improved user satisfaction/effectiveness. While this is an interesting question in its own right, we note that there are still additional means that can be employed so as to potentially improve the performance of our approach, and which can further increase its potential benefit to users in practical search settings. For example, while our focus was on the “document side”, integrating in addition different (expanded) query representations can potentially help improve performance; e.g., cluster-based (and topic-model-based) query expansion (Liu and Croft 2004; Tao et al. 2006; Wei and Croft 2006; Kalmanovich and Kurland 2009) and passage-based query expansion (Liu and Croft 2002; Bendersky and Kurland 2008) were shown to be of merit. Furthermore, using different types of passages, and utilizing different types of language models and/or term-proximity-based models, can also potentially improve performance, as mentioned in Sect. 2.