1 Introduction

One of the major challenges that search engines have to cope with is the ad hoc retrieval task: ranking documents in a corpus (collection) by their presumed relevance to an information need represented by a given query.

Many retrieval methods are based on comparing a representation of the document as a whole with that of the query so as to induce document ranking [e.g., vector space model (Salton et al. 1975), language modeling framework (Ponte and Croft 1998; Croft and Lafferty 2003)]. However, it could be the case that a long and/or heterogeneous (with respect to content) document has only a short part (passage) with information pertaining to the query. Under the premise that such a document is of interest to the user who initiated the query, as it does bear some pertaining information, comparing the document as a whole with the query can fall short with respect to the effectiveness of the resultant ranking.

Thus, researchers have proposed various approaches that utilize passage-based information for document retrieval (Salton et al. 1993; Callan 1994; Mittendorf and Schäuble 1994; Wilkinson 1994; Kaszkiel and Zobel 1997; Denoyer et al. 2001; Kaszkiel and Zobel 2001; Liu and Croft 2002; Wan et al. 2008; Wang and Si 2008). The two most common methods are ranking a document by the highest query-similarity score assigned to any of its passages (Callan 1994; Wilkinson 1994; Liu and Croft 2002), and interpolating this score with the document-query similarity score (Callan 1994; Wilkinson 1994).

We show that several of these previously proposed passage-based document-ranking approaches—including the two just mentioned—which might seem at first glance somewhat independent, can actually be understood using the same probabilistic model. The key principle that guides the derivation of our model is that passages can serve as effective proxies for ranking documents. More generally, the fundamental hypothesis underlying the work presented in this paper is that integrating information induced from the document as a whole with information induced from its constituent passages, in a way that depends on the content-homogeneity of the document, is beneficial for document ranking.

To instantiate specific retrieval algorithms using our proposed model, we use statistical language models (Ponte and Croft 1998; Croft and Lafferty 2003). We use the term “statistical language model” to refer to a probability distribution defined over the vocabulary that is induced from a given span of text (e.g., document or passage). To score a span of text (document or passage) in response to a query we determine the probability that the query terms can be generated by a language model induced from the span. Indeed, there is a large body of work on utilizing language models for different information retrieval tasks (Zhai 2008).

Among the main contributions of this paper is the proposal of a novel passage language model that incorporates information from the containing document to an extent controlled by the estimated document homogeneity. Our hypothesis is that (language) models of passages in highly homogeneous documents, that is, documents that could be thought of as focusing on a single topic (issue), should utilize a substantial amount of information from the containing document; for passages in highly heterogeneous documents—which potentially discuss a few topics—minimal such information should be used.

Experiments performed over TREC data attest that several document-homogeneity measures that we present yield passage language models that are more effective for basic passage-based document ranking than the standard passage model; this model uses information only from the passage and some general corpus-based term statistics. Our proposed passage models are also more effective than the standard one for constructing and utilizing passage-based relevance models (Liu and Croft 2002); furthermore, the resultant relevance models also outperform a document-based relevance model (Lavrenko and Croft 2001).

The probabilistic model that we present also gives rise to new passage-based document ranking approaches. One such approach that we study and analyze integrates document-query and passage-query similarity information based on document-homogeneity measures; experimental results demonstrate the effectiveness of this approach. In further exploration we study the merits of integrating this method with our proposed passage language models.

The rest of this paper is organized as follows. In Sect. 2 we develop a probabilistic framework for passage-based document ranking; we then instantiate language-model-based algorithms using this framework in Sect. 2.2 wherein we also present a novel passage language model. Section 2.3 presents measures for estimating document homogeneity, which are utilized in some of our retrieval methods and language models. We then survey related work in Sect. 3, and present an empirical evaluation of the proposed methods in Sects. 4 and 5. We conclude and specify some potential directions for future work in Sect. 6.

2 Retrieval framework

In what follows we present a probabilistic framework for passage-based document retrieval. We show that several previously-proposed passage-based document-ranking approaches can be understood using the proposed framework, and that some new retrieval models can be derived from it. To instantiate specific algorithms from the framework, we use language model estimates in Sect. 2.2; in doing so, we present a novel passage language model.

Notation and conventions. Throughout this section we assume that the following have been fixed: a query q, a document d, and a corpus of documents \({\mathcal C} \; (d \in {\mathcal C}). \) We use g to denote a passage, and write g ∈ d if g is one of d’s m passages. (Our algorithms are not dependent on the type of passages used.) We write \(p_{x}(\cdot)\) to denote a (smoothed) unigram language model induced from x (a document or a passage); our language model induction methods are described in Sect. 2.2.1.

2.1 Passage-based document ranking

We rank document d in response to query q by estimating the probability \({p}(q \vert d)\) that q can be generated from a model induced from d, as is common in the language modeling approach to retrieval (Ponte and Croft 1998; Croft and Lafferty 2003). We hasten to point out, however, that our framework is not committed to any specific estimates for probabilities of the form \({p}(q \vert x),\) which we often refer to as the “query-similarity” of x.

Since passages are smaller, and hence, could be considered as more focused units than documents, they can potentially “help” in “generating” queries, which are usually composed of a few terms. Thus, assuming that all passages in the corpus can serve as proxies (representatives) of d for generating any query, and using \(p(g_i\vert d)\) to denote the probability that some passage \(g_i\) in the corpus is chosen as a proxy of d, we can write

$$ {p}(q\vert d) = \sum_{g_i} {p}(q \vert d, g_i) {p}(g_i\vert d). $$
(1)

If we assume that d’s passages are better proxies of d than passages not in d, then we can define

$$ \hat{p}(g_i\vert d) {\mathop{=}\limits^{def}} \left\{ \begin{array}{ll} {\frac{{p}(g_i\vert d)}{\sum_{g_j \in d} {p}(g_j\vert d)}} & \hbox {if } g_i \in d,\\ 0 & \hbox {otherwise}; \end{array}\right. $$

and use this estimate in Eq. 1 to rank d by

$$ {\rm Score}(d) {\mathop{=}\limits^{def}} \sum_{g_i \in d} {p}(q\vert d,g_i) \hat{p}(g_i \vert d). $$
(2)

To estimate \( p(q\vert d,g_i), \) we integrate \(p(q \vert d)\) and \(p(q \vert g_i)\) based on the assumed homogeneity of d, that is, the assumed focus on a single topic/issue throughout d. The more homogeneous d is assumed to be, the higher the impact it has as a “whole” on generating q. One way of performing such integration is to use the mixture-model-based estimate \(h^{[{\mathcal{M}}]}(d) {p}(q \vert d) + (1-h^{[{\mathcal{M}}]}(d))p(q \vert g_i),\) where \(h^{[{\mathcal{M}}]}(d)\) assigns a value in [0, 1] to d by homogeneity model \({\mathcal{M}}.\) (Higher values correspond to higher estimates of homogeneity; we present document-homogeneity measures in Sect. 2.3.) Using some probability algebra (and the fact that \(\sum_{g_i \in d} \hat{p}(g_i \vert d) = 1\)), Eq. 2 then becomes

$$ {\rm Score}(d) {\mathop{=}\limits^{def}} h^{[{\mathcal{M}}]}(d) {p}(q \vert d) + (1-h^{[{\mathcal{M}}]}(d))\sum_{g_i \in d} p(q \vert g_i)\hat{p}(g_i \vert d), $$
(3)

with more weight put on the “match” of d as a whole to the query as d is considered more homogeneous.

If we consider d to be highly heterogeneous and consequently set \(h^{[{\mathcal{M}}]}(d)= 0,\) and in addition use the relative importance (manually) attributed to \(g_i\) as a surrogate for \(\hat{p}(g_i \vert d),\) Eq. 3 then echoes a previously proposed ranking approach for (semi-)structured documents (Wilkinson 1994). On the other hand, if a uniform distribution is used for \(\hat{p}(g_i \vert d),\) that is, assuming that d’s passages are its equi-important proxies, then we score d by the mean query-similarity of its constituent passages:

$$ {\rm Score}_{mean}(d) {\mathop{=}\limits^{def}} \frac{1}{m} \sum_{g_i \in d} p(q \vert g_i). $$
(4)

Note that this ranking principle contradicts our motivation of ranking long/heterogeneous documents by potentially considering very few of their passages. Indeed, the experimental results that we present in Sect. 5 show that this model posts performance that is inferior to that of other models that we consider, which utilize only a single passage of each document for ranking it. Another potential alternative for dealing with the deficiency of this ranking principle is to sum, rather than average, the retrieval scores of the passages: \(\sum_{g_i \in d} p(q \vert g_i),\) as was done in some previous work (Hearst and Plaunt 1993). This scoring principle, which rewards long documents, can also be understood using our framework. That is, if we use for the document prior, p(d), the normalized value of its number of passages (where normalization is with respect to the number of passages in all documents in the corpus), and use p(q,d) for ranking rather than \(p(q \vert d)\) as we do throughout this section, then this ranking principle arises.

An alternative approach to assuming a uniform distribution for \(\hat{p}(g_i \vert d)\) is to bound Eq. 3 by

$$ \text{Score}_{\text{inter}\text{-}{\text{max}}}(d) {\mathop{=}\limits^{def}}h^{[{\mathcal{M}}]}(d) {p}(q \vert d) + (1-h^{[{\mathcal{M}}]}(d)) \max_{g_i \in d} p(q \vert g_i). $$
(5)

This scoring function is a generalized form of approaches that interpolate—using the same fixed interpolation weights for all documents—the document-query similarity score and the maximum query-similarity score assigned to any of its passages (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004); thus, using Eq. 5 we can see that these methods are based on the implicit assumption that all documents in the corpus are homogeneous to the same extent.

Now, assuming that d is highly heterogeneous and setting \(h^{[{\mathcal{M}}]}(d)=0\) in Eq. 5 yields a commonly-used approach that scores d by the maximum query-similarity estimated for any of its passages (Callan 1994; Wilkinson 1994; Kaszkiel and Zobel 2001; Liu and Croft 2002):

$$ {\rm Score}_{\rm max}(d) {\mathop{=}\limits^{def}}\max_{g_i \in d} p(q \vert g_i). $$
(6)

Alternatively, note that setting \(h^{[{\mathcal{M}}]}(d)=1\) in Eq. 5—i.e., assuming that d is highly homogeneous—results in a standard document-based ranking approach.
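
To make these ranking principles concrete, here is a minimal Python sketch of Eqs. 4, 5 and 6; the function names are ours, and we assume that the query likelihoods \(p(q \vert d)\) and \(p(q \vert g_i)\) have already been estimated (e.g., with the language models of Sect. 2.2).

```python
# A minimal sketch of the scoring rules in Eqs. 4-6. The query
# likelihoods p(q|d) and p(q|g_i) are assumed to be precomputed;
# all names are ours.

def score_mean(psg_likelihoods):
    # Eq. 4: the mean query-similarity of the document's m passages.
    return sum(psg_likelihoods) / len(psg_likelihoods)

def score_max(psg_likelihoods):
    # Eq. 6: score the document by its best-matching passage only.
    return max(psg_likelihoods)

def score_inter_max(doc_likelihood, psg_likelihoods, h):
    # Eq. 5: interpolate the document-level score with the maximum
    # passage score; h = h^[M](d) in [0,1] is the estimated document
    # homogeneity (h=0 recovers Eq. 6, h=1 document-based ranking).
    return h * doc_likelihood + (1.0 - h) * max(psg_likelihoods)
```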

2.2 Language-model-based algorithms

Following standard practice in work on language models for IR (Croft and Lafferty 2003), we estimate \(p(q \vert d)\) and \(p(q \vert g_i)\) using the unigram language models induced from d and \(g_i\), i.e., \(p_d(q)\) and \(p_{g_i}(q),\) respectively. Using these estimates in Eqs. 4, 5 and 6 yields the MeanPsg, InterMaxPsg and MaxPsg algorithms, respectively; the algorithms are summarized in Table 1.

Table 1 Passage-based document ranking algorithms vs. a standard document-based approach

We next present language model induction methods for estimating \(p_d(q)\) and \(p_{g_i}(q).\) The choice of induction method has a considerable impact on retrieval performance, as we show in Sect. 5.

2.2.1 Language model induction

Let \({\rm tf}(w \in x)\) denote the number of occurrences of term w in the text or text collection x. The maximum likelihood estimate (MLE) of w with respect to x is

$$ \widetilde{p}^{ MLE}_{x}({w}) {\mathop{=}\limits^{def}} \frac{{\rm tf}(w \in x)}{\sum_{w'} {\rm tf}(w' \in x)}. $$
(7)

As is common, to avoid the zero probability problem of terms not occurring in x, we smooth the estimate using corpus statistics (Zhai and Lafferty 2001)

$$ \widetilde{p}^{[base]}_{x}({w})= (1-\lambda_{{\mathcal C}}) \widetilde{p}^{ MLE}_{x}({w}) + \lambda_{{\mathcal C}} \widetilde{p}^{ MLE}_{\mathcal C}({w}). $$
(8)

(\(\lambda_{{\mathcal C}}\) is a free parameter.) Setting \(\lambda_{{\mathcal C}}\) to a fixed value yields the Jelinek-Mercer smoothing technique; alternatively, we can set \(\lambda_{{\mathcal C}} = \frac{\mu}{\vert x \vert + \mu}\) and get the Bayesian smoothing approach with Dirichlet priors (Zhai and Lafferty 2001). (\(\vert x \vert = \sum_{w'} {\rm tf}(w' \in x);\) μ is a free parameter.) Refer to Zhai and Lafferty (2001), Hiemstra (2002), and Zhai and Lafferty (2002) for further discussion of the importance of, and techniques for, smoothing language models in ad hoc document retrieval.

We extend the estimate just described to a sequence of terms \(w_1 w_2 \cdots w_n\) by using the unigram-language-model term-independence assumption

$$ p^{[base]}_{x}({w_1 w_2 \cdots w_n}) {\mathop{=}\limits^{def}} \prod_{j=1}^{n} \widetilde{p}^{[base]}_{x}({w_j}). $$
(9)
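
For illustration, here is a minimal Python sketch of Eqs. 7-9; the token-list inputs and helper names are our own assumptions, and in practice the term counts would be precomputed once rather than recounted per call.

```python
import math
from collections import Counter

def mle(w, tokens):
    # Eq. 7: maximum likelihood estimate of term w given token list x.
    return Counter(tokens)[w] / len(tokens)

def smoothed_prob(w, x_tokens, corpus_tokens, lam=None, mu=None):
    # Eq. 8: corpus-smoothed estimate. A fixed lam gives Jelinek-Mercer
    # smoothing; passing mu instead gives Dirichlet smoothing with
    # lam = mu / (|x| + mu).
    if lam is None:
        lam = mu / (len(x_tokens) + mu)
    return (1.0 - lam) * mle(w, x_tokens) + lam * mle(w, corpus_tokens)

def query_log_likelihood(query_tokens, x_tokens, corpus_tokens, lam=0.5):
    # Eq. 9 in log space: term independence turns the product into
    # a sum of per-term log probabilities.
    return sum(math.log(smoothed_prob(w, x_tokens, corpus_tokens, lam=lam))
               for w in query_tokens)
```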

Passage language model. Using the passage language model \(p^{[base]}_{g}({\cdot})\) results in the passage-query similarity estimate \(p^{[base]}_{g}({q}),\) which is used in the algorithms described above and depends only on terms within the passage and the query, along with some corpus statistics. However, since passages are relatively short spans of text, a passage representation—a language model in our case—can potentially be enriched with information from other passages in the document; such an approach can be especially effective for passages in documents that are somewhat homogeneous.

Indeed, some past work on question answering, and passage and XML retrieval (Abdul-Jaleel et al. 2004; Hussain 2004; Ogilvie and Callan 2004; Murdock and Croft 2005; Sigurbjörnsson and Kamps 2005; Wade and Allan 2005) uses a passage language model that exploits information from the containing document to the same fixed extent for all passages and documents. In contrast, here we suggest using the document's estimated homogeneity to control the amount of reliance on document information. (Recall that homogeneity measures were used in the InterMaxPsg algorithm for fusion of similarity scores.) Specifically, the more homogeneous the document is assumed to be, the more we “trust” information extracted from the document as a whole when inducing a passage language model. Consequently, for g ∈ d and homogeneity model \({\mathcal{M}}\) we define the passage language model

$$ \widetilde{p}^{[{\mathcal{M}}]}_{g}({w}) {\mathop{=}\limits^{def}} \lambda_{psg}(g) \widetilde{p}^{ MLE}_{g}({w}) + \lambda_{doc}(d) \widetilde{p}^{ MLE}_{d}({w}) + \lambda_{{\mathcal C}} \widetilde{p}^{ MLE}_{\mathcal C}({w}); $$
(10)

we fix \(\lambda_{{\mathcal C}}\) to some value, and set \(\lambda_{doc}(d) = (1-\lambda_{{\mathcal C}})h^{[{\mathcal{M}}]}(d)\) and \(\lambda_{psg}(g) = 1 - \lambda_{{\mathcal C}} -\lambda_{doc}(d) \) to have a valid probability distribution. We then extend this estimate to sequences as we did above

$$ p^{[{\mathcal{M}}]}_{g}({w_1 w_2 \cdots w_n}) {\mathop{=}\limits^{def}} \prod_{j=1}^{n} \widetilde{p}^{[{\mathcal{M}}]}_{g}({w_j}). $$
(11)

Setting \(h^{[{\mathcal{M}}]}(d) =0\)—considering d to be highly heterogeneous—we get the standard passage language model from Eq. 9. On the other hand, assuming d is highly homogeneous and setting \(h^{[{\mathcal{M}}]}(d) =1\) results in representing each of d’s passages with d’s standard language model from Eq. 9. Note that in this case the MaxPsg algorithm amounts to a standard document-based language model retrieval approach.
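
Summarizing the induction method, the following minimal sketch implements Eq. 10 (token lists and helper names are our assumptions, as in the previous snippet):

```python
from collections import Counter

def mle(w, tokens):
    # Eq. 7, as in the previous sketch.
    return Counter(tokens)[w] / len(tokens)

def passage_prob(w, psg_tokens, doc_tokens, corpus_tokens, h, lam_c=0.5):
    # Eq. 10: mix the passage, document, and corpus MLEs; the weight on
    # the containing document grows with the homogeneity estimate
    # h = h^[M](d). h=0 recovers the standard passage model, and h=1
    # reduces the passage model to the document's standard model.
    lam_doc = (1.0 - lam_c) * h
    lam_psg = 1.0 - lam_c - lam_doc
    return (lam_psg * mle(w, psg_tokens)
            + lam_doc * mle(w, doc_tokens)
            + lam_c * mle(w, corpus_tokens))
```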

2.3 Document-homogeneity measures

We now consider a few simple choices of models \({\mathcal{M}}\) for estimating document homogeneity. Specifically, we define functions \(h^{[{\mathcal{M}}]}: {\mathcal C} \rightarrow [0,1]\) with higher values corresponding to (assumed) higher levels of homogeneity. Recall that we proposed the homogeneity measures for two different roles. The first is controlling the balance between using document-based and passage-based information when inducing a passage language model. The second is controlling the integration of document-based and passage-based retrieval scores.

Long documents have often been considered as more heterogeneous than shorter ones (Singhal et al. 1996). Intuitively, the chances for content heterogeneity in a document increase as the number of terms it contains grows. We use the number of terms in a document as its length: \(\vert d \vert {\mathop{=}\limits^{def}} \sum_{w'}{{\rm tf}(w' \in d)},\) and formulate the following normalized length-based measure, where normalization is with respect to the longest document in the corpus

$$ h^{[{length}]}(d) {\mathop{=}\limits^{def}} 1 - \frac{\log \vert d \vert - \min_{d_i \in {\mathcal C}} \log \vert d_i \vert }{\max_{d_i \in {\mathcal C}} \log \vert d_i \vert - \min_{d_i \in {\mathcal C}} \log \vert d_i \vert }. $$

However, the length-based measure just described does not handle the case of short heterogeneous documents. We can alternatively say that d is more homogeneous if its term distribution is concentrated around a small number of terms (Kurland and Lee 2005). To model this idea, we use the entropy of d's unsmoothed language model:

$$ H(d) {\mathop{=}\limits^{def}} -\sum_{w \in d} \widetilde{p}^{ MLE}_{d}({w})\log \widetilde{p}^{ MLE}_{d}({w}); $$

note that higher values correspond to assumed lower levels of homogeneity. We then normalize the entropy with respect to the maximum possible entropy of a document of the same length as d, that is, a document in which each term occurs exactly once (in which case \(H(d) = \log \vert d\vert\)). Thus, the entropy-based homogeneity measure is defined as

$$ h^{[{ent}]}(d) {\mathop{=}\limits^{def}} \left\{ \begin{array}{ll} 1 - \frac{H(d)}{\log \vert d \vert}& \vert d \vert > 1;\\ 1 & \hbox {otherwise.} \end{array}\right. $$
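
A minimal sketch of these two whole-document measures follows; we represent documents as token lists and assume that the corpus-wide minimum and maximum log-lengths have been precomputed.

```python
import math
from collections import Counter

def h_length(doc_tokens, min_log_len, max_log_len):
    # h^[length]: normalized log document length, flipped so that
    # longer documents receive lower homogeneity estimates.
    log_len = math.log(len(doc_tokens))
    return 1.0 - (log_len - min_log_len) / (max_log_len - min_log_len)

def h_ent(doc_tokens):
    # h^[ent]: one minus the entropy of the document's unsmoothed
    # unigram model, normalized by the maximum entropy log|d|.
    n = len(doc_tokens)
    if n <= 1:
        return 1.0
    counts = Counter(doc_tokens)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)
```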

Both homogeneity measures just described are based on the document as a whole and do not explicitly estimate the variety among its passages. We can say, for example, that the more similar the passages of a document are to each other, the more homogeneous the document is. Alternatively, a document with passages highly similar to the document as a whole might be considered homogeneous.

To formally capture these two homogeneity notions, we assume that the passages of d are assigned unique IDs, and denote the tf.idf vector-space representation of text x as x (Salton 1968). We can then define these homogeneity notions using

$$ h^{[{interPsg}]}(d) {\mathop{=}\limits^{def}} \left\{ \begin{array}{ll} \frac{2}{m(m-1)} \sum_{i< j;g_i,g_j \in d} \cos({\bf g_i}, {\bf g_j}) & {\hbox {if}} \,m>1 \\ 1 & \hbox{otherwise;} \end{array}\right. $$

and

$$ h^{[docPsg]}(d) {\mathop{=}\limits^{def}} \frac{1}{m} \sum_{g_i \in d} \cos({\bf d},{\bf g_i}), $$

respectively.
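
A minimal sketch of these two passage-based measures, assuming the tf.idf vectors are represented as sparse term-to-weight dictionaries:

```python
import math

def cosine(u, v):
    # Cosine similarity between sparse {term: weight} vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm > 0 else 0.0

def h_inter_psg(psg_vecs):
    # h^[interPsg]: mean pairwise cosine between d's m passages; the
    # 2/(m(m-1)) factor in the definition is one over the pair count.
    m = len(psg_vecs)
    if m <= 1:
        return 1.0
    sims = [cosine(psg_vecs[i], psg_vecs[j])
            for i in range(m) for j in range(i + 1, m)]
    return sum(sims) / len(sims)

def h_doc_psg(doc_vec, psg_vecs):
    # h^[docPsg]: mean cosine between the document and its passages.
    return sum(cosine(doc_vec, g) for g in psg_vecs) / len(psg_vecs)
```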

Although it is clear that the document homogeneity measures interPsg and docPsg are connected, they might differ in the magnitude of assumed homogeneity that they assign to documents. In fact, the docPsg measure can be viewed as more conservative than the interPsg measure. For example, consider a document with two passages that do not share any terms; in this case \( h^{[{interPsg}]}(d) =0,\) as there is no term overlap between the passages, while \(h^{[{docPsg}]}(d)>0,\) as each passage bears some similarity to the document as a whole.

We also note that trying to model the latter two homogeneity notions utilizing a (normalized) version of the KL divergence between language models yielded retrieval performance substantially inferior to that resulting from using a vector-space representation with the cosine measure. This finding echoes to some extent previous reports on the improved effectiveness of cosine with respect to a symmetrized version of the KL divergence (e.g., the J divergence) in settings wherein inter-textual similarities are modeled (Diaz 2005; Kurland 2006).

Finally, we hasten to point out that while the homogeneity measures presented above are quite simple, our focus in this paper is on studying whether homogeneity modeling can help to improve retrieval effectiveness using our models, rather than on devising and utilizing sophisticated homogeneity measures. Devising such novel measures, and adapting existing ones from work in the area of natural language processing (e.g., Barzilay and Lee (2004)) for use with our retrieval models, is an interesting avenue for future work.

3 Related work

A common use of passage-based information, on which we focus in this paper, is ad hoc (query-based) document retrieval. As mentioned in Sect. 1, long and/or topically heterogeneous relevant documents might contain (lots of) information not pertaining to the query. Hence, estimating relevance in these cases using passage-based information might be more beneficial than only comparing the document as a whole to the query (Salton et al. 1993; Callan 1994; Wilkinson 1994; Kaszkiel and Zobel 1997, 2001; Liu and Croft 2002; Cai et al. 2004; Bendersky and Kurland 2008a; Na et al. 2008; Wan et al. 2008; Wang and Si 2008).

One of the most common approaches for document retrieval using passage-based information is based on scoring a document by the highest query-similarity that any of its passages exhibits (Callan 1994; Wilkinson 1994; Kaszkiel and Zobel 2001; Liu and Croft 2002; Na et al. 2008). Interpolating this similarity score with a document-query similarity score (using fixed interpolation weights) is another recurring retrieval approach (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004; Wan et al. 2008; Wang and Si 2008). We showed in Sect. 2 that these two retrieval approaches can be understood using the same probabilistic model. Furthermore, in Sect. 5 we demonstrate the relative merits of using homogeneity-based interpolation coefficients for the interpolation just described.

Passage-based information has also been utilized in other information retrieval tasks. For example, instead of returning documents in response to a query, one can return passages that supposedly contain pertaining information (Denoyer et al. 2001; Allan 2003; Jiang and Zhai 2004; Murdock and Croft 2005; Wade and Allan 2005). Retrieving passages is also common in work on question answering (Corrada-Emmanuel et al. 2003; Lin et al. 2003; Tellex et al. 2003; Hussain 2004; Zhang and Lee 2004; Otterbacher et al. 2005), wherein answers are extracted (or compiled) from passages that are deemed relevant to the question at hand. While we demonstrate the merits of using our proposed passage language model for ad hoc document retrieval, it can also potentially be used in any task that calls for passage retrieval.

Liu and Croft (2002) studied the utilization of a standard passage language model (refer back to Eq. 9 in Sect. 2.2) for document retrieval. They rank a document by the highest query likelihood assigned by any of the document's constituent passage models. (This is the MaxPsg retrieval algorithm.) We demonstrate the merits of using our homogeneity-based passage model instead of the standard passage language model in the MaxPsg algorithm in Sect. 5.2. Liu and Croft (2002) also proposed methods for constructing and utilizing passage-based relevance models (Lavrenko and Croft 2001) using the standard passage model. Similar (and somewhat improved) passage-based relevance models were later used by Corrada-Emmanuel et al. (2003) and Li and Zhu (2008) for passage and document retrieval. We show in Sect. 5.2 that our passage model is more effective for constructing and utilizing passage-based relevance models for document retrieval than the standard passage model is.

Inducing a passage language model by utilizing information drawn from the containing document was shown to be effective in work on sentence and passage retrieval and question answering (Abdul-Jaleel et al. 2004; Hussain 2004; Murdock and Croft 2005; Wade and Allan 2005), and XML retrieval (Ogilvie and Callan 2004; Sigurbjörnsson and Kamps 2005). In contrast to our approach that utilizes document-homogeneity measures for controlling the reliance on document information, these approaches fix this reliance to the same extent for all documents, thereby making the implicit assumption that all documents are homogeneous to the same extent. We demonstrate the relative merits of our passage-model induction approach in Sect. 5.2.

In a related vein, there is some recent work (Wan et al. 2008) on inducing a document language model by using information both from the document as a whole and from the document's passage that is the most similar to the query; the induced document models are then used for document ranking. The proposed document language model is the same model proposed for passages in the work described above (Abdul-Jaleel et al. 2004; Hussain 2004; Murdock and Croft 2005; Wade and Allan 2005), and can be viewed as a specific case of our proposed passage language model when implemented with degenerate homogeneity measures, that is, fixed interpolation weights. It is also interesting to note that using the MaxPsg algorithm with our passage language model is reminiscent of the approach proposed by Wan et al. (2008). Specifically, implementing MaxPsg with our passage model can conceptually be viewed as a two-step retrieval process: we first find for each document its passage that is most similar to the query (by using the passage model from Eq. 10), and then use Eq. 10 again, this time as the language model of the document, for performing document ranking. However, while Wan et al. (2008) use fixed interpolation weights for language model induction, we utilize homogeneity measures to control the interpolation.

One of our document-homogeneity measures (interPsg) is based on measuring within-document inter-passage similarities. Such similarities were also used in a recently-proposed passage-based document-retrieval discriminative approach (Wang and Si 2008). Specifically, the goal is to ensure that the relevance status of different passages within the same document will not be independent. Furthermore, since the suggested discriminative approach is based on language-model retrieval of passages using the standard passage language model, using our proposed passage language model instead can potentially help to further improve the effectiveness of this method. Inter-passage similarities within a document were also utilized for semantic decomposition of documents (Hearst and Plaunt 1993; Ashoori et al. 2007). Inter-passage similarities between passages that are not necessarily parts of the same document have been used for document re-ranking and question answering using graph-based approaches (Otterbacher et al. 2005; Wan et al. 2008).

Finally, we note that in previous work various passage types have been either automatically or manually identified in documents. Among these are discourse passages, which are based on document markup (e.g., sentences, paragraphs or SGML/HTML markup) (Salton and Buckley 1991; Callan 1994; Wilkinson 1994; Cai et al. 2004; Hussain 2004); semantic passages, which are often identified based on presumed shifts between topics within a document (Hearst and Plaunt 1993; Mittendorf and Schäuble 1994; Ponte and Croft 1997; Denoyer et al. 2001; Jiang and Zhai 2004); and fixed (or variable) length window passages (Callan 1994; Kaszkiel and Zobel 1997; Liu and Croft 2002; Wade and Allan 2005; Na et al. 2008; Wang and Si 2008). As noted earlier, our retrieval models could potentially be used with different passage types.

4 Experimental setup

In what follows we present an empirical evaluation designed to explore the relative merits (or lack thereof) of the methods and language models that we presented in Sect. 2. We first describe in this section the experimental setup used for evaluation. In Sect. 5 we present the experimental results.

We conducted our experiments on the following four TREC corpora.

Corpus     # of docs   Avg. doc. length   Queries   Disk(s)
FR12       45,820      935                51–100    1, 2
LA+FR45    187,526     317                401–450   4, 5
WSJ        173,252     263                151–200   1–2
AP89       84,678      264                1–50      1

FR12, which is a common testbed in work on passage-based document retrieval (Callan 1994; Wilkinson 1994; Liu and Croft 2002; Dang et al. 2007; Wang and Si 2008), and LA+FR45, which is known to be a very hard benchmark with TREC8 queries (Hu et al. 2003; Kurland et al. 2005), contain documents that are considered heterogeneous due to the FR component. That is, the FR collections contain long documents that often span different subjects. Documents in AP89 and WSJ, on the other hand, are considered as relatively homogeneous; these corpora were also used in previous work on passage-based document retrieval (Dang et al. 2007; Wang and Si 2008).

Relevance judgments for TREC data, that is, the relevance of a document with respect to an information need underlying a query as determined by human annotators, can be established based on a single small piece of evidence in a document. (Refer to the TREC guidelines for human annotators (Voorhees and Harman 2005).) Thus, while it might seem at first glance that passage-based document retrieval models have an advantage over document-based retrieval approaches with respect to TREC data due to this annotation approach, we show, as in previous work, that for several corpora the reverse holds—i.e., document-based retrieval posts superior performance; moreover, we demonstrate the merits of using information induced from the document as a whole in our passage-based retrieval approaches.

We used the Lemur toolkit (http://www.lemurproject.org) to run our experiments. We applied basic tokenization and Porter stemming, and removed INQUERY stopwords (Allan et al. 2000). Titles of TREC topics serve as queries.

To evaluate the effectiveness of the various algorithms we use mean average (non-interpolated) precision at 1000 (MAP) and the precision of the top 5 and 10 documents (p@5, p@10). MAP is a widely accepted metric for evaluating the quality of retrieval functions; p@5 and p@10 measure the ability of retrieval methods to position relevant documents at the very high ranks of the retrieved results, and hence, are often regarded as “user-oriented” measures (Voorhees and Harman 2005). We determine statistically significant differences in performance using the two-tailed Wilcoxon test at the 95% confidence level.

Passages. While there are several types of passages that we can implement our algorithms with (refer back to Sect. 3), our focus in this paper is on the general validity of our retrieval algorithms and language-model induction techniques. Therefore, we use half overlapping fixed-length windows of sizes 150, 50 and 25 as passages and mark them prior to retrieval time. Such passages are computationally convenient to use and were shown to be quite effective for document retrieval (Callan 1994; Wang and Si 2008), specifically, in the language model framework (Liu and Croft 2002).
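
For concreteness, here is a minimal sketch of marking such passages over a tokenized document; the handling of the final, possibly shorter, window is our own choice.

```python
def half_overlapping_windows(tokens, size):
    # Fixed-length windows of `size` terms with 50% overlap
    # (stride = size // 2); we use size in {150, 50, 25}.
    stride = size // 2
    starts = range(0, max(len(tokens) - stride, 1), stride)
    return [tokens[s:s + size] for s in starts]
```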

Reference comparisons. The various methods that we presented in Sect. 2 utilize passage-based information for ranking documents. A natural question is how they compare with the (standard) approach of ranking documents based on their match, as whole units, to the query. To answer this question, we use Doc[base]—a language-model document-retrieval approach that scores document d by \(p^{[base]}_{d}(q)\)—as a reference comparison to our methods.

The second reference comparison that we consider is the MaxPsg algorithm implemented with the standard passage language model \((p^{[base]}_{g}({\cdot}));\) we use MaxPsg[base] to denote this method, which was proposed by Liu and Croft (2002). Also, recall from Sect. 2 that MaxPsg[base] is a language-model instantiation of a commonly-used passage-based document ranking approach (Buckley et al. 1994; Callan 1994; Wilkinson 1994; Cai et al. 2004). (Refer back to Table 1 for a specification of the different scoring methods, specifically, that of the MaxPsg algorithm.)

Parameter settings. All our algorithms incorporate a single free parameter \(\lambda_{{\mathcal C}},\) which controls the amount of reliance on corpus-based statistics for smoothing. (Refer back to Sect. 2.2 for details.) To establish a fair comparison between our algorithms’ performance and that of the reference comparisons just described, we take the following approach. We set \(\lambda_{{\mathcal C}}\) in all algorithms (unless otherwise specified) to a value in {0.1,…, 0.9} that results in (near) optimal MAP performance for both reference comparisons; specifically, \(\lambda_{{\mathcal C}}=0.5\) yields such (near) optimal performance for all corpora. Furthermore, \(\lambda_{{\mathcal C}}=0.5\) was also used in some recent work on utilizing passage-based language models (Wan et al. 2008; Wang and Si 2008). Consequently, we note that the performance of our suggested models is not necessarily the optimal one they can attain. Indeed, our focus here is on studying the potential merits and characteristics of the underlying principles of our suggested methods, specifically, the ways of integrating document-based and passage-based information, rather than on tuning parameter values and optimizing performance.

We note that setting \(\lambda_{{\mathcal C}}\) to a fixed value as just described results in our reference comparisons (Doc[base] and MaxPsg[base]) utilizing Jelinek-Mercer smoothing (refer back to Sect. 2.2), which is somewhat less effective than Dirichlet smoothing when using short queries (Zhai and Lafferty 2001). However, recall that our homogeneity-based passage language model integrates (via interpolation) the unsmoothed document and passage language models on which Doc[base] and MaxPsg[base] are based, respectively, with the corpus model. Moreover, the standard document and passage language models, when Jelinek-Mercer smoothing is employed, are specific instantiations of our new passage language model with degenerate homogeneity measures, as mentioned in Sect. 2.2. Hence, setting \(\lambda_{{\mathcal C}}\) to a fixed value yields the most potentially “clean” comparison between our passage model and the (unsmoothed) language models it leverages; such a language-model comparison would be somewhat harder if Dirichlet smoothing were employed, and it is left for future work. (Refer to Murdock and Croft (2005) for similar arguments about language model comparison.) We hasten to point out, however, that in Sect. 5.3 we do compare the performance of Doc[base] and MaxPsg[base] when implemented with Dirichlet smoothing with that of our InterMaxPsg algorithm, which fuses their retrieval scores based on the homogeneity measures.

5 Experimental results

We now turn to examine the performance numbers of the different methods that we proposed and those of the reference comparisons. We first compare the effectiveness of the MeanPsg and MaxPsg algorithms in Sect. 5.1. Then, we study the performance results of using our proposed passage language model in Sect. 5.2; the study includes comparison to previous approaches for deriving passage language models, and an analysis of the effectiveness of using passage-based relevance models with our proposed passage model. In Sect. 5.3 we study the performance of our suggested InterMaxPsg algorithm, and in Sect. 5.4 we explore its integration with our passage language model.

5.1 Initial comparison

We derived the MaxPsg and MeanPsg algorithms from the same model in Sect. 2. While the former ranks a document by the query-similarity score of its passage that exhibits the best match to the query, the latter ranks a document based on the average match of its passages to the query. To contrast these two paradigms and to compare them with the approach of ranking a document based on its match as a whole to the query (i.e., standard document-based retrieval), we use the standard document and passage language models \(p^{[base]}_{x}({\cdot}).\) Hence, the implementation of MaxPsg and the standard document retrieval approach are the MaxPsg[base] and Doc[base] algorithms, respectively—our reference comparisons. We use MeanPsg[base] to denote the implementation of MeanPsg with the standard passage language model. The performance comparison of the methods is presented in Table 2.

Table 2 Performance comparison of a standard document-based retrieval approach (Doc[base]) with the MaxPsg (MaxPsg[base]) and MeanPsg (MeanPsg[base]) algorithms

The performance numbers in Table 2 clearly indicate the inferiority of the MeanPsg algorithm to using both standard document-based retrieval and the MaxPsg algorithm. This finding is in accordance with the observation motivating the work presented in this paper that (long/heterogeneous) relevant documents may contain several parts not pertaining to the query.

In comparing Doc[base] with MaxPsg[base] we observe the following patterns. The performance of Doc[base] is clearly superior to that of MaxPsg[base] on AP89, which contains homogeneous documents. For FR12, which contains long heterogeneous documents, MaxPsg[base] is superior to Doc[base] except when using passages of size 25, which are detrimental to the performance of MaxPsg[base]. (Such relatively short passages were shown in previous work on passage-based document retrieval to be much less effective than longer ones (Callan 1994; Liu and Croft 2002).) In addition, Doc[base] is somewhat more effective than MaxPsg[base] on LA+FR45, while the reverse holds for WSJ when considering passages of size 150. (However, the performance differences are not statistically significant.)

The findings above resonate with the hypothesis (see Sect. 2) that retrieval performance can be enhanced by ranking a document based on only one of its passages—the one that is the most similar to the query—especially when ranking long/heterogeneous documents. These findings are also in line with previous reports that standard document retrieval is more effective than passage-based document retrieval for corpora containing homogeneous documents, while the reverse holds for corpora containing heterogeneous documents (Callan 1994; Liu and Croft 2002).

We proposed two approaches in Sect. 2 for (automatically) handling this “document-homogeneity effect” on the effectiveness of passage-based document retrieval. The first is utilizing our homogeneity-based passage language model, the effectiveness of which we explore in the next section. The second is homogeneity-based fusion of document-query and passage-query similarity scores, which is the basis of our InterMaxPsg algorithm; we study this algorithm’s effectiveness in Sect. 5.3.

5.2 Homogeneity-based passage language models

The observations made above, along with those in previous reports (Callan 1994; Liu and Croft 2002), imply that document homogeneity might have considerable impact on the (relative) effectiveness of the MaxPsg algorithm when implemented with the standard passage language model. We therefore turn to study the merits of using our homogeneity-based passage language model for implementing MaxPsg. We use MaxPsg\([{\mathcal{M}}]\) to denote this implementation (\({\mathcal{M}}\) is the homogeneity measure). Recall that the reference comparisons Doc[base] and MaxPsg[base] are specific instantiations of MaxPsg\([{\mathcal{M}}]\) with degenerate homogeneity measures (\(h^{[{\mathcal{M}}]}(d) \equiv 1\) and \(h^{[{\mathcal{M}}]}(d) \equiv 0,\) respectively).

Table 3 presents the performance numbers of MaxPsg\([{\mathcal{M}}]\) and of the reference comparisons. We can see in Table 3 that the MaxPsg algorithm is consistently more effective when utilizing our homogeneity-based passage language model than when using the standard passage language model. Indeed, when considering MAP, MaxPsg\([{\mathcal{M}}]\) is superior to MaxPsg[base] in almost all relevant comparisons (4 corpora × 4 homogeneity measures × 3 passage lengths). For the p@5 and p@10 metrics, MaxPsg\([{\mathcal{M}}]\) posts performance that is at least as good as that of MaxPsg[base] in a majority of the cases. In many cases, the performance improvements posted by our passage language model are also statistically significant. (Refer to the numbers for AP89 and WSJ, for example.)

Table 3 Performance numbers of the MaxPsg algorithm when implemented with either a standard passage language model (MaxPsg[base]) or our new passage language model (MaxPsg\([{\mathcal{M}}])\)

Another observation that we make based on Table 3 is that, in general, the best performing document homogeneity measures for inducing the passage model are length—demonstrating its correlation with heterogeneity (Singhal et al. 1996)—and docPsg, which measures the similarity between a document and its constituent passages; the latter finding is not surprising as docPsg is directly related to the balance of document-based and passage-based information that we want to control when inducing the passage language model. We also note that in many cases the performance posted by the length and docPsg measures is better to a statistically significant degree than that of the other two homogeneity measures.

To further study the connection between the different homogeneity measures, we present in Table 4 the correlation (as measured by Pearson's coefficient) between the homogeneity values they assign to the documents in each corpus. As can be seen, there is, in general, relatively high correlation between the values assigned by length, docPsg and interPsg; and there is low correlation between the values assigned by these measures and those assigned by the ent measure. These findings further attest to the correlation between the length of a document and the content variety among its passages as measured by docPsg and interPsg.

Table 4 Pearson correlation between values assigned by the document-homogeneity measures

Going back to Table 3, we can also see that very short passages (of 25 terms) are the worst choice among the three we consider. This finding is in line with some previous reports (Callan 1994; Liu and Croft 2002). However, using our passage language model ameliorates (to some degree) the performance decay caused by the use of short passages when compared to using the standard passage language model.

Our best performing methods, MaxPsg[length] and MaxPsg[docPsg], are both superior in most relevant comparisons (4 corpora × 3 evaluation metrics) to the standard document-based retrieval method, Doc[base], when passages of 50 terms are used. Sometimes, the performance improvements are also statistically significant. (See, for example, the MAP performance for FR12.) Even for AP89, which is considered to contain highly homogeneous documents, MaxPsg[length] and MaxPsg[docPsg] post performance that is statistically indistinguishable (for passages of 50 and 150 terms) from that of Doc[base] (although lower in terms of MAP); on the other hand, the MAP performance of MaxPsg[base] is worse to a statistically significant degree than that of Doc[base] on AP89.

It is also important to note that the effectiveness of our passage language model that integrates document and passage information cannot be attributed to one of the two being consistently more effective than the other. For example, the performance numbers for FR12 and LA+FR45 show that MaxPsg[length] is superior in most of the relevant comparisons to both MaxPsg[base] and Doc[base]; however, MaxPsg[base] is superior to Doc[base] on FR12 and inferior to Doc[base] on LA+FR45.

All in all, perhaps the most important conclusion we can draw from Table 3 is that our homogeneity measures help to (automatically and) effectively integrate document and passage information for constructing our passage language model for use in the MaxPsg algorithm.

The importance of homogeneity measures. We derived our passage language model by using information from the containing document to an extent controlled by document-homogeneity measures. To study the importance of using homogeneity measures to control the reliance on document-based information, we examine an alternative of fixing this reliance to the same extent for all documents. Hence, such an alternative, which echoes some previous methods for inducing passage language models (Abdul-Jaleel et al. 2004; Hussain 2004; Murdock and Croft 2005; Wade and Allan 2005), is based on the implicit assumption that all documents in the corpus are homogeneous to the same extent.

To study this alternative, we set \( h^{[{\mathcal{M}}]}(d)\) (the estimated homogeneity of d) to a fixed value in {0, 0.2,…, 1} for all \(d \in {\mathcal C}.\) Doing so results in fixing \(\lambda_{doc}(d)\) in Eq. 10 (Sect. 2.2) to a value in \(\{0,0.1,\ldots,0.5\},\) because \(\lambda_{doc}(d) = (1-\lambda_{{\mathcal C}})h^{[{\mathcal{M}}]}(d)\) and \(\lambda_{{\mathcal C}}=0.5.\) Note that setting \(\lambda_{doc}(d)\) to 0 or 0.5 and using the MaxPsg algorithm amounts to using the reference comparisons MaxPsg[base] and Doc[base], respectively.

Figure 1 depicts the MAP performance curve of the MaxPsg algorithm when fixing \(\lambda_{doc}(d)\) to the same value for all documents. We also plot for comparison the performance of our best performing homogeneity-based methods, MaxPsg[docPsg] and MaxPsg[length], with horizontal lines.

Fig. 1 The MAP performance of the MaxPsg algorithm. We either set \(\lambda_{doc}(d)\) (see Eq. 10 in Sect. 2.2) to a fixed value for all documents (curve), or use the length and docPsg homogeneity measures as in our original proposal (thin and thick horizontal lines, respectively). Note: Figures are not to the same scale

We can see in Fig. 1 that using homogeneity measures helps to avoid the relatively poor performance obtained by a bad choice of a fixed \(\lambda_{doc}(d)\). (Note, for example, that for FR12 the worst choice amounts to Doc[base], while for LA+FR45 and AP89 it amounts to MaxPsg[base], when using passages of 50 and 150 terms.) Furthermore, for the FR12, LA+FR45 and WSJ corpora, using the homogeneity measures results in performance that is near (or even better than) the optimal performance obtained using a fixed \(\lambda_{doc}(d)\) when utilizing passages of 50 and 150 terms. It is also important to note that while the performance improvements posted by the homogeneity-based approaches over the choice of fixed values of \(\lambda_{doc}(d)\) are sometimes small in absolute terms, some of them are statistically significant. For instance, for passages of 50 terms, MaxPsg[length]'s performance is better to a statistically significant degree than that of setting: (i) \(\lambda_{doc}(d)=0\) for all corpora, (ii) \(\lambda_{doc}(d)=0.5\) for FR12, (iii) \(\lambda_{doc}(d)=0.1\) for AP89, FR12 and WSJ, and (iv) \(\lambda_{doc}(d)=0.1\) for FR12; in addition, MaxPsg[docPsg]'s performance is better to a statistically significant degree than that of setting \(\lambda_{doc}(d)=0\) for WSJ and AP89, \(\lambda_{doc}(d)=0.5\) for FR12, and \(\lambda_{doc}(d)=0.1\) for AP89 and WSJ.

Passage-based relevance models. There has been some work in the language modeling framework on utilizing the standard passage language model (see Eq. 8 in Sect. 2.2) for defining relevance models (Liu and Croft 2002). Relevance models are a state-of-the-art pseudo-feedback-based approach to query expansion in the language modeling framework (Lavrenko and Croft 2001). Liu and Croft (2002) present several methods for constructing and utilizing passage-based relevance models for document retrieval. Their most effective method is to construct a relevance model \({\mathcal R}\) using only passages, and then to score \(d \in {\mathcal C}\) by the Kullback-Leibler (KL) divergence: \(\min_{g_i \in d} D\left({\widetilde{p}_{\mathcal R}(\cdot)} \; \Big\vert\Big\vert {\widetilde{p}^{[base]}_{g_i}({\cdot})}\right). \) This ranking algorithm can be viewed as a special case of the MaxPsg algorithm wherein the query q is replaced with the relevance model \({\mathcal R}.\)
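
A minimal sketch of this ranking rule follows; the construction of the relevance model itself is omitted, and we assume that both \({\mathcal R}\) and the (smoothed) passage models are given as term-to-probability dictionaries in which every term of \({\mathcal R}\) receives nonzero passage probability.

```python
import math

def kl_divergence(p_rel, p_psg):
    # D(p_R || p_g), summed over the relevance model's support; p_psg
    # is assumed smoothed (Eq. 8 or Eq. 10), so p_psg[t] > 0.
    return sum(p * math.log(p / p_psg[t]) for t, p in p_rel.items() if p > 0)

def rel_psg_score(p_rel, psg_models):
    # The RelPsg rule: represent a document by its passage closest (in
    # KL divergence) to the relevance model; documents are then ranked
    # by ascending divergence.
    return min(kl_divergence(p_rel, p_g) for p_g in psg_models)
```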

We now turn to study the merits (or lack thereof) of using our homogeneity-based passage models for constructing and utilizing passage-based relevance models. Specifically, we compare the implementation of Liu and Croft (2002), denoted RelPsg, which utilizes the standard passage language model, to an implementation, denoted RelPsg\([{\mathcal{M}}],\) which utilizes our new passage language model. We also use the standard document-based relevance model (Lavrenko and Croft 2001), denoted RelDoc, as a reference comparison.

We (independently) optimized the performance of each of the reference comparisons (RelPsg and RelDoc) with respect to the number of top-retrieved elements (i.e., passages or documents) and the number of terms used for constructing the relevance models. Specifically, we select these parameters' values from the set {25, 50, 75, 100, 250, 500}—i.e., a total of 36 parameter settings—so as to optimize MAP performance over the tested set of queries. We set \(\lambda_{{\mathcal C}}=0.5\) (as in Sect. 4) except for estimating the top-retrieved elements' language models used for constructing \({\mathcal R},\) wherein we set \(\lambda_{{\mathcal C}}=0.2\) following previous recommendations (Lavrenko and Croft 2001). To set the parameters' values for our RelPsg\([{\mathcal{M}}]\) algorithms we use those chosen for the RelPsg baseline—i.e., we have not optimized the performance of our methods, but rather that of the reference comparisons.

Table 5 depicts the performance numbers of the different relevance models. (We present results for the length and docPsg measures. Results for the other measures can be found in Table 8 in the Appendix.) We can see in Table 5 that in a majority of the relevant comparisons, using our passage language model yields relevance models (RelPsg\([{\mathcal{M}}])\) that outperform both the original one that is based on a standard passage language model (RelPsg) and the document-based relevance model (RelDoc). (Observe, for example, that the underlined numbers that constitute best performance for a corpus per evaluation metric appear almost exclusively in RelPsg\([{\mathcal{M}}]\) rows.) Furthermore, in many of the comparisons, the performance differences are also statistically significant; see the case for passages of 50 terms, for example.

Table 5 Performance numbers of passage-based relevance models (Liu and Croft 2002). We use either the originally suggested basic passage language model (RelPsg) (Liu and Croft 2002) or our homogeneity-based passage language model (RelPsg\([{\mathcal{M}}])\)

5.3 The InterMaxPsg algorithm

Heretofore, the empirical evaluation focused on our suggested passage language model. This passage model utilizes document-homogeneity measures for controlling the reliance on document-based information when inducing the passage model. We now turn to explore the effectiveness (or lack thereof) of our InterMaxPsg algorithm, which scores a document by interpolating the scores assigned to it by a standard document-based language model approach and by the MaxPsg algorithm; the interpolation is governed by document-homogeneity measures. (Refer back to Sect. 2.2 for details.)

Our first order of business (and focus) is to explore whether document-homogeneity measures are effective means for controlling the score integration performed by the InterMaxPsg algorithm. Hence, we use for implementation the standard document and passage language models (see Eq. 9 in Sect. 2.2). Consequently, the scores interpolated by the InterMaxPsg algorithm under this implementation are those derived from the Doc[base] and MaxPsg[base] reference comparisons. To smooth the standard document and passage language models, we use either Jelinek-Mercer smoothing with \(\lambda_{{\mathcal C}}=0.5\) (as we did above), or Dirichlet smoothing with μ = 1000 following previous recommendations (Zhai and Lafferty 2001; Liu and Croft 2002; Fang and Zhai 2005). (See Sect. 2.2 for details regarding smoothing.)

Table 6 depicts the performance results of the InterMaxPsg algorithm with the two smoothing techniques. We use \(\text{IMaxPsg}_{\mathcal{M}}[base]\) to denote the implementation of InterMaxPsg with the \({\mathcal{M}}\) homogeneity measure (for score integration) and the standard document and passage language models. (We present the results for the length and docPsg measures that, in general, yield the most effective performance. Results for the other two measures can be found in Table 9 in the Appendix.)

Table 6 Performance numbers of the InterMaxPsg algorithm when implemented with standard document and passage language models

We see in Table 6 that our InterMaxPsg algorithm is superior in most relevant comparisons to the MaxPsg method when the standard passage language model is used, especially with respect to MAP.

It is important to note that we cannot attribute the performance improvements posted by \(\text{IMaxPsg}_{\mathcal{M}}[base]\) over MaxPsg[base] solely to cases wherein the former interpolates the retrieval score assigned by the latter with that derived from a better performing algorithm—Doc[base]. For example, for FR12, MaxPsg[base] is clearly superior to Doc[base], while for LA+FR45 the reverse holds; however, in a majority of the comparisons the \(\text{IMaxPsg}_{\mathcal{M}}[base]\) algorithms post performance better than that of MaxPsg[base] for both corpora.

We can also see in Table 6 that passages of 50 terms result in near optimal performance with respect to other passage lengths. This finding is in line with those regarding our passage language model from Sect. 5.2.

Another observation that we make based on Table 6 is that when using passages of 50 terms with Dirichlet smoothing, which outperforms Jelinek-Mercer smoothing in most cases, \(\text{IMaxPsg}_{length}[base]\) and \(\text{IMaxPsg}_{docPsg}[base]\) post better performance than that of the standard document-based retrieval approach (Doc[base]) in many of the relevant comparisons (4 corpora × 3 evaluation metrics); in several cases (e.g., for MAP over LA+FR45), the performance differences are also statistically significant.

It is also interesting to note that in some cases it is beneficial to integrate passage-based retrieval scores with document-based retrieval scores even for the AP89 corpus, which is considered to contain homogeneous documents. For instance, compare the performance of \(\text{IMaxPsg}_{docPsg}[base]\) with that of Doc[base] for AP89 when Dirichlet smoothing is employed.

The effectiveness of homogeneity measures for integration of retrieval scores. Analogously to the analysis we performed in Sect. 5.2, we now examine the relative effectiveness of using homogeneity measures for controlling the document-passage score-integration in the InterMaxPsg algorithm. To that end, we study an alternative that fixes the balance between the document-based and passage-based query-similarity scores, making the assumption that all documents in the corpus are homogeneous to the same extent. We do so by setting \(h^{[{\mathcal{M}}]}(d)\) to a fixed value for all \(d \in {\mathcal C}.\) (Refer back to Table 1 in Sect. 2.2.) Using such fixed interpolation weights for score-fusion is reminiscent of some previous methods for passage-based document ranking (Callan 1994; Wang and Si 2008). Furthermore, observe that setting \(h^{[{\mathcal{M}}]}(d)\) to 0 or 1 and using the InterMaxPsg algorithm amounts to using MaxPsg[base] and Doc[base], respectively—our two reference comparisons.

Figure 2 depicts the MAP performance curve of InterMaxPsg when setting \(h^{[{\mathcal{M}}]}(d)\) to the same fixed value in {0, 0.1,…, 1} for all \(d \in {\mathcal C}.\) We also plot for comparison the performance of \(\text{IMaxPsg}_{length}[base]\) and \(\text{IMaxPsg}_{docPsg}[base]\) with thin and thick horizontal lines, respectively. (Jelinek-Mercer smoothing is employed.)

Fig. 2 The MAP performance of the InterMaxPsg algorithm. The curve corresponds to setting \(h^{[{\mathcal{M}}]}(d)\) to the same fixed value in {0, 0.1,…, 1} for all \(d \in {\mathcal C}\) (0 and 1 correspond to MaxPsg[base] and Doc[base], respectively). We draw for reference the resultant performance of using the \({\mathcal{M}}={length}\) and \({\mathcal{M}}={docPsg}\) homogeneity measures (thin and thick horizontal lines, respectively). Note: figures are not drawn to the same scale

We can see in Fig. 2 that for longer passages (i.e., of 150 and 50 terms), using the homogeneity measures helps, at the very least, to avoid the relatively poor performance that results from a bad choice of a constant \(h^{[{\mathcal{M}}]}(d)\), as is the case on AP89, for example. In some cases (see FR12 and WSJ), using homogeneity measures with passages of 150 and 50 terms yields performance that is near, or even better than, the optimal performance obtained by using a fixed value of \(h^{[{\mathcal{M}}]}(d)\). It is also important to note that while the performance improvements posted by using homogeneity measures over using a fixed \(h^{[{\mathcal{M}}]}(d)\) are small in absolute terms, many of them are statistically significant. Case in point: for passages of 150 terms, IMaxPsg[length][base]’s performance is better to a statistically significant degree than that of setting (i) \(h^{[{\mathcal{M}}]}(d) = 0\) for all tested corpora except for FR12, (ii) \(h^{[{\mathcal{M}}]}(d) \in \{0.1,\ldots,0.4\}\) for LA+FR45, (iii) \(h^{[{\mathcal{M}}]}(d) \in \{0.1,0.3\}\) for WSJ, and (iv) \(h^{[{\mathcal{M}}]}(d) = 0.9\) for FR12.

Figure 2 also illustrates the performance degradation for passages of 25 terms. This finding is in line with our findings in Sect. 5.2 with regard to inducing passage language models.

We therefore draw a conclusion analogous to the one in Sect. 5.2: homogeneity measures can help to integrate document-based and passage-based query-similarity scores. The resultant performance can sometimes exceed both that of using each score separately and that resulting from a poor choice of fixed interpolation weights for integrating the scores.

5.4 Putting it all together

Heretofore, document-homogeneity measures have played two different roles. They helped to integrate passage and document information for inducing passage language models. In addition, they were utilized for controlling the fusion of passage-based and document-based retrieval scores in the InterMaxPsg algorithm. A natural question is, then, whether combining these two roles can yield additional performance improvements.

To study a possible integration of the two roles of the homogeneity measures, we use the InterMaxPsg algorithm with our passage language model. We denote this implementation with \(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\); the measure \({\mathcal{M}}\) is used both for inducing the passage model and for fusing retrieval scores. For brevity, we focus on the length and docPsg measures, which have yielded the best performance so far. The potential merit of this role-integration is as follows. Consider, for example, a highly heterogeneous document in which the whole-document information plays very little role in scoring each of its passages, and consequently, in selecting the single most query-related passage (i.e., in the MaxPsg algorithm). However, it might be the case that pieces of query-related information are scattered over different passages; considering them as a whole, as is done when using the document retrieval score, can help to more effectively determine the relevance of the entire document.
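The following is a minimal end-to-end sketch of \(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\). It assumes the Jelinek-Mercer form of our homogeneity-based passage language model, in which the non-corpus probability mass \(1-\lambda_{{\mathcal C}}\) is split between the passage and its containing document according to \(h^{[{\mathcal{M}}]}(d)\); all names are illustrative rather than part of our notation, and log query-likelihoods stand in for the retrieval scores being fused:

    import math
    from dataclasses import dataclass

    LAMBDA_C = 0.5  # corpus smoothing weight, as in Sect. 4

    @dataclass
    class Span:
        """A text span (document or passage): term counts and length."""
        tf: dict     # term -> count within the span
        length: int  # total number of terms in the span

    def passage_lm_prob(word, psg, doc, corpus_prob, homogeneity):
        """Homogeneity-based passage model (assumed JM form): the
        non-corpus mass is split between the passage's and the
        document's maximum-likelihood estimates according to h^[M](d)."""
        p_psg = psg.tf.get(word, 0) / psg.length
        p_doc = doc.tf.get(word, 0) / doc.length
        blended = (1.0 - homogeneity) * p_psg + homogeneity * p_doc
        return (1.0 - LAMBDA_C) * blended + LAMBDA_C * corpus_prob

    def doc_lm_prob(word, doc, corpus_prob):
        """Standard JM-smoothed document language model."""
        p_doc = doc.tf.get(word, 0) / doc.length
        return (1.0 - LAMBDA_C) * p_doc + LAMBDA_C * corpus_prob

    def imaxpsg_mm_score(query, doc, passages, corpus_probs, homogeneity):
        """Score each passage with the homogeneity-based model, keep the
        best passage score, and fuse it with the document score using
        the same homogeneity value as the interpolation weight."""
        doc_score = sum(math.log(doc_lm_prob(w, doc, corpus_probs[w]))
                        for w in query)
        best_psg = max(
            sum(math.log(passage_lm_prob(w, g, doc, corpus_probs[w],
                                         homogeneity))
                for w in query)
            for g in passages)
        return homogeneity * doc_score + (1.0 - homogeneity) * best_psg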

Table 7 presents the performance numbers of the MaxPsg algorithm (MaxPsg\([{\mathcal{M}}]\)), which uses homogeneity measures for passage-model induction, along with those of the InterMaxPsg algorithm, which uses the measures for integration of retrieval scores; InterMaxPsg is implemented either with a standard passage language model (\(\text{IMaxPsg}_{\mathcal{M}}[base]\)), as in Sect. 5.3, or with our homogeneity-based passage language model (\(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\)), as described above. The performance numbers of the Doc[base] and MaxPsg[base] baselines are presented for reference. In all cases, Jelinek-Mercer smoothing is used with \(\lambda_{{\mathcal C}}=0.5\), as described in Sect. 4.

Table 7 Comparison between the MaxPsg and InterMaxPsg algorithms

Our first observation based on Table 7 is that utilizing the homogeneity measures for inducing passage language models in the MaxPsg algorithm is in general more effective than using the InterMaxPsg algorithm with the standard passage language model (where the measures control the retrieval-score fusion). Specifically, in most relevant comparisons, MaxPsg\([{\mathcal{M}}]\) posts performance that is at least as good as that of \(\text{IMaxPsg}_{\mathcal{M}}[base]\); however, the performance differences are statistically significant in only a few of the cases. In addition, we note that using the homogeneity-based passage model in the InterMaxPsg algorithm instead of the standard one (that is, moving from \(\text{IMaxPsg}_{\mathcal{M}}[base]\) to \(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\)) helps to slightly improve the algorithm’s performance in a vast majority of the cases (sometimes to a statistically significant degree). The performance of \(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\) is as good as that of MaxPsg[base] in a majority of the relevant comparisons. However, the performance of \(\text{IMaxPsg}_{\mathcal{M}}[{\mathcal{M}}]\) is comparable to that of MaxPsg\([{\mathcal{M}}]\), indicating, again, that most performance gains are due to using the homogeneity measures at the language-model level.

6 Conclusions and future work

We presented a general probabilistic model for passage-based ad hoc document retrieval. We showed that several previously suggested passage-based document-ranking approaches, along with new ones, can be understood using this model.

To instantiate specific algorithms from the suggested model, we used language-model estimates. In doing so, we presented a new passage language model that utilizes information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we presented yield passage language models that are more effective than the standard passage language model for document retrieval. Furthermore, our passage language models are also more effective than the standard passage model for constructing and utilizing passage-based relevance models. In addition, we demonstrated the relative merits of using the homogeneity measures to control the reliance on whole-document information, as compared with using a fixed level of reliance for all documents in the corpus, as was proposed in some past work.

In further exploration, we showed that while the suggested homogeneity measures are also effective for fusing document-based and passage-based retrieval scores for document ranking, using the measures to integrate document and passage information at the language-model level is somewhat more effective.

For future work, we plan to devise additional document-homogeneity measures. We also intend to study the effectiveness of our proposed passage language model for tasks that rely on passage retrieval (e.g., question answering and XML retrieval, and, more generally, structured-document retrieval).