1 Introduction

In language modeling approaches for information retrieval, we often score a document based on the likelihood of a query according to a document language model (Ponte and Croft 1998; Zhai and Lafferty 2001a) or the KL-divergence between a query language model and a document language model (Lafferty and Zhai 2001; Zhai and Lafferty 2001a). In either case, a basic task is to estimate a document language model. In Zhai and Lafferty (2001a), it is shown that accurate estimation of document language models is quite critical for improving retrieval performance; in particular, how a document language model is smoothed can significantly affect retrieval precision.

Traditional smoothing methods mainly use the global collection information for smoothing (Ponte and Croft 1998; Miller et al. 1999; Hiemstra and Kraaij 1998; Zhai and Lafferty 2001b). These methods generally do a linear interpolation of the maximum likelihood estimate of the model and a reference language model estimated using the whole collection:

$$ p_{\rm smooth}(w|d) = (1-\lambda) p_{\rm ML}(w|d) + \lambda p(w|{\mathcal{C}}) $$

where \(p_{\rm ML}(w|d)\) is the maximum likelihood estimate of the model, \(p(w|{\mathcal{C}})\) is the collection language model and coefficient λ controls the influence of each model. Thus these methods use the probabilities computed based on the whole document collection for smoothing.

Recently there has been some research on using local corpus structure for smoothing purposes (Liu and Croft 2004; Kurland and Lee 2004; Tao et al. 2006). These methods use local corpus information instead of global information for smoothing with the intuition that local structure provides more focused information about the document. These methods also use a simple interpolation of the maximum likelihood estimate of the model and the local surrounding model for smoothing:

$$ p_{\rm smooth}(w|d) = (1 - \lambda) p_{\rm ML}(w|d) + \lambda p(w|c) $$

where p(w|c) is the local surrounding model. What all these smoothing algorithms have in common is a simple one-step interpolation of the model derived from the individual document and the model of the surrounding documents. Note that the surrounding set can potentially include all the documents in the collection, depending on how it is defined.

In this paper, we propose a new way of smoothing document language models based on probabilistic score propagation in the similarity structure of the corpus, which allows us to do the smoothing in multiple steps. Our main idea is to propagate word count statistics in a network of similar documents. The network is composed of documents with generation links (Kurland and Lee 2005) between them. Generation links can be thought of as automatically generated citation links between documents, serving as alternatives to hyperlinks for connecting related documents. There is a generation link between two documents if the language model of the first document gives high probability to the term sequence comprising the second one. The word count statistics are then propagated through the network probabilistically. The intuition behind the propagation is to do word-count propagation between similar documents, smoothing each document by the content of its similar neighbors. The smoothing is performed iteratively, updating the document contents until further updates no longer change them. The result is a set of smoothed document language models. The iterative nature of the algorithm allows us to smooth each document by the new, smoothed version of its neighboring documents, allowing us to propagate term counts to remotely related documents.

We evaluated our algorithm on several TREC data sets, including Associated Press Newswire (AP) 1988, 1989, 1990, the LA Times (LA) and San Jose Mercury News (SJMN). The results show that the proposed algorithm consistently and in most cases significantly outperforms an optimized standard simple collection-based smoothing algorithm (i.e., Dirichlet prior). The results also show that our algorithm is especially effective for improving precision in the top-ranked documents through “filling in” missing query terms in the relevant documents. Compared with other smoothing methods that also exploit local corpus structures, our method is also more effective for improving precision in the top-ranked documents. Since a user often reads only a few top-ranked results in most search engine applications, the proposed smoothing method can be expected to deliver better utility to the users than these existing smoothing methods. Furthermore, our method is shown to be complementary with pseudo feedback which tends to improve the average precision, and a combination of our method and pseudo feedback achieves better performance than either one alone.

The rest of the paper is organized as follows. We first present some background on language model smoothing and some previous work on smoothing in Sect. 2. We then introduce our probabilistic term weight propagation algorithm in Sect. 3. We discuss the experimental results in Sect. 4, review related work in Sect. 5 and finally conclude in Sect. 6.

2 Document language model smoothing

2.1 Language modeling approaches to information retrieval

The language modeling approach to information retrieval has been studied extensively in the past few years and has been shown to be successful for many retrieval tasks, such as ad hoc retrieval (Ponte and Croft 1998; Miller et al. 1999; Hiemstra and Kraaij 1998; Zhai and Lafferty 2001a; Lavrenko and Croft 2001), structured document retrieval (Ogilvie and Callan 2003), distributed information retrieval (Si and Callan 2005), and expert finding (Balog et al. 2006; Fang and Zhai 2007). The basic idea of this approach is to estimate a language model for each document and use the language model to rank the documents given a query.

In the query likelihood scoring method (Ponte and Croft 1998; Zhai and Lafferty 2001a), the documents are ranked based on the likelihood of the query given each document language model:

$$ p(q|d)=\prod\limits^n_{i = 1}p(q_i|d) $$

where \(q = q_1 \ldots q_n\) is the query. Thus the retrieval problem is reduced to the estimation of a unigram document language model p(.|d).

In this scoring method, exploiting feedback documents to improve the ranking accuracy is difficult. The Kullback–Leibler (KL) divergence scoring method (Lafferty and Zhai 2001) overcomes this problem by introducing the query language model and scoring the documents based on the KL-divergence of the query language model and the document language model:

$$ D(q||d)=\sum\limits_{w \in V} p(w|q)\log\frac{p(w|q)}{p(w|d)} $$

where V is the set of all words in the vocabulary. Note that the query likelihood method is a special case of the KL-divergence method when the query language model is estimated based on the empirical query word distribution. The estimation of the query model in this method can be improved using feedback models (Lavrenko and Croft 2001; Zhai and Lafferty 2001a).
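
To make these two scoring functions concrete, here is a minimal Python sketch; the toy document model, the query, and the 1e-12 floor for unseen words are illustrative assumptions rather than part of the original formulation:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_model):
    """Log query likelihood: sum_i log p(q_i|d); higher is better."""
    return sum(math.log(doc_model.get(q, 1e-12)) for q in query_terms)

def kl_divergence(query_model, doc_model):
    """D(q||d) = sum_w p(w|q) log(p(w|q)/p(w|d)); lower is better."""
    return sum(p_q * math.log(p_q / doc_model.get(w, 1e-12))
               for w, p_q in query_model.items() if p_q > 0)

# Toy (already smoothed) document model and a two-word query
doc_model = {"language": 0.2, "model": 0.3, "retrieval": 0.1, "the": 0.4}
query = ["language", "retrieval"]
query_model = {w: c / len(query) for w, c in Counter(query).items()}

print(query_likelihood(query, doc_model))   # rank documents by this value
print(kl_divergence(query_model, doc_model))  # or by the negative of this value
```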

In both query likelihood and KL-divergence scoring methods, the estimation of the document language model is an important factor which can affect retrieval performance significantly (Zhai and Lafferty 2001a). In particular, smoothing has been shown to be critical in accurately estimating a document language model. Indeed, when estimating the document language model, the maximum likelihood estimator estimates the probability of each word based on the relative frequency of the word:

$$ p_{\rm ML}(w|d) =\frac{c(w,d)}{|d|} $$

where c(w, d) is the number of occurrences of the word w in document d and |d| is the total number of words in d. Thus the maximum likelihood estimator assigns zero probability to any word not occurring in the document, which underestimates the probabilities of these missing words. The goal of smoothing is to adjust the maximum likelihood estimate to improve the accuracy of word probability estimation and to avoid the problem of zero probability.

2.2 Traditional smoothing methods

A general smoothing scheme followed by most traditional smoothing methods involves making the probability of an unseen word proportional to the probability of the word given by a reference language model estimated using the entire collection. The Jelinek–Mercer (JM) smoothing method and Bayesian Smoothing using Dirichlet priors are two traditional methods commonly used for smoothing document language models (Zhai and Lafferty 2001b).

In the JM smoothing method, the document language model is estimated based on a fixed coefficient linear interpolation of the maximum likelihood model of the document and the global collection model:

$$ p(w|d) = (1 - \lambda) p_{\rm ML}(w|d) + \lambda p(w|{\mathcal{C}}) $$

where coefficient λ controls the influence of each model.

In the Bayesian smoothing approach (also referred to as Dirichlet prior smoothing), the document language model is estimated as:

$$ \begin{aligned} p(w|d)&=\frac{c(w,d)+\mu p(w|{\mathcal{C}})}{|d|+\mu}\\ &=\frac{|d|}{|d| + \mu}p_{\rm ML}(w|d)+\frac{\mu}{|d| + \mu} p(w|{\mathcal{C}}) \end{aligned} $$

where μ is the Dirichlet prior parameter. This method again involves an interpolation of the individual document model and the collection model, but the coefficient controlling the influence of each model is document-dependent.
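
As an illustration, here is a minimal Python sketch of both smoothing formulas; the toy document, collection model, and parameter values are made up for the example:

```python
from collections import Counter

def jm_smooth(word, doc_counts, doc_len, coll_model, lam=0.1):
    """Jelinek-Mercer: fixed-coefficient interpolation with the collection model."""
    p_ml = doc_counts.get(word, 0) / doc_len
    return (1 - lam) * p_ml + lam * coll_model.get(word, 0.0)

def dirichlet_smooth(word, doc_counts, doc_len, coll_model, mu=1800):
    """Dirichlet prior: the interpolation coefficient depends on document length."""
    return (doc_counts.get(word, 0) + mu * coll_model.get(word, 0.0)) / (doc_len + mu)

doc = "the language model of the document".split()
counts, dlen = Counter(doc), len(doc)
coll = {"the": 0.3, "language": 0.1, "model": 0.1, "of": 0.2, "document": 0.1, "query": 0.2}

print(jm_smooth("query", counts, dlen, coll))         # unseen word gets non-zero probability
print(dirichlet_smooth("query", counts, dlen, coll))
```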

One deficiency of these traditional smoothing methods is that the global collection information does not reflect the specific content of individual documents, thus it only provides a crude way for smoothing. To address this deficiency, some recent work (Liu and Croft 2004; Kurland and Lee 2004; Tao et al. 2006) has attempted to use the local structure for smoothing with the intuition that the local structure can provide more focused information for better estimation of a document language model. We now briefly review this line of work.

2.3 Using local corpus structures

Kurland and Lee (2004) propose to combine the information drawn from the content of the document with how the document is situated within the similarity structure of the corpus to better represent the document. In their method, they use clusters as a means to represent the similarity structure of the corpus. They first construct a set of overlapping clusters of similar documents offline. At retrieval time, they choose a set of appropriate clusters based on the query and smooth the language model of the document with the cluster language model, with the intuition that clusters provide smoothed, representative statistics for their elements. For example, a document belonging to a cluster whose components generally contain the query terms should be considered relevant even if it does not contain the query terms itself. Although in this work, no explicit smoothed document language models are computed, their method essentially achieves the goal of exploiting cluster information to smooth a document language model through their ranking method.

Liu and Croft (2004) also smooth representations of individual documents using the corresponding cluster models. They first do either query-independent static clustering or query-specific clustering to construct the clusters and build language models for the clusters:

$$ p(w|\hbox{Cluster}) = (1 - \beta) p_{\rm ML}(w|\hbox{Cluster}) + \beta p(w| {\mathcal{C}}) $$

where β is a general parameter for smoothing and then smooth representations of individual documents using models of the clusters they come from:

$$ \begin{aligned} p(w|d)&=(1 - \lambda) p_{\rm ML}(w|d) + \lambda p(w|\hbox{Cluster})\\ &=(1 - \lambda) p_{\rm ML}(w|d) + \lambda [(1 - \beta) p_{\rm ML}(w|\hbox{Cluster}) + \beta p(w|{\mathcal{C}})] \end{aligned} $$

where λ and β are general parameters for smoothing. In other words, they first smooth the cluster model with the whole collection model and then smooth the document model with the smoothed cluster model. This method is called CBDM for Cluster-Based Document Model.

In another study, Tao et al. (2006) expand documents using local corpus structures to better estimate document language models. They augment a document probabilistically with potentially all similar documents in the collection. For each document, they construct a probabilistic neighborhood of similar documents, where each neighbor is associated with a probability value that reflects how likely it is to have been generated from the underlying distribution of the original document. They then expand each document with the probabilistic neighborhood around it:

$$ c(w,d^{\prime})=\alpha c(w,d) + (1 - \alpha) \sum\limits_{b \in {\mathcal{C}} - \{d\}}(\gamma_b(d) \times c(w,b)) $$

Here d′ is the expanded version of d, α is a parameter that controls the balance between the content of the document and the influence of the neighborhood, and \(\gamma_b(d)\) is the confidence value assigned to each neighboring document b based on its similarity to the document d. They use d′, the expanded version of the document, to estimate the document language model. From the smoothing viewpoint, this work is an extension of Liu and Croft’s work where each document has its own cluster for smoothing. In the rest of the paper, we will refer to this method as DELM for Document Expansion Language Model.
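
A rough sketch of this kind of probabilistic document expansion is given below; the neighborhood confidence values are assumed to be precomputed and α = 0.5 is only an example setting, so this is an illustration of the idea rather than the exact procedure of Tao et al.:

```python
from collections import Counter

def expand_document(d, docs, gamma, alpha=0.5):
    """Expand the pseudo-counts of document d with weighted counts of its neighbors.

    docs  : {doc_id: Counter of term counts}
    gamma : {doc_id: confidence value of each neighbor b of d}, assumed precomputed
    """
    expanded = Counter()
    for w, c in docs[d].items():
        expanded[w] += alpha * c                      # keep the original content
    for b, conf in gamma.items():
        if b == d:
            continue
        for w, c in docs[b].items():
            expanded[w] += (1 - alpha) * conf * c     # add weighted neighbor counts
    return expanded
```

The document language model would then be estimated from the expanded counts c(w, d′) rather than the original counts c(w, d).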

As can be seen, what all these methods do is a one-step interpolation of the document language model and a reference language model. In the following section, we introduce our proposed smoothing method, which propagates scores in the similarity structure of the corpus probabilistically and allows us to do the smoothing in multiple steps.

3 A term propagation smoothing method

In this section, we present the term propagation smoothing method. We first discuss why multiple-step smoothing is potentially advantageous over single-step smoothing.

3.1 One-step versus multiple-step smoothing

The current smoothing methods all do one-step smoothing. That is, the smoothed language model of a document is generally a one-step interpolation of the relative frequencies of words in the target document and those in some reference set of documents (either surrounding documents by some similarity or the whole set of documents). Intuitively, we could do such one-step smoothing multiple times. Indeed, if we believe that a smoothed document language model is a better representation of the document than its maximum likelihood estimate (i.e., relative frequencies), then smoothing the language model of a document using the already smoothed language models of its surrounding documents can be better than smoothing using the unsmoothed language models of those surrounding documents. We now use a simple example to illustrate this intuition. Among all the methods which use local information for smoothing, our work is most similar to the “Document Expansion Language Model (DELM)” method proposed by Tao et al. (2006). We thus use this DELM method in the illustration.

Suppose that we have a document collection of five documents, \({\mathcal{C}}=\{d_1, d_2, d_3, d_4, d_5\}{:}\)

DocID    Content
d_1      A B C D E
d_2      C D E F
d_3      A C D E
d_4      D E F
d_5      A B

In order to augment the documents, the DELM method constructs a graph of documents with documents as nodes and cosine similarities as relation weights. Figure 1a shows the corresponding graph.

Fig. 1 One-step versus multiple-step smoothing. (a) Cosine similarity graph, (b) Smoothed document LM (DELM Method), (c) Smoothed document LM (Propagation Method)

It then expands each document by the content of the surrounding documents:

$$ c(w,d^{\prime})=\alpha c(w,d) + (1 - \alpha) \times \sum\limits_{b \in {\cal C}-\{d\}}(\gamma_d(b)\times c(w,b)) $$

where c(w, d) is the count of word w in document d and \(\gamma_d(b)\) is the confidence value assigned to each document b in the neighborhood of d based on the similarity of b and d; it controls the influence of b on the expanded version of d. (More details on this method can be found in Tao et al. (2006).)

When augmenting \(d_5\) using the DELM method, the only documents influencing \(d_5\) will be \(d_1\) and \(d_3\). The corresponding augmented document language model (assuming α to be 0.5) is shown in Fig. 1b, i.e., the probability of the word ‘F’ in \(d_{5}^{\prime},\) the expanded version of \(d_5\), will still be 0.

In multiple-step smoothing, on the other hand, \(d_5\) would be influenced by all of \(d_1\), \(d_2\), \(d_3\) and \(d_4\), and the probability of the word ‘F’ in the smoothed language model for \(d_5\) would be non-zero. Indeed, the smoothed language models for \(d_1\) and \(d_3\) would have a non-zero probability for ‘F’ after one step of smoothing with \(d_2\). Thus after another iteration of smoothing, in which \(d_5\) would be smoothed with the smoothed language models of its two neighbors \(d_1\) and \(d_3\), the probability of ‘F’ for \(d_5\) would also be non-zero. That is, the count of ‘F’ in \(d_2\) can be propagated to \(d_5\) through \(d_1\) and \(d_3\). In Fig. 1c, we show a sample smoothing result obtained by applying our proposed method to this toy example. Intuitively, this achieves more accurate smoothing than the result shown in Fig. 1b.
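
To make the example concrete, the sketch below runs two rounds of a simplified neighbor interpolation on the toy collection (uniform neighbor weights and α = 0.5 are made-up simplifications of the full algorithm in Sect. 3.3): the probability of ‘F’ in \(d_5\) is still zero after the first round and becomes non-zero only after the second.

```python
from collections import Counter

docs = {
    "d1": "A B C D E".split(), "d2": "C D E F".split(),
    "d3": "A C D E".split(),   "d4": "D E F".split(),   "d5": "A B".split(),
}
# hypothetical neighbor lists mirroring the cosine-similarity graph of Fig. 1a
neighbors = {"d1": ["d2", "d3", "d5"], "d2": ["d1", "d3", "d4"],
             "d3": ["d1", "d2", "d5"], "d4": ["d1", "d2", "d3"], "d5": ["d1", "d3"]}

def ml(tokens):
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

models = {d: ml(toks) for d, toks in docs.items()}
alpha = 0.5
for step in (1, 2):                                   # two rounds of smoothing
    new_models = {}
    for d, nbrs in neighbors.items():
        vocab = set(models[d]) | {w for b in nbrs for w in models[b]}
        new_models[d] = {w: alpha * models[d].get(w, 0.0)
                            + (1 - alpha) * sum(models[b].get(w, 0.0) for b in nbrs) / len(nbrs)
                         for w in vocab}
    models = new_models
    print("after round", step, "p(F|d5) =", round(models["d5"].get("F", 0.0), 4))
# after round 1, p(F|d5) is 0.0; after round 2 it becomes non-zero
```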

3.2 Term propagation smoothing

The basic idea of the proposed term propagation smoothing method is precisely to allow counts of terms in a document to “spread” to other documents that are “remotely” related in a weighted manner so that we can achieve multiple-step smoothing of document language models. To implement this idea, we first need to construct a document similarity graph through which the counts can be propagated. Figure 2 shows a sketch of the proposed term propagation smoothing method. Having a set of documents, at the first step, we estimate an unsmoothed unigram language model for each document. For each query word, we then compute the probabilities \(p^0(d|w)\) using Bayes’ formula. At the third step, we propagate these probabilities in the similarity graph of the documents until they converge. We will show later in this section that with the way we construct the similarity graph and propagate the probabilities, we can guarantee that the probabilities will converge to a unique probability distribution. Having the new \(p^n(d|w)\), we finally estimate the document language model \(p_{\rm smooth}(w|d)\) by applying Bayes’ rule again. In the following, we present the details of constructing the similarity graph and the different steps of the algorithm.

Fig. 2 Smoothing process steps

3.2.1 Constructing the generation graph

The existence of human-created hyperlinks in a hyperlinked environment provides a huge amount of latent judgments about the relevance of documents (Kleinberg 1999). However in a non-hypertext setting, these judgments are not available. The problem of automatically generating links between documents in a non-hypertext environment has been studied before (Frisse 1987; Furuta et al. 1989; Wilkinson and Smeaton 1999; Kurland and Lee 2005). In this work, we use generation graphs proposed in Kurland and Lee (2005) to construct a graph of documents for propagating term counts. A generation graph can be viewed as a graph of documents that cite each other, where the weighted links are induced automatically from the content of the documents. Specifically, a generation graph is a directed graph where documents are the nodes and link weights are proportional to the generation probabilities, the probabilities assigned by the language model of one document to the text of another.

Given any set of documents D, we can construct a generation graph G = (D, W) as follows. For each document d ∈ D, we compute p(d|g), the likelihood of document d given any other document g ∈ D and take the top k documents that give d the highest likelihoods as k neighbors of d in G. We denote this set of documents by TopGen(d). We have an edge between d and g (i.e., (d, g) ∈ W) if and only if g ∈ TopGen(d). The probability weight of each edge \(p(d \rightarrow g)\) is simply defined as

$$ p(d\rightarrow g) = \frac{p(d|g)}{\sum_{g^{\prime}\in D} p(d|g^{\prime})} $$
(1)

where p(d|g) is assumed to be zero if \(g \notin TopGen(d).\) Clearly, \(\sum_{g \in D} p(d\rightarrow g)=1.\) Intuitively, this means that we have a conditional probability distribution over all the neighbors of d given d, which can be interpreted as giving the probability of “walking” to a neighbor of d from d. Later we will see that such a probabilistic graph allows us to implement our idea of multiple-step smoothing as a random walk model on this graph.
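
A sketch of this offline construction in Python follows; it assumes the generation probabilities p(d|g) are computed from Dirichlet-smoothed document language models, and the smoothing choice, μ value, and k are illustrative:

```python
import math
from collections import Counter

def log_generation_prob(d_tokens, g_tokens, coll_model, mu=1800):
    """log p(d|g): log-likelihood of d's text under a smoothed model of g."""
    g_counts, g_len = Counter(g_tokens), len(g_tokens)
    return sum(math.log((g_counts.get(w, 0) + mu * coll_model.get(w, 1e-9))
                        / (g_len + mu)) for w in d_tokens)

def build_generation_graph(docs, coll_model, k=10):
    """For each d, keep the top-k generators g and normalize the edge weights (Eq. 1)."""
    edges = {}
    for d in docs:
        logs = {g: log_generation_prob(docs[d], docs[g], coll_model)
                for g in docs if g != d}
        top = sorted(logs, key=logs.get, reverse=True)[:k]
        shift = max(logs[g] for g in top)             # shift for numerical stability
        w = {g: math.exp(logs[g] - shift) for g in top}
        z = sum(w.values())
        edges[d] = {g: w[g] / z for g in top}         # p(d -> g), zero elsewhere
    return edges
```

Here edges[d] defines the outgoing conditional distribution p(d → g) of Eq. 1; in practice TopGen(d) and p(d|g) would be precomputed offline as noted above.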

Given a query, intuitively, improving the language models of the top-ranked documents is most interesting, as lowly ranked documents are unlikely to be relevant. This suggests that we only need to construct the generation graph for a certain number of top-ranked documents based on their retrieval scores. Such a “working set” approach has the additional advantage of reducing computational overhead and regularizing the propagation to avoid over-smoothing. As will be shown later, smoothing with only a small number of top-ranked documents is more robust and tends to perform better than smoothing with many top-ranked documents.

The generation graph constructed this way captures the similarity structure of the corpus. By propagating scores in the graph, we could allow a document d to iteratively receive support of counts of words from those documents g whose p(d|g) is relatively large. Since TopGen(d) and p(d|g) can be pre-computed, such a generation graph can be constructed efficiently during the run time of a query.

The choice of k here is empirical. In the experiments section, we will show the results for different values of k and analyze the sensitivity to this number.

3.3 Probabilistic term propagation algorithm

The probabilistic term propagation (PTP) algorithm involves the following four steps:

  • Step 1: Having the set of documents, we estimate an unsmoothed unigram language model based on a document d using the maximum likelihood estimate given by the relative counts of the words:

    $$ p_{\rm ML}(w|d)=\frac{c(w,d)}{|d|} $$

    Here w is any word in our vocabulary V (the vocabulary is composed of all the words that appear in at least one document in D) and \(|d| = \sum_{w^{\prime}\in V}c(w^{\prime}, d).\) Note that with the maximum likelihood estimator, we will have zero probabilities for all the words absent from the document.

  • Step 2: For each query word, we then compute the probabilities \(p^0(d|w)\) using Bayes’ formula:

    $$ p(d|w)\propto p(w|d)p(d) $$

    where p(d) is the document prior. The reason we reverse the conditional probability is that p(d|w) defines a distribution over all the documents, and this allows us to cast multiple-step smoothing as iteratively revising this distribution based on propagation on the generation graph. It is unclear how we could do the same thing with the original conditional probability p(w|d).

    Assuming a uniform document prior, we will have:

    $$ p^0(d|w) = \frac{p_{\rm ML}(w|d)} {\sum_{d_i \in D} p_{\rm ML}(w|d_i)}=\frac{c(w, d)/|d|}{\sum_{d_i \in D} c(w,d_i)/|d_i|} $$

    Since every word w in our vocabulary must appear in at least one document in D, there is at least one \(d_i \in D\) for which \(c(w, d_i) > 0\). At this point, for each query word, we have the estimated conditional probabilities of all the documents given the word, with zero probabilities for those documents not containing the word; that is, the probabilities of the documents not containing the word are underestimated. We will see how propagation on the generation graph can improve this estimate.

  • Step 3: At this step, we smooth the probability of each document given a word with the probabilities of similar documents, with the intuition that both the content of the current document and the content of similar documents can be useful for estimating the probabilities.

    Given a word w, we define the probability of each document as:

    $$ p(d|w)=\alpha p^0(d|w) + (1 - \alpha) \sum_{x \in D} p(x|w) p(x \rightarrow d) $$
    (2)

    i.e., a linear combination of its content-based probability and the effect of neighbors in the generation graph. Here \(p(x \rightarrow d)\) is the weight of the directed edge from x to d in the generation graph which is defined in Eq. 1. These probabilities are computed iteratively, updating the probability of each document using the updated probabilities of the neighbors until they converge to a limit. This updating formula is a special case of the general probabilistic relevance propagation framework proposed in Shakery and Zhai (2006). At each step, the score of each document is propagated to its outgoing neighbors in the generation graph in a weighted manner, and the score of each document is updated to a combination of the sum of its incoming (propagated) scores and its own content-based score.

The score definition in Eq. 2 corresponds to the standing probability distribution of a random walk on the generation graph of the documents. Indeed, the smoothing algorithm can be interpreted as follows: Imagine that a random surfer is surfing the set of documents looking for documents related to the word w. At each step, the surfer would either jump to a related document (by following an edge on the graph) with probability 1−α or jump to a random document with probability α. If the surfer decides to jump to a related document (from the current document d) the surfer would land on a document g with probability \(p(d\rightarrow g);\) otherwise, the surfer would land on a random document g with probability p 0(g|w). The surfer keeps doing this iteratively, jumping from document to document looking for documents related to the word. The final score of each document is equal to the standing probability of the surfer on the document.

In order to compute the scores, we construct a matrix \(M = \alpha M_0 + (1-\alpha)M_G\) where \(M_0(m, n) = p^0(d_n|w)\) and \(M_G(m, n) = p(d_m \rightarrow d_n).\) We then compute the probability scores using matrix multiplication: \(\vec{P}=M^T \vec{P}\) where \(\vec{P}\) is the vector of the probability values. The probability values are computed iteratively in a very similar way to existing link-based scoring algorithms such as PageRank (Page et al. 1998). Clearly, efficient matrix multiplication methods can be used to further speed up the scoring. The final scores will be the values of the stationary probability distribution of the Markov chain defined by M. We ensure reachability to each document by smoothing the random jump probability \(p^0(d|w)\) slightly with a uniform distribution over all the documents (similar to the uniform jumping probability in PageRank (Page et al. 1998), but we give the otherwise unreachable documents a very tiny probability). Thus by the Ergodicity theorem for Markov chains (Grimmett and Stirzaker 1989), we know that the Markov chain defined by such a transition matrix M must have a unique stationary probability distribution.
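
A sketch of Steps 2 and 3 for a single query word, using power iteration on the matrix M described above; the convergence tolerance, the size of the uniform jump mass, and the use of numpy are implementation choices, not prescribed by the method:

```python
import numpy as np

def initial_scores(word, ml_models, doc_ids):
    """Step 2: p^0(d|w) proportional to p_ML(w|d) under a uniform document prior."""
    s = np.array([ml_models[d].get(word, 0.0) for d in doc_ids], dtype=float)
    return s / s.sum() if s.sum() > 0 else s

def propagate(p0, M_G, alpha=0.5, eps=1e-4, tol=1e-8, max_iter=200):
    """Step 3: iterate P <- M^T P with M = alpha * M0 + (1 - alpha) * M_G (Eq. 2).

    p0  : vector of p^0(d|w) over the k working-set documents (sums to 1)
    M_G : k x k matrix with M_G[m, n] = p(d_m -> d_n) (rows sum to 1)
    eps : tiny uniform jump mass keeping every document reachable
    """
    k = len(p0)
    jump = (1 - eps) * np.asarray(p0, dtype=float) + eps / k
    M = alpha * np.tile(jump, (k, 1)) + (1 - alpha) * np.asarray(M_G, dtype=float)
    P = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        P_next = M.T @ P
        if np.abs(P_next - P).sum() < tol:
            break
        P = P_next
    return P_next  # approximates the stationary distribution p^n(d|w)
```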

  • Step 4: Having obtained the propagated conditional probabilities \(p^n(d|w)\) (after n iterations), we can “convert” them into the desired conditional probabilities p(w|d) of the document language model by using Bayes’ rule again:

    $$ p_{\rm smooth}(w|d) \propto p^n(d|w)p(w) $$

    where p(w) is the word prior. We estimate the word priors from the counts of the words in the entire collection \(\left(p(w|{\mathcal{C}})\right).\) Since we have done propagation only for query words, we distinguish two cases for computing these probabilities, one where w is a query word (w ∈ Q) and one where it is not \((w \notin Q){:}\)

    $$ \begin{aligned} p_{\rm smooth}(w|d)&\propto p(d|w)p(w|{\mathcal{C}})\\ &=\frac{p(d|w)p(w|{\mathcal{C}})}{\sum_{w_i} p(d|w_i)p(w_i|{\mathcal{C}})}\\ &=\left\{\!\!\begin{array}{ll} \frac{p^{n}(d|w)p(w|{\mathcal{C}})}{\sum_{w_i \in Q}p^{n}(d|w_i)p(w_i|{\mathcal{C}}) + \sum_{w_i\notin Q}p^{0}(d|w_i)p(w_i|{\mathcal{C}})}&w\in Q\\ \frac{p^0(d|w)p(w|{\mathcal{C}})}{\sum_{w_i \in Q}p^{n}(d|w_i)p(w_i|{\mathcal{C}}) + \sum_{w_i\notin Q}p^{0}(d|w_i)p(w_i|{\mathcal{C}})}&w\notin Q\\ \end{array}\right. \end{aligned} $$

    \(p_{\rm smooth}(w|d)\) gives the smoothed document language model for document d.
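
A sketch of this final conversion (Step 4); it assumes the propagated probabilities are available for every query word and the unpropagated \(p^0(d|w)\) for the remaining vocabulary:

```python
def smoothed_doc_model(d, query_words, p0, pn, coll_model):
    """p_smooth(w|d) from p(d|w) p(w|C): p^n(d|w) for query words, p^0(d|w) otherwise.

    p0, pn     : dicts {word: {doc_id: prob}} holding p^0(d|w) and p^n(d|w)
    coll_model : word prior p(w|C) estimated from collection counts
    """
    unnorm = {}
    for w, prior in coll_model.items():
        table = pn if w in query_words else p0
        unnorm[w] = table.get(w, {}).get(d, 0.0) * prior
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()} if z > 0 else unnorm
```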

In our probabilistic propagation method, the score propagation is computed once for each query word. As discussed earlier, we do not use the whole graph of documents for propagation; instead we propagate the counts in the top k documents returned by a basic retrieval method, with the intuition that the documents ranked lower than k are unlikely to be relevant. (Indeed, as will be shown later in the discussion of experimental results, it is actually beneficial to restrict propagation to only the top-ranked documents.) This node pruning also helps us to speed up the propagation process. Specifically, given a query, we extract the subgraph corresponding to the top k documents returned by a basic retrieval method from the universal generation graph. The universal generation graph corresponds to the whole set of documents and is constructed once offline. The query subgraphs are generally sparse, since the number of outlinks of each document in the generation graph is prespecified and is commonly small compared to k. Thus we can make use of sparse matrix multiplication methods to speed up the iterative multiplications. Even if we do not exploit sparse matrix multiplication methods, the computational complexity in each iteration of propagation is \(O(k^2)\), which is about the same complexity as doing query-specific clustering (with a pre-computed similarity matrix) as done in some previous work (Liu and Croft 2004). In practice, the scores converge to a limit quite fast and the whole propagation process can be done in real time. In our experiments, the propagation took less than 0.1 s (for k = 1,000) to converge for each query word on a Linux desktop machine with dual Pentium 4 3.0 GHz processors and 1 GB memory, thus the probabilistic propagation smoothing algorithm is efficient enough to be performed in real time.

3.4 Retrieval using the smoothed language model

As discussed in Zhai and Lafferty (2001c), smoothing plays two distinct roles in retrieval. The first role is to improve the accuracy of estimation of document language models. The second is to model any noise in the query. Our propagation method aims at improving smoothing for the first purpose. Thus, to ensure that we also model noise in the query, we further perform a second stage of smoothing. That is, we use the following final language model for retrieval with the query likelihood retrieval method or the KL-divergence retrieval method:

$$p^{\prime}(w|d)=\frac{|d|}{|d|+\mu}p_{\rm smooth}(w|d)+\frac{\mu}{|d|+\mu}p(w|{\mathcal{C}}) $$

where μ is a parameter similar to the one in Dirichlet prior smoothing (Zhai and Lafferty 2001b). We set μ = 1,800.
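
A one-line sketch of this second-stage interpolation, with μ = 1,800 as stated above; the inputs are the propagated model of Sect. 3.3 and the collection model:

```python
def final_model(word, doc_len, p_smooth, coll_model, mu=1800.0):
    """Second-stage interpolation of the propagated model with the collection model."""
    lam = doc_len / (doc_len + mu)
    return lam * p_smooth.get(word, 0.0) + (1 - lam) * coll_model.get(word, 0.0)
```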

4 Experiments

4.1 Data sets and baseline method

As our data sets, we used five TREC test collections: three combinations of the Associated Press Newswire 1988, 1989 and 1990, the San Jose Mercury News and the LA Times (Voorhees and Harman 2005), which are the collections previously used for evaluating various smoothing methods (Liu and Croft 2004; Kurland and Lee 2004; Tao et al. 2006). Statistics of the data sets and the queries we used in our experiments are given in Table 1. We used the query likelihood method with Dirichlet prior smoothing as our baseline.

Table 1 Data sets

4.2 Term count propagation

Having the baseline ranked list of results, we pick the top “k” documents (50 in our experiments) and extract the similarity graph of this set of documents. The similarity graph could be constructed in different ways; we use the generation graph with a fixed number of neighbors for each document in our experiments. We experimented with different numbers of neighbors, ranging from 5 to 30.

We then apply our term propagation smoothing method on this set of documents to get the corresponding smoothed document language models. At this step, the parameter α (in the propagation formula, Eq. 2) allows us to control how much we trust the propagated weights. We varied the value of α from 0.1 to 0.9 in our experiments.

We then put these documents back in the pool of documents and rank the whole data set again and compare this new ranking with the baseline ranking. As the measures of comparison, we report precision at 0.1 recall (Prec@0.1 Recall), precisions at 5 and 10 documents (Prec@5, Prec@10) and Mean Average Precision (MAP).

4.3 Basic results

The first research question we want to answer is whether the proposed PTP smoothing algorithm would perform better than the baseline Dirichlet prior smoothing method which does not exploit local corpus structure.

In order to answer this question, for each query, we pick the top 50 documents of the query likelihood ranking, extract the corresponding generation graph using 5, 10, 20 and 30 neighbors and do the propagation on these documents. We then rank all the documents in the data set again, using the new smoothed document language model if the document is among the top 50 (other documents are smoothed using the Dirichlet prior smoothing method just as in the baseline). Finally we compare the results with the baseline method. The results are shown in Table 2. In the table for each data set, we report the baseline scores as well as the scores of our PTP method with the specified parameters and the amount of improvement we get when using the proposed method. We did a Wilcoxon signed rank test at 0.05 level of significance to see if the improvement is statistically significant. Statistically significant improvements are distinguished by a star (*). We also report the number of relevant retrieved documents and the total number of relevant documents for each experiment (RelRet/TotalRel).

Table 2 Term propagation results versus Dirichlet baseline

As can be observed from the results, in all five data sets we improve almost all the measures over the baseline, although the improvement towards the top of the ranking is more significant than the improvement on average.

Figure 3 shows the Precision–Recall curve for the baseline as well as the PTP smoothing results for one of our experiments on the SJMN data set. The curve confirms our observation that the improvement is concentrated at the top ranks rather than on average: our method improves the front part of the curve.

Fig. 3 Precision–recall for one experiment in SJMN

This observation is indeed very interesting in that the precision at top ranks is improved even when MAP is not improved that much. This behavior is clearly quite beneficial in any search engine application because a user often only views a small number of top-ranked results.

4.4 One-step versus multiple-step smoothing

A major research question we want to answer is whether multiple-step smoothing is more effective than one-step smoothing. We answer this question by looking into the effect of varying the number of iterations in the propagation component of our smoothing algorithm.

We first compare the no-propagation results with the fully converged results obtained from multiple iterations of propagation in Table 3. The no-propagation ranking results are different from the Dirichlet prior baseline because we use Bayes’ rule to compute p(w|d). This comparison helps us to see how much improvement we actually get from propagation. Again we did a Wilcoxon signed rank test at 0.05 level of significance to see if the improvements are significant. Significant improvements are distinguished by a star (*). As the table shows, in all five data sets, we get significant improvement over the no-propagation method, suggesting that propagation indeed helps improve the accuracy of smoothing.

Table 3 Term propagation results versus no-propagation results

We further compare the fully converged results with the results obtained from one step of propagation in Table 4. In one-step propagation, we start from the non-smoothed probabilities \(p^0(d|w)\) and do the smoothing with immediate neighbors only, while complete propagation allows us to smooth the documents with remotely related documents. Thus comparing them allows us to see how much gain we can obtain through involving remotely related documents in smoothing. From the results in Table 4, we see that smoothing with remotely related neighbors indeed improves over smoothing with only immediate neighbors in all the data sets except for AP88-89, where the performance of complete propagation is slightly worse than that of one-step propagation. We also did a Wilcoxon signed rank test to see if the improvement is statistically significant. Statistically significant improvements are distinguished by ‘*’, ‘**’ and ‘***’ for significance levels 0.1, 0.05 and 0.01, respectively. In most cases, the improvement is statistically significant. Overall, smoothing with remotely related documents is clearly beneficial.

Table 4 One step propagation versus complete propagation

4.5 Comparison with other smoothing methods using local corpus structures

We further compare our method with some other smoothing methods proposed in the previous work that also exploit local corpus structures.

In Table 5 we show the PTP results compared with “DELM + Diri” proposed by Tao et al. (2006) on the two data sets for which we have complete results of DELM + Diri. As the table shows, in both data sets we improve precision on top-ranked results substantially, with slightly worse MAP.

Table 5 Comparison with DELM

In Fig. 4 we compare our method with “CBDM” proposed by Liu and Croft (2004) on the AP data set based on precision at different recall levels. (We do not have other results of this method.) Again our method slightly outperforms “CBDM” at low recall values (the front part of the curve) but is slightly worse at high recall levels, confirming that our method tends to improve precision on the top-ranked documents.

Fig. 4 Comparison with CBDM

Indeed, from Table 6, where we compare our method with “CBDM” based on the Mean Average Precision, we see that our MAP values are comparable to the “CBDM” results.

Table 6 Comparison with “CBDM” based on MAP

It is quite interesting to see that in all these results, our method outperforms these other methods in precision of the top-ranked documents, but does not really improve the MAP; indeed, the MAP is often slightly worse. This observation motivates us to look into the reason why our method appears to be especially good at improving precision of the top-ranked documents, and we find that it is likely because our method can help those relevant documents missing at least one query term to “fill in” the missing query terms through iterative propagation, thus improving their ranks. Indeed, the multiple-step smoothing mechanism of PTP allows term counts to be propagated to those remotely related documents. We now present a more detailed analysis of PTP in this line and examine its sensitivity to various parameters.

4.6 Detailed analysis of PTP

4.6.1 Understanding the improvement in precision of top-ranked documents

Since a main motivation of PTP is to achieve multiple-step smoothing and allow term counts in a document to help smooth those remotely related documents, we hypothesize that the reason why our method appears to be very good at improving precision of top-ranked documents is that our method promotes those relevant documents that do not match all query terms by “filling in” the missing query terms through iterative propagation of term counts. In order to test this hypothesis, we compare the top 10 of our ranking with the top 10 of the Dirichlet prior smoothing baseline ranking and extract the relevant documents unique to each ranking. We then count the number of documents missing at least one query term in each set. Table 7 shows the results of this comparison for three of the data sets.

Table 7 Percentage of relevant documents in top 10 with at least one query word missing

As the table shows, in all the three collections, the percentage of documents with at least one query word missing in our method is much higher than the baseline, suggesting that our hypothesis is true and our method helps the documents with missing query words to come to the top by filling in their missing query word(s). Indeed, according to the clustering hypothesis (van Rijsbergen 1979), which states that relevant documents tend to be more similar to each other than to non-relevant documents, our generation graph likely will connect many relevant documents to each other. Thus with our propagation algorithm, we can effectively “borrow” terms from one relevant document to help other relevant documents to fill in the missing query terms even when the document supplying a term is only remotely related to the documents receiving the term. While such propagation may potentially also help non-relevant documents to gain extra counts for query terms, the clustering hypothesis suggests that relevant documents will generally get more help than non-relevant documents through such propagation since most of the counts of query terms are in those highly relevant documents and non-relevant documents are generally not as close to such highly relevant documents as relevant documents are. Thus although we do not perform clustering explicitly, our smoothing method can be regarded as one way to exploit the clustering hypothesis to improve the estimation of language models. The fact that our method can effectively improve precision of top-ranked documents suggests that the clustering hypothesis indeed holds for the top-ranked documents. However, a detailed analysis of the performance of our method suggests that the clustering hypothesis may not hold for documents lowly-ranked in the search results (see Sect. 4.6.2). That is, in the biased sample of lowly ranked documents, relevant documents are not necessarily more similar to each other than to non-relevant documents.

However, since we propagate p(d|w), when we fill in the missing terms in one document (i.e., one document gets a larger p(d|w)), we would inevitably reduce the probability of these terms in their original documents to maintain the constraint \(\sum_{d} p(d|w)=1.\) This means that the benefit of filling in missing query terms in top-ranked documents may be at the price of pushing down some other relevant documents that are not well-connected with most relevant documents (thus not getting benefit from propagation). This may be the reason why our method is not effective for improving MAP which measures the overall ranking accuracy and especially emphasizes the precisions at high recall levels. Further analysis of the behavior of the PTP algorithm would be a very interesting future research direction.

We now study the sensitivity of PTP to some parameters.

4.6.2 Number of top documents for smoothing

In our method, we pick the top “k” documents returned by a basic retrieval method to construct the generation graph for smoothing document language models. We have so far reported the results for smoothing the top 50 documents (k = 50). Here we compare the results of smoothing the top 50 documents with the case where we do smoothing on a much larger set of documents, i.e., the top 1,000 documents (k = 1,000). Table 8 compares Precision @ 0.1 Recall, Precision @ 5, Precision @ 10 and Mean Average Precision for these two cases.

Table 8 Smoothing different number of top documents

As can be seen from the table, in most cases both smoothing the top 50 documents and smoothing the top 1,000 documents outperform the baseline results, and smoothing the top 50 documents outperforms smoothing the top 1,000 documents. The reason can be that the top 50 documents form a more coherent cluster of documents related to the query than the top 1,000 documents, which may contain many non-relevant documents, so the top neighbors of a document may actually not be very similar to it. Thus propagation over a large graph may be less reliable than over a small graph.

From the viewpoint of clustering hypothesis, this suggests that relevant documents are more clustered together in the top-ranked documents than in lowly ranked documents. That is, in the top-ranked documents, relevant documents are very close to each other (making propagation quite effective and reliable), but the relevant documents ranked down in the result list are not necessarily more similar to each other or to those highly relevant documents, where most of the counts of query terms are, than some highly ranked non-relevant documents are. This observation is consistent with what is observed in some other work exploiting clustering hypothesis. For example, in the work Tombros (2002), it is found that query-dependent clustering is more effective than static query-independent clustering. Similarly, query expansion with local context analysis (i.e., pseudo feedback) is more effective than with global co-occurrence analysis (Xu and Croft 1996) (pseudo feedback can also be regarded as a way to leverage clustering hypothesis). All this work and our work seem to suggest that the clustering behavior of relevant documents may be more salient in the top-ranked documents than in the entire collection, which intuitively also makes sense as within a biased sample of top-ranked documents relevant documents may form a much more coherent cluster than they do in the entire collection.

Our analysis above also suggests that we should apply the proposed PTP algorithm to a relatively small number of top-ranked documents in real applications, which is actually beneficial in terms of reducing the computational cost.

4.6.3 Parameter α

The parameter α controls the amount of influence from the neighbors in propagation. In Figs. 5 and 6 we show the sensitivity of Precision @ 5, Precision @ 10 and Mean Average Precision to α for the SJMN and AP88-89 data sets, respectively.

Fig. 5 Sensitivity to α (SJMN Data set)

Fig. 6 Sensitivity to α (AP88-89 Data set)

As the figures show, the optimal range for good performance towards the top of the ranking is quite wide, showing that our method for term weight propagation is useful with quite a wide range of parameters. The best results are achieved somewhere in the middle. However, a small value of α can really hurt MAP, especially when the number of neighbors is small. This is expected because a small α means mostly relying on the counts from very few neighbors to estimate a language model, likely resulting in quite biased smoothing.

4.6.4 Number of neighbors

Given a certain number of top-ranked documents to use for constructing a generation graph, we may generate the graph with different numbers of neighbors for each document. Figure 7 shows the graphs of Precision @ 5, Precision @ 10 and Mean Average Precision when propagating through different numbers of neighbors for each document in the SJMN data set. This parameter is set when we construct the similarity graph and determines the number of documents to which each document propagates its weight. As the figures for Prec@5 and Prec@10 show, from some point on, we gradually lose the benefit of word propagation as we increase the number of neighbors. The reason can be that each document has to propagate some of its weight to its neighbors, and with a large number of neighbors some of that weight goes to potentially non-relevant documents. Thus top relevant documents may be discounted this way, decreasing Prec@5 and Prec@10. On the other hand, propagating to more neighbors allows low-scoring, hard-to-reach relevant documents to get some benefit from other documents and move up. That is why the MAP figure shows some improvement when we increase the number of neighbors.

Fig. 7 Sensitivity to the number of neighbors (SJMN Data set)

4.7 Combination with query expansion

Finally we study whether we can further improve retrieval accuracy by combining our smoothing method with query expansion and pseudo feedback. Feedback and query expansion have been shown to be an effective way of improving query representation (Rocchio 1971; Xu and Croft 1996; Zhai and Lafferty 2001a). Our propagation method uses different information than pseudo feedback, so intuitively we should be able to combine the two methods to further improve performance.

To test this hypothesis, we perform query expansion on top of PTP smoothing. Specifically we use the top 10 documents after term propagation to perform feedback using the mixture model approach implemented in the Lemur toolkit (Zhai and Lafferty 2001a). The basic idea of this approach is to fit a mixture model to the feedback documents and estimate a feedback topic language model, which is then interpolated with the original query model to generate an “expanded” query model for scoring documents. We used the default settings of all the parameters (i.e., 0.5 for both background noise and feedback coefficient and 20 terms for expanding the query model). Since our method helps to improve the precision at top ranks, we expect to get benefit from this new (improved) ranking for pseudo feedback. Experiment results show that this is indeed true. In Table 9, we compare the performance of “PTP Smoothing”, “Query Expansion” and “Query Expansion on top of PTP Smoothing” for two of our data sets.

Table 9 Query expansion on top of probabilistic term propagation smoothing

The results of the combination show a very interesting feature of the combined algorithm: pseudo feedback usually improves MAP, but it does not improve the precision of top-ranked documents as much. On the other hand, our method helps more on improving the precision at the top ranks. The combined algorithm has the good features of both, improving both the precision at top ranks and the mean average precision.

5 Related work

Smoothing of document language models has been studied extensively. Most work in this area uses a global background model for the purpose of smoothing (Ponte and Croft 1998; Miller et al. 1999; Hiemstra and Kraaij 1998; Zhai and Lafferty 2001b). More recent work uses some local corpus structures (Liu and Croft 2004; Kurland and Lee 2004; Tao et al. 2006) with the intuition that the local structure can provide more focused information for better estimation of language models. Our work extends all this work in that it considers multiple steps of smoothing and allows smoothing with remotely related documents. As shown in our experiment results, such extension is beneficial.

Our work is related to the clustering hypothesis (van Rijsbergen 1979). The hypothesis states that relevant documents tend to be more similar to each other than to non-relevant documents, and therefore tend to appear in the same clusters. Although we do not perform clustering explicitly, our smoothing method can be regarded as one novel way to exploit the clustering hypothesis to improve the estimation of language models. In this sense, our work is related to some previous work on document clustering (Voorhees 1985; Willett 1988; Tombros 2002) and pseudo relevance feedback (Xu and Croft 1996). It is interesting that in both our study and the work Tombros (2002), exploiting the corpus structure in documents highly similar to the query is more effective than using a larger working set of documents or the entire collection. This may suggest that the clustering behavior of relevant documents (relative to that of non-relevant documents) may be more salient in the top-ranked documents.

The problem of automatically generating links between documents in a non-hypertext environment has been studied before (Frisse 1987; Furuta et al. 1989; Wilkinson and Smeaton 1999; Kurland and Lee 2005). We used generation graphs proposed in Kurland and Lee (2005) where the graphs are used to propagate document scores; our work differs from it in that we use the graph to propagate term counts for smoothing a probabilistic language model.

The idea of using random walks for ranking purposes has also been studied before. For example, PageRank (Page et al. 1998) and Topic Specific PageRank (Haveliwala 2003) are stationary probability distributions for the Markov chain induced by random walks on the Web graph. SALSA (Lempel and Moran 2000) examines random walks on the graph derived from the link structure to find authoritative sites on a topic. A recent work (Craswell and Szummer 2007) has studied random walks on click graphs to produce a probabilistic ranking of documents for a given query. The propagation method we used in this work was a special case of the general propagation framework presented in Shakery and Zhai (2006). There they propagate the document scores in the network of documents linked to different groups of neighbors for the purpose of combining content and link information when ranking the documents in response to the given query, while here we use it for propagating term scores in the similarity network of documents for the purpose of smoothing.

6 Conclusions

In this paper, we cast the problem of smoothing document language models as a problem of propagating term counts among documents probabilistically, and presented a novel method for smoothing document language models based on this idea. A major advantage of this method over previous methods is that it provides a principled way to bring in remotely related documents to smooth the current document. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method and smoothing with remote neighbors in the document similarity graph outperforms smoothing with only immediate neighbors. Compared with other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents, which is presumably most important in practical applications as a user often only reads a few top-ranked documents. Furthermore, our method is shown to be complementary with pseudo feedback which tends to improve the average precision, and a combination of our method and pseudo feedback achieves better performance than either one alone. Although our method consistently improves precision among top-ranked documents, it does not improve the average precision so consistently. A major future research direction is to further study how to improve both the average precision and the precision in top-ranked documents.