Extending the language modeling framework for sentence retrieval to include local context

Fernández, Ronald T.; Losada, David E.; Azzopardi, Leif A.

doi:10.1007/s10791-010-9146-4

Extending the language modeling framework for sentence retrieval to include local context

Published: 21 October 2010

Volume 14, pages 355–389, (2011)
Cite this article

Download PDF

Information Retrieval Aims and scope Submit manuscript

Extending the language modeling framework for sentence retrieval to include local context

Download PDF

Ronald T. Fernández¹,
David E. Losada¹ &
Leif A. Azzopardi²

337 Accesses
22 Citations
Explore all metrics

Abstract

Employing effective methods of sentence retrieval is essential for many tasks in Information Retrieval, such as summarization, novelty detection and question answering. The best performing sentence retrieval techniques attempt to perform matching directly between the sentences and the query. However, in this paper, we posit that the local context of a sentence can provide crucial additional evidence to further improve sentence retrieval. Using a Language Modeling Framework, we propose a novel reformulation of the sentence retrieval problem that extends previous approaches so that the local context is seamlessly incorporated within the retrieval models. In a series of comprehensive experiments, we show that localized smoothing and the prior importance of a sentence can improve retrieval effectiveness. The proposed models significantly and substantially outperform the state of the art and other competitive sentence retrieval baselines on recall-oriented measures, while remaining competitive on precision-oriented measures. This research demonstrates that local context plays an important role in estimating the relevance of a sentence, and that existing sentence retrieval language models can be extended to utilize this evidence effectively.

Natural language processing: state of the art, current trends and challenges

Article 14 July 2022

Large-Language-Models (LLM)-Based AI Chatbots: Architecture, In-Depth Analysis and Their Performance Evaluation

Pre-trained models for natural language processing: A survey

Article 15 September 2020

1 Introduction

The sentence retrieval (SR) task consists of finding relevant sentences from a document base given a query. This task is very useful in a wide range of Information Retrieval (IR) applications, such as summarization, question answering, and opinion mining. SR is a challenging problem area that has attracted a great deal of attention recently (Allan et al. 2003; Losada 2008; Losada and Fernández 2007; Murdock 2006; White et al. 2005). The bulk of SR methods proposed in the literature are a straight-forward adaptation of standard retrieval models (such tf-idf, BM25, Language Models, etc), where the sentence is the unit of retrieval, as opposed to the document. This leads to SR models which estimate relevance based only on the match between query and sentence terms. The state of the art SR method is known as term frequency—inverse sentence frequency (tfisf) which is analogous to the traditional tf-idf method used in document retrieval (Allan et al. 2003; Losada 2008). While, numerous attempts to develop more sophisticated models that employ techniques, such as Natural Language Processing and Clustering have been proposed (Kallurkar et al. 2003; Li and Croft 2005; Zhang et al. 2003), they have failed to significantly and consistently outperform the tfisf method. Consequently, little progress has been made in terms of improving sentence retrieval effectiveness.

To develop a more effective sentence retrieval method, we argue that the assumption engaged as a result of the naive application of document retrieval, i.e. that all sentences are independent, does not hold. This is because a sentence is surrounded by other sentences which help to contextualize it. Also the sentence is part of a document, and this sentence may or may not be important in representing the topic of the document. Presently, this local context is either ignored or underutilized by existing methods. We posit that by incorporating the local context within SR models, more effective SR methods can be developed.

The reasons for this are as follows: Any model using only standard term statistics to match query and sentences will suffer severely from the vocabulary mismatch problem because there is little overlap between the query and sentence terms. Intuitively, the local context could be used to improve retrieval, by helping to mitigate the difficulties posed by the vocabulary mismatch rooted in the sparsity of sentences. Additionally, current methods do not exploit the importance of a sentence in a document, which we posit is an important factor in determining the relevance of a sentence. A relevant sentence needs to be indicative of the query topic, but also representative and important in the context of the document, i.e. we assume that key statements within a document are more likely to be relevant.

To this aim, we propose a novel reformulation of the SR problem that includes the local context in a Language Modeling (LM) framework. Within this principled framework, it is possible to naturally include additional evidence into the smoothing process in order to enrich the representation of sentences. Also, the model provides a way to include a query-independent probability that encodes the importance of a sentence in a document. In a set of experiments performed over several TREC test collections, we compare the proposed models against existing SR models and demonstrate that using local context within a LM framework delivers retrieval performance that significantly outperforms the current state of the art in sentence retrieval.

The remainder of this paper is organized as follows. Section 2 presents previous work related to this research. Section 3 explains the methods we propose to address the SR problem. Section 4 reports on the conducted experiments and analyzes the outcomes. The paper concludes with Sect. 5, where a summary of our findings and directions for future work are presented.

2 Related work

In this paper, we adopt the same definition of the sentence retrieval problem as proposed in the TREC Novelty Tracks (Harman 2002; Soboroff 2004; Soboroff and Harman 2003). Although these tracks are mostly focused on researching redundancy filtering, they also involve a SR task that enables research into how to retrieve sentences that are relevant to a given query.

As previously mentioned, there have been numerous SR methods that have been proposed in the literature. One of the first methods was coined as tfisf (Allan et al. 2003). It is an adaptation of the document retrieval method tf-idf, but at the sentence level. This simple approach is regarded as the state of the art in SR as it has been shown to consistently outperform other methods (Allan et al. 2003; Fernández and Losada 2009; Losada and Fernández 2007). As a matter of fact, this parameter-free method has been shown to perform at least as well as the best performing empirically tuned and trained SR models based on BM25 or LMs (Fernández and Losada 2009; Losada and Fernández 2007). While this tends not to be the case in document retrieval, on other tasks where the unit of retrieval is smaller such as passage retrieval, vector-space models have performed empirically well. For instance, Kaszkiel and Zobel (1997, 2001) showed that some cosine and pivoted models are highly effective for document ranking based on passages. Although we evaluate here SR (rather than document retrieval), past studies on passage-based document retrieval confirm also that vector-space methods are also state of the art models for query-passage scoring.

Li and Croft (2005) analyzed the components of sentences and identified patterns (such as phrases, name entities and combination of query terms) to estimate the relevance of the sentences. Although this method succeeded in detecting redundant information, it was not able to improve the tfisf baseline to estimate relevance. Clustering methods have been also considered as alternative techniques to improve SR models, such methods have shown mixed performance (Kallurkar et al. 2003; Zhang et al. 2003) seldom improving upon the tfisf baseline. These cluster methods also incur additional computation costs and increased complexity making them unattractive to implement. Query expansion techniques have been also proposed to improve the performance of current sentence retrieval approaches. Among them, the most common is query expansion via pseudo-relevance feedback (Collins-Thompson et al. 2002; Losada 2008) and with selective feedback (Jaleel et al. 2004; Losada and Fernández 2007), or relevance models (Liu and Croft 2002). While query expansion techniques tend to improve performance by addressing the vocabulary mismatch problem, they rely on good performance during the first pass of retrieval to realize such improvements.

In this paper, we reformulate the problem of sentence retrieval within the LM framework, where localized smoothing is employed to improve the representation of sentences. The work most related to this research has been performed by Losada and Fernández (2007) and Murdock (2006). In Losada and Fernández (2007), the local context of a sentence was informally introduced into the computation of sentence similarity. Basically, extra weight was given to those terms that have high frequency in the associated documents. In Murdock (2006), the estimation of the sentence language model included some local context, and combines the evidence from the sentence and document level. More specifically, a simple mixture model of the sentence, document and collection was proposed in order to form a better representation of the sentence. From the limited experiments reported, Murdock showed that the mixture model was better than other LM methods with the TREC novelty data. However, the results are far from conclusive because competitive SR methods, such as tfisf, were not evaluated. Nor was any indication of the sensitivity of the method w.r.t the smoothing parameters reported. In this paper, we provide a more general framework that encompasses both previous formulations using Language Models, but also provides avenues for incorporating other forms of local context.

3 Sentence retrieval models

The SR task consists of estimating the relevance of each sentence s in a given document set, and supplying the user with a ranked list of sentences that satisfy his/her need (expressed as a user query q). In this section, we first outline the standard LM approach applied to the problem of SR. Then, we propose a novel reformulation which includes local context seamlessly and intuitively within the model. Finally, we conclude the section with a description of baseline SR models (tfisf and BM25).

3.1 Sentence retrieval with language models (standard method)

Language Models are probabilistic mechanisms to explain the generation of text (Ponte and Croft 1998). The simplest LM is the unigram LM, which consists of associating a probability to each word of the vocabulary (Hiemstra 2001; Miller et al. 1999; Zhai and Lafferty 2001). This is a very intuitive and powerful approach that has been shown to be very effective in many IR tasks, such as ad-hoc retrieval (Zhai and Lafferty 2001), distributed IR (Si et al. 2002), and expert finding (Balog et al. 2009).

Given the SR problem, the idea is to estimate relevance according to the probability of generating a sentence s given the query q, expressed as p(s|q). Instead of directly estimating this probability, Bayes Theorem is applied, and sentences can be ranked using the query-likelihood approach, p(q|s). ^{Footnote 1} The probability of a query q given the sentence s can then be estimated using the standard LM approach, where for each sentence s, a sentence LM is inferred. From the sentence model θ_s it is assumed that each query term t is sampled independently and identically, such that:

$$ p(q|\theta_s) = \prod_{t \in q} p(t|\theta_s)^{c(t,q)} $$

(1)

where, c(t, q) is the number of times the term t appears in q. The sentence model is constructed through a mixture between the probability of a term in the sentence and the probability of a term occurring in some background collection (i.e. maximum likelihood estimators of sentence and collection, respectively). This is usually performed in one of two ways by using (a) Jelinek–Mercer (JM) smoothing as shown in (2), or (b) Dirichlet (DIR) smoothing as shown in (3).

$$ p(t|\theta_s) = (1-\lambda) p(t|s) + \lambda p(t) $$

(2)

$$ p(t|\theta_s) = \frac{c(t,s) + \mu p(t)}{c(s) + \mu} $$

(3)

where c(t, s) is the number times that t appears in s, and c(s) is the number of terms in the sentence. λ and μ are parameters that control the amount of smoothing. Note that, in (2) and (3), the smoothing expression ignores any local context and resorts immediately to the most general background knowledge p(t). This is a strong assumption because it focuses the computation on sentence and collection statistics, without regard to any reference to other terms and phrases in sentences within the same document. As previously mentioned, many SR models (Allan et al. 2003) take similar simplifications as the query-sentence similarity values do not take into account any information from the document (i.e. all sentences are treated independently).

JM and DIR smoothing yield to retrieval matching functions with specific length retrieval trends. In Losada and Azzopardi (2008a) and Smucker and Allan (2005), the authors studied these trends. Losada and Azzopardi (2008a) reported that DIR smoothing performs better than JM smoothing by showing that the document length pattern resembles the relevance pattern. They showed that DIR priors balance the query modeling and the document modeling roles, whereas JM smoothing does not consider the document length in the smoothing process. Thus, JM leads to poor retrieval performance because documents tend to be longer than the documents retrieved by DIR and the smoothing cannot compensate this. Smucker and Allan (2005) demostrated that DIR smoothings performance advantage arises from an implicit document prior that favors longer documents by smoothing them less. They tested the performance of a DIR prior and the JM smoothing with and without the document prior and showed that both methods smooth documents identically, except that the DIR prior smooths longer documents less. The result of this meant that the DIR prior tends to favor the retrieval of longer documents. Given the sentence retrieval problem, it is an open question as to what kind of length correction is appropriate for this task and whether the implicit length correction of smoothing methods employed help or hinder in the retrieval of relevant sentences.

3.2 Sentence retrieval using language models with local context

In this section, we relax the independence assumption between sentences and assume that the document (i.e. the local context) plays an important role in determining the relevance of a sentence. Therefore, we treat the SR problem as a problem of estimating the probability of the query and the document given the sentence, i.e. is the sentence likely to be a generator of both the query and the document? This assumes that there is a correlation between this likelihood, p(q, d|s) (where d is the document that contains s) and the relevance of the sentence. Thus, we posit that relevance is affected by how well the sentence explains both the document and the query topic (as opposed to the query topic alone). In order to simplify the estimation of the conditional joint probability, we can rewrite it as follows:

$$ p(q,d|s) = p(q|s,d) p(d|s) $$

(4)

where p(q|s, d) is the probability of the query given the sentence and document, and p(d|s) is the probability of the document given the sentence. Now we can clearly see that the estimation of the query likelihood will depend on both the sentence and the document. In addition, the p(d|s) provides another way in which the local context is captured, by encoding the importance of a sentence within the document. In the next subsections we consider how these probabilities can be estimated.

3.3 Estimating p(d|s)

The probability of generating the document given the sentence, p(d|s), can be regarded as a measure of the importance of the sentence within the topic of the document. Formally, this expression can be rewritten using Bayes’ rule:

$$ p(d|s) = \frac{p(s|d)p(d)}{p(s)} $$

(5)

where p(s|d) is the probability of a sentence given a document, the p(s) the probability of a sentence, and p(d) is the prior probability of a document. Here, we assume that there is no a priori preference towards any of the documents, and treat p(d) as a constant. ^{Footnote 2} The p(s|d) represents how likely the sentence is to be generated from the document, whereas p(s) represents how likely the sentence is to be generated randomly. The ratio between the two expresses the importance of the sentence. Hence, in order to estimate p(d|s), we compute p(s) as:

$$ p(s) = \prod_{t \in s} p(t)^{c(t,s)} $$

(6)

where p(t) can be calculated using the maximum likelihood estimator of the term in a large collection: p(t|C) (where C is the collection). Analogously, we define the probability of a sentence s given a document d as:

$$ p(s|d) = \prod_{t \in s} p(t|d)^{c(t,s)} $$

(7)

where p(t|d) is the probability of generating t from the maximum likelihood estimator of the document, and c(t, s) usually equals one as most terms only appear once in a sentence (unless the term is a stop word). It is to be noted that the problem of obtaining null probabilities from these estimates does not exist because terms that occur in a sentence will have non-zero probability in the LM of the document. Observe that p(d|s) will give preference to those sentences that are central to the document’s topics (i.e. high p(s|d)) but also rare within the collection (i.e. low p(s)). In this paper we carefully study the effect of p(d|s) on performance, and have designed a complete set of experiments where we compare the estimation described above against the simplest (and naive) assumption: p(d|s) is uniform.

3.4 Estimating p(q|s, d)

In order to estimate the query likelihood given the sentence and the document, we do this in a similar manner to the standard approach: first we assume that there is a model θ_s,d which generates the query terms, such that the probability of query given the sentence and the document is:

$$ p(q|s,d) = \prod_{t \in q} p(t|\theta_{s,d})^{c(t,q)} $$

(8)

The LM p(t|θ_s,d) is determined by the sentence and the local context denoted by d, thus we can represent the model as a mixture between the probability of a term in the sentence and the probability of a term in a document, which is then smoothed by the background model. The idea is that the terms in the document provide meaning to the sentence, and can improve the estimate of the relevance of a sentence.

For the time being, we assume that p(t|d) is the normalized term frequency of t in d, but later we explore restricting this estimate to the sentences surrounding the sentence s.

There are several ways in which a mixture model can be defined using smoothing:

3.4.1 Three mixture model (3MM)

The first model we propose here is a mixture of three LMs. This model assumes that queries are generated from a mixture of three different probability distributions: a LM for the sentence, p(t|s), a LM for the document, p(t|d), and a LM for the collection, p(t|C) (or, simply, p(t)). Formally, we define this approach as:

$$ p(t|\theta_{s,d}) =\lambda p(t|s) + \gamma p(t|d) + (1-\lambda -\gamma) p(t) $$

(9)

where λ and γ are smoothing parameters such that λ, γ ∈ [0, 1]. This estimator was initially proposed by Murdock (2006). Other authors have also applied 3MMs for other tasks such as question-answering (Xue et al. 2008). Since the 3MM is very general, it is worth considering alternatives which smooth the sentence with the document and the collection but in a length-dependent way. This can be achieved by either first smoothing with the document proportionally to the sentence, and then interpolating with the collection (i.e. the Two Stage Model). Or, alternatively, first interpolating the sentence and the document, and then smoothing with the collection proportional to the sentence length. We shall detail these methods next.

3.4.2 Two-stage model (2S)

The two-stage model adopted here is a variant of the well-known two-stage model used for document retrieval (Zhai and Lafferty 2002). This model is a combination of Dirichlet (DIR) and Jelinek–Mercer (JM) smoothing. Rather than smoothing with the collection model in both stages, we adapt here the model to the characteristics of the SR task and, therefore, the DIR stage uses p(t|d) while the JM stage uses p(t) for smoothing purposes. This is a simple and natural application of the two-stage smoothing for our problem. The formal expression is:

$$ p(t|\theta_{s,d}) = (1-\lambda) \frac{c(t,s) + \mu p(t|d)}{c(s) + \mu} + \lambda p(t) $$

(10)

3.4.3 Two-stage model, stages inverted (2S-I)

We propose here a two-stage model where the order in which DIR and JM smoothing methods are applied is inverted:

$$ p(t|\theta_{s,d}) = ( 1 - \beta ) ( (1-\lambda) p(t|s) + \lambda p(t|d) ) +\beta p(t) $$

(11)

where $\beta = \frac{\mu}{c(s)+\mu}$. The sentence model is first smoothed using linear interpolation with the document’s model. Next, Dirichlet is applied to smooth with the collection model.^{Footnote 3} By smoothing in this way the first stage provides a new estimate of the foreground terms by combining the sentence and the document (through linear interpolation), and then the next stage adjusts the estimates with the background language model proportional to the length of the sentence. By inverting the smoothing methods, different length normalization schemes are applied to the sentence language models. In later sections, we shall analytically and empirically show how the 2S and 2S-I models differ in this respect.

Observe that DIR and JM smoothing can also be included within this framework assuming that p(q|s, d) = p(q|s) and applying DIR or JM to estimate the likelihood. If p(d|s) is uniform, then these models are equivalent to the ones discussed in Sect. 3.1. However, if p(d|s) is not uniform then we get a novel combination of these popular smoothing strategies with the estimation of the importance of sentences in documents. Table 1 summarizes the different proposed models and informs about what configurations are novel (and, therefore, have not been tested in the literature).

Table 1 Language models included in our study

Extending the language modeling framework for sentence retrieval to include local context

Abstract

Similar content being viewed by others

Natural language processing: state of the art, current trends and challenges

Large-Language-Models (LLM)-Based AI Chatbots: Architecture, In-Depth Analysis and Their Performance Evaluation

Pre-trained models for natural language processing: A survey

1 Introduction

2 Related work

3 Sentence retrieval models

3.1 Sentence retrieval with language models (standard method)

3.2 Sentence retrieval using language models with local context

3.3 Estimating p(d|s)

3.4 Estimating p(q|s, d)

3.4.1 Three mixture model (3MM)

3.4.2 Two-stage model (2S)

3.4.3 Two-stage model, stages inverted (2S-I)

3.5 Baseline sentence retrieval models

4 Empirical study

4.1 Experimental setup

4.1.1 Smoothing with surrounding sentences

4.2 Experimental results

4.2.1 Influence of localized smoothing

4.2.2 Impact of sentence importance

4.2.3 Incorporating context into the baselines

4.3 Analysis

4.3.1 Summary and discussion

5 Conclusions and future work

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix

Appendix

1.1 Localized smoothing

1.1.1 Training with TREC 2003

1.1.2 Training with TREC 2004

1.2 Sentence importance

1.2.1 Training with TREC 2003

1.2.2 Training with TREC 2004

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation