Improved sentence retrieval using local context and sentence length
Introduction
The task of sentence retrieval is to find relevant sentences from a document base in response to a query. Sentence retrieval is used in tasks such as novelty detection, question answering, summarization and opinion mining (Fernandez et al., 2010, Murdock, 2006).

Sentence retrieval can be used as the first step of novelty detection. A possible application scenario is presented in (Harman, 2002): the user is given a smart “next” button that allows him to walk down a ranked list of documents by highlighting the next relevant and novel sentence. In this way the user avoids non-relevant and duplicate information and saves time moving through the document collection.

When used for question answering, sentence retrieval finds sentences that contain the answer to the user’s question. For example, if the question is “How far is it from Earth to Mars?”, the aim is to find sentences like “The minimum distance from Earth to Mars is about 55 million kilometers”. Giving the answer as a set of sentences is an improvement over classic document retrieval, where the user has to examine whole documents. We expect that such functionality will be common in future search engines.

When used for summarization, sentence retrieval finds a number of sentences relevant to a query in order to create a summary of documents. For example, sentence retrieval has been used to create summaries from Wikipedia articles (Ganguly, Leveling, & Jones, 2012). In (Chen & Verma, 2006) a document summarization system specialized for the medical domain is built, which retrieves and summarizes up-to-date medical information from trustworthy online sources according to users’ queries.

We see that sentence retrieval can be used in various ways to simplify the end user’s task of finding the right information in document collections.
Methods used for sentence retrieval are usually simple adaptations of document retrieval methods in which sentences are treated as documents (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). The state-of-the-art and most successful models for sentence retrieval are the vector space model (Allan et al., 2003, Fernandez et al., 2010, Zhang et al., 2004) and the language modeling approach. Language-modeling-based methods have been improved by taking into account local context, made up of the surrounding sentences or the whole document (Fernandez et al., 2010, Murdock, 2006). An attempt by Fernandez et al. to improve the TF–ISF (Term Frequency–Inverse Sentence Frequency) method by taking context into account was unsuccessful, yielding no statistically significant improvements (Fernandez et al., 2010).
In addition to the use of context, there is a modification called “the importance of the sentence within the topic of the document”, or p(d|s), that has improved several language-modeling-based methods (Fernandez et al., 2010). The improvement arose from promoting the retrieval of long sentences (Fernandez et al., 2010). Our hypothesis is that it would be valuable to also apply modifications that use the local context of sentences and promote the retrieval of longer sentences to the TF–ISF method, which showed good results in the past (Allan et al., 2003, Fernandez and Losada, 2009, Losada and Fernandez, 2007). The first modification consists of using local context (the previous and next sentence); the second is a component that promotes the retrieval of longer sentences. Related work is presented in Section 2. The corresponding new methods are explained in Sections 3 and 4. Other tested state-of-the-art methods are explained in Section 5. In Section 6 we compare our new methods to other state-of-the-art methods, with good results. We conclude the paper in Section 7.
Section snippets
Related work
Sentence retrieval methods are usually simple adaptations of document retrieval methods in which sentences are treated as documents (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). One of the first and most successful methods for sentence retrieval is the TF–ISF (Term Frequency–Inverse Sentence Frequency) method (Allan et al., 2003), a straightforward adaptation of the TF–IDF method for document retrieval. TF–IDF is a numerical statistic that indicates how important a word …
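To make the adaptation concrete, here is a minimal sketch of a TF–ISF-style ranking function in Python. The log-based weighting below is one commonly used variant; the exact formula in (Allan et al., 2003) may differ in detail, so treat the weights here as illustrative assumptions.

```python
import math
from collections import Counter

def tf_isf_score(query_terms, sentence_terms, sentence_freq, n_sentences):
    """Score one sentence against a query with a TF-ISF-style weighting.

    Sketch of a common variant: each query term contributes
    log(tf_q + 1) * log(tf_s + 1) * log((n + 1) / (0.5 + sf)),
    where sf is the number of sentences containing the term.
    """
    tf_q = Counter(query_terms)
    tf_s = Counter(sentence_terms)
    score = 0.0
    for term, q_count in tf_q.items():
        s_count = tf_s.get(term, 0)
        if s_count == 0:
            continue  # term absent from the sentence contributes nothing
        # Inverse sentence frequency: rare terms across the collection weigh more
        isf = math.log((n_sentences + 1) / (0.5 + sentence_freq.get(term, 0)))
        score += math.log(q_count + 1) * math.log(s_count + 1) * isf
    return score
```

In use, `sentence_freq` is built once over the whole sentence collection (one count per sentence containing the term), exactly as document frequencies are built in TF–IDF.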
Adding context to the TF–ISF ranking function
From the previous examples we saw that some sentence retrieval methods were improved by using the local context of sentences. An exception is the TF–ISF method. In (Doko et al., 2013) we showed that it is possible to improve the TF–ISF method by using a local context that consists of the two neighboring sentences (the previous and next sentence of the current sentence) and a recursive ranking function. While in (Fernandez et al., 2010) an attempt was made to include the local context by modifying parts of …
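Purely to illustrate the idea of letting neighboring sentences contribute evidence (this is a simplified, non-recursive sketch, not the recursive ranking function of Doko et al., 2013), one can interpolate the previous and next sentence's scores into the current score; the weight `lam` below is a hypothetical parameter.

```python
def contextual_score(scores, i, lam=0.25):
    """Blend sentence i's relevance score with those of its neighbors.

    `scores` holds the base (context-free) score of every sentence in
    document order. A fraction `lam` of the final score comes from the
    average of the previous and next sentence, so a relevant neighborhood
    lifts the current sentence.
    """
    prev_s = scores[i - 1] if i > 0 else 0.0
    next_s = scores[i + 1] if i < len(scores) - 1 else 0.0
    return (1 - lam) * scores[i] + lam * 0.5 * (prev_s + next_s)
```

With this mix, a zero-scoring sentence sandwiched between relevant sentences still receives a small positive score, which is the intuition behind context-aware ranking.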
Adding component for promoting the retrieval of longer sentences to the TF–ISF ranking function
We already mentioned that in (Fernandez et al., 2010) the probability of generating a document given a sentence, p(d|s), was used to improve several language-modeling-based sentence retrieval methods. There, p(d|s) was regarded as a measure of the importance of the sentence within the topic of the document. Multiple methods (3MM, 2S, 2S-I, DIR, JM) were tested with p(d|s), and all of them showed similarly good performance (Fernandez et al., 2010). There were no significant …
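The actual estimation of p(d|s) follows the cited work (Fernandez et al., 2010, Murdock, 2006). Purely to illustrate why such a component promotes long sentences, here is a hypothetical length-boost sketch, not the real p(d|s) computation: the base score is scaled by a factor that grows with sentence length relative to the collection average, with an assumed exponent `beta` controlling the strength of the preference.

```python
def length_boosted_score(base_score, sentence_len, avg_len, beta=0.5):
    """Promote longer sentences via a length-dependent multiplier.

    Sentences longer than the average length get a boost factor above 1,
    shorter ones a factor below 1; `beta` dampens the effect so length
    does not dominate the topical score.
    """
    boost = (sentence_len / avg_len) ** beta
    return base_score * boost
```

Two sentences with the same base score are thus reordered so the longer one ranks first, mirroring the reported effect of p(d|s).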
Other tested sentence retrieval methods
In addition to the methods already presented (Sections 3 and 4), we also included in our tests the tfmix method (Fernandez et al., 2010) and the Three mixture model (3MM) with the importance of the sentence within the topic of the document, p(d|s) (Fernandez et al., 2010, Murdock, 2006).
The tfmix method is defined in (Fernandez et al., 2010):
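The formula itself is not reproduced in this snippet. As a loose illustration only (an assumption, not the actual tfmix definition from Fernandez et al., 2010), a mixture-style term weight can be sketched as a linear interpolation of sentence-level and document-level term frequencies, so that a sentence inherits evidence from its enclosing document; `lam` is a hypothetical mixing weight.

```python
def mixed_term_frequency(tf_sentence, tf_document, lam=0.6):
    """Interpolate a term's sentence frequency with its document frequency.

    A hypothetical sketch of a mixture-style weight: `lam` close to 1
    trusts the sentence itself, smaller values lean on the surrounding
    document as context.
    """
    return lam * tf_sentence + (1 - lam) * tf_document
```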
Results and discussion
An overview of all tested methods in this paper is shown in Table 1.
The origin of each method from Table 1 is illustrated in Fig. 2.
We tested all sentence retrieval methods (Table 1) using data from the TREC Novelty tracks, a series of competitions used to test novelty detection systems. There were three TREC Novelty tracks in the years from 2002 to 2004 (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). The task was novelty detection, which consists of two subtasks: finding …
Conclusion
In this paper we have implemented two improvements to the TF–ISF method for sentence retrieval that were shown to be useful in methods based on the language modeling approach. In our earlier paper (Doko et al., 2013) we successfully improved the TF–ISF method using local context and called the new method TF–ISFcon. We have described this method again here. Additionally, we introduced a new method named TF–ISFlength that promotes the retrieval of long sentences by taking into account the current sentence length in comparison …
References (18)
- Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In Proceedings of the...
- Chen, P., & Verma, R., (2006). A query-based medical information summarization system using ontology knowledge. In...
- Doko et al. (2013). A recursive TF–ISF based sentence retrieval method with local context. International Journal of Machine Learning and Computing.
- Using opinion-based features to boost sentence retrieval.
- Fernandez et al. (2010). Extending the language modeling framework for sentence retrieval to include local context. Information Retrieval.
- Ganguly, D., Leveling, J., & Jones, G. J. F. (2012). DCU@INEX-2012: Exploring Sentence Retrieval for Tweet...
- Harman, D. (2002). Overview of the TREC 2002 novelty track. In Proceedings of the eleventh text retrieval conference...
- Using statistical testing in the evaluation of retrieval experiments.
- A study of statistical query expansion strategies for sentence retrieval.