Improved sentence retrieval using local context and sentence length
Introduction
The task of sentence retrieval is to find relevant sentences from a document base in response to a query. Sentence retrieval is used in tasks such as novelty detection, question answering, summarization and opinion mining (Fernandez et al., 2010, Murdock, 2006).

Sentence retrieval can be used as the first step of novelty detection. A possible application scenario is presented in (Harman, 2002): the user is given a smart “next” button that allows him to walk down a ranked list of documents by highlighting the next relevant and novel sentence. In this way the user avoids non-relevant and duplicate information and saves time moving through the document collection.

When used for question answering, sentence retrieval finds sentences that contain the answer to the user’s question. For example, if the question is “How far is it from Earth to Mars?”, the aim is to find sentences like “The minimum distance from Earth to Mars is about 55 million kilometers”. Giving the answer as a set of sentences is an improvement over classic document retrieval, where the user has to examine whole documents. We expect that such functionality will be common in future search engines.

When used for summarization, sentence retrieval finds a number of sentences relevant to a query in order to create a summary of documents. For example, sentence retrieval has been used to create summaries from Wikipedia articles (Ganguly, Leveling, & Jones, 2012). In (Chen & Verma, 2006) a document summarization system specialized for the medical domain is built, which retrieves and summarizes up-to-date medical information from trustworthy online sources according to users’ queries.

We see that sentence retrieval can be used in various ways to simplify the end user’s task of finding the right information in document collections.
Methods used for sentence retrieval are usually simple adaptations of document retrieval methods in which sentences are treated as documents (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). The state-of-the-art and most successful models for sentence retrieval are the vector space model (Allan et al., 2003, Fernandez et al., 2010, Zhang et al., 2004) and the language modeling approach. Language-modeling-based methods have been improved by taking into account local context, made up of the surrounding sentences or the whole document (Fernandez et al., 2010, Murdock, 2006). An attempt by Fernandez et al. to improve the TF–ISF (Term Frequency–Inverse Sentence Frequency) method by taking context into account was unsuccessful, yielding no statistically significant improvements (Fernandez et al., 2010).
In addition to the use of context, there is a modification called “the importance of the sentence within the topic of the document”, or p(d|s), that has improved several language-modeling-based methods (Fernandez et al., 2010). The improvement arose from promoting the retrieval of long sentences (Fernandez et al., 2010). Our hypothesis is that it would be valuable to also apply modifications that use the local context of sentences and promote the retrieval of longer sentences to the TF–ISF method, which showed good results in the past (Allan et al., 2003, Fernandez and Losada, 2009, Losada and Fernandez, 2007). The first modification consists of using local context (the previous and next sentence); the second is a component that promotes the retrieval of longer sentences. Related work is presented in Section 2. The corresponding new methods are explained in Sections 3 and 4. Other tested state-of-the-art methods are explained in Section 5. In Section 6 we compare our new methods to other state-of-the-art methods, with good results. We conclude the paper in Section 7.
Section snippets
Related work
Sentence retrieval methods are usually simple adaptations of document retrieval methods in which sentences are treated as documents (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). One of the first and most successful methods for sentence retrieval is the TF–ISF (Term Frequency–Inverse Sentence Frequency) method (Allan et al., 2003), a straightforward adaptation of the TF–IDF method for document retrieval. TF–IDF is a numerical statistic that indicates how important a word …
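To make the adaptation concrete, here is a minimal sketch of a TF–ISF-style ranking function in Python. The log-based weighting below is one commonly used variant; the exact formula in (Allan et al., 2003) may differ in detail, so treat the weights here as illustrative assumptions.

```python
import math
from collections import Counter

def tf_isf_score(query_terms, sentence_terms, sentence_freq, n_sentences):
    """Score one sentence against a query with a TF-ISF-style weighting.

    Sketch of a common variant: each query term contributes
    log(tf_q + 1) * log(tf_s + 1) * log((n + 1) / (0.5 + sf)),
    where sf is the number of sentences containing the term.
    """
    tf_q = Counter(query_terms)
    tf_s = Counter(sentence_terms)
    score = 0.0
    for term, q_count in tf_q.items():
        s_count = tf_s.get(term, 0)
        if s_count == 0:
            continue  # term absent from the sentence contributes nothing
        # Inverse sentence frequency: rare terms across the collection weigh more
        isf = math.log((n_sentences + 1) / (0.5 + sentence_freq.get(term, 0)))
        score += math.log(q_count + 1) * math.log(s_count + 1) * isf
    return score
```

In use, `sentence_freq` is built once over the whole sentence collection (one count per sentence containing the term), exactly as document frequencies are built in TF–IDF.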
Adding context to the TF–ISF ranking function
From the previous examples we saw that some sentence retrieval methods were improved by using the local context of sentences. An exception is the TF–ISF method. In (Doko et al., 2013) we showed that it is possible to improve the TF–ISF method by using a local context that consists of the two neighboring sentences (the previous and next sentence of the current sentence) and a recursive ranking function. While in (Fernandez et al., 2010) an attempt was made to include the local context by modifying parts of …
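Purely to illustrate the idea of letting neighboring sentences contribute evidence (this is a simplified, non-recursive sketch, not the recursive ranking function of Doko et al., 2013), one can interpolate the previous and next sentence's scores into the current score; the weight `lam` below is a hypothetical parameter.

```python
def contextual_score(scores, i, lam=0.25):
    """Blend sentence i's relevance score with those of its neighbors.

    `scores` holds the base (context-free) score of every sentence in
    document order. A fraction `lam` of the final score comes from the
    average of the previous and next sentence, so a relevant neighborhood
    lifts the current sentence.
    """
    prev_s = scores[i - 1] if i > 0 else 0.0
    next_s = scores[i + 1] if i < len(scores) - 1 else 0.0
    return (1 - lam) * scores[i] + lam * 0.5 * (prev_s + next_s)
```

With this mix, a zero-scoring sentence sandwiched between relevant sentences still receives a small positive score, which is the intuition behind context-aware ranking.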
Adding component for promoting the retrieval of longer sentences to the TF–ISF ranking function
We already mentioned that in (Fernandez et al., 2010) the probability of generating a document given a sentence, p(d|s), was used to improve several language-modeling-based sentence retrieval methods. There, p(d|s) was regarded as a measure of the importance of the sentence within the topic of the document. Multiple methods (3MM, 2S, 2S-I, DIR, JM) were tested with p(d|s), and all of them showed similarly good performance (Fernandez et al., 2010). There were no significant …
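The actual estimation of p(d|s) follows the cited work (Fernandez et al., 2010, Murdock, 2006). Purely to illustrate why such a component promotes long sentences, here is a hypothetical length-boost sketch, not the real p(d|s) computation: the base score is scaled by a factor that grows with sentence length relative to the collection average, with an assumed exponent `beta` controlling the strength of the preference.

```python
def length_boosted_score(base_score, sentence_len, avg_len, beta=0.5):
    """Promote longer sentences via a length-dependent multiplier.

    Sentences longer than the average length get a boost factor above 1,
    shorter ones a factor below 1; `beta` dampens the effect so length
    does not dominate the topical score.
    """
    boost = (sentence_len / avg_len) ** beta
    return base_score * boost
```

Two sentences with the same base score are thus reordered so the longer one ranks first, mirroring the reported effect of p(d|s).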
Other tested sentence retrieval methods
In addition to the methods already presented (Sections 3 and 4), we also included in our tests the tfmix method (Fernandez et al., 2010) and the Three mixture model (3MM) with the importance of the sentence within the topic of the document, p(d|s) (Fernandez et al., 2010, Murdock, 2006).
The tfmix method is defined in (Fernandez et al., 2010):
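The formula itself is not reproduced in this snippet. As a loose illustration only (an assumption, not the actual tfmix definition from Fernandez et al., 2010), a mixture-style term weight can be sketched as a linear interpolation of sentence-level and document-level term frequencies, so that a sentence inherits evidence from its enclosing document; `lam` is a hypothetical mixing weight.

```python
def mixed_term_frequency(tf_sentence, tf_document, lam=0.6):
    """Interpolate a term's sentence frequency with its document frequency.

    A hypothetical sketch of a mixture-style weight: `lam` close to 1
    trusts the sentence itself, smaller values lean on the surrounding
    document as context.
    """
    return lam * tf_sentence + (1 - lam) * tf_document
```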
Results and discussion
An overview of all tested methods in this paper is shown in Table 1.
The origin of each method from Table 1 is illustrated in Fig. 2.
We tested all sentence retrieval methods (Table 1) using data from the TREC Novelty tracks, a series of competitions used to test novelty detection systems. There were three TREC Novelty tracks in the years from 2002 to 2004 (Harman, 2002, Soboroff, 2004, Soboroff and Harman, 2003). The task was novelty detection, which consists of two subtasks: finding …
Conclusion
In this paper we have implemented two improvements to the TF–ISF method for sentence retrieval that were shown to be useful in methods based on the language modeling approach. In our earlier paper (Doko et al., 2013) we successfully improved the TF–ISF method using local context and called the new method TF–ISFcon. We have described this method again here. Additionally, we introduced a new method named TF–ISFlength that promotes the retrieval of long sentences by taking into account the current sentence length in comparison …
References (18)
- Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In Proceedings of the...
- Chen, P., & Verma, R., (2006). A query-based medical information summarization system using ontology knowledge. In...
- Doko et al. (2013). A recursive TF–ISF based sentence retrieval method with local context. International Journal of Machine Learning and Computing.
- Using opinion-based features to boost sentence retrieval.
- Fernandez et al. (2010). Extending the language modeling framework for sentence retrieval to include local context. Information Retrieval.
- Ganguly, D., Leveling, J., & Jones, G. J. F. (2012). DCU@INEX-2012: Exploring Sentence Retrieval for Tweet...
- Harman, D. (2002). Overview of the TREC 2002 novelty track. In Proceedings of the eleventh text retrieval conference...
- Using statistical testing in the evaluation of retrieval experiments.
- A study of statistical query expansion strategies for sentence retrieval.