Biased LexRank: Passage retrieval using random walks with question-based priors
Introduction
Text summarization is one of the hardest problems in information retrieval, mainly because it is not very well-defined. There are various definitions of text summarization resulting from different approaches to solving the problem. Furthermore, there is often no agreement as to what a good summary is even when we are dealing with a particular definition of the problem. In this paper, we focus on the query-based or focused summarization problem where we seek to generate a summary of a set of related documents given a specific aspect of their common topic formulated as a natural language query. This is in contrast to generic summarization, where a set of related documents is summarized without a query, with the aim of covering as much salient information in the original documents as possible.
The motivation behind focused summarization is that readers often prefer to see specific information about a topic in a summary rather than a generic summary (e.g. Tombros & Sanderson, 1998). An example summarization problem from the Document Understanding Conferences (DUC) 2006 is as follows:

Example 1
Topic: International adoption.
Focus: What are the laws, problems, and issues surrounding international adoption by American families?
Given a set of documents about a topic (e.g. “international adoption”), the systems are required to produce a summary that focuses on the given, specific aspect of that topic. In more general terms, this task is known as passage retrieval in information retrieval. Passage retrieval also arises in question answering as a preliminary step: given a question that typically requires a short answer of one or a few words, most question answering systems first try to retrieve passages (sentences) that are relevant to the question and thus potentially contain the answer. This is quite similar to summarization with the key difference being that the summarization queries typically look for longer answers that are several sentences long.
In the current work, we propose a unified method for passage retrieval with applications to multi-document text summarization and passage retrieval for question answering. Our method is a query-based extension of the LexRank summarization method introduced by Erkan and Radev (2004). LexRank is a random walk-based method that was proposed for generic summarization. Our contribution in this paper is to derive a graph-based sentence ranking method by incorporating query information into the original LexRank algorithm, which is query-independent. The result is a robust method that generates passages from a set of documents given a query of interest.
An important advantage of the method is that it has only a single parameter to tune, which effectively determines how generic (query-independent) or query-based the resultant passage should be. Therefore, in comparison to supervised learning approaches, it requires little training data. In addition, it makes no assumptions about the structure of the language in which the documents are written and does not require any particular linguistic resources (as in Tiedemann, 2005; Woods et al., 2000); its potential applications are therefore quite broad. Finally, in contrast to methods for sentence selection that primarily consider the similarity of the candidate sentences to the query (e.g. Allan et al., 2003; Llopis et al., 2002; Turpin et al., 2007), Biased LexRank exploits the information gleaned from intra-sentence similarities as well. We previously presented this method in Otterbacher, Erkan, and Radev (2005). Here, we extend our experiments to include the summarization problem, and show that our approach is very general, with promising results for more than one information retrieval problem.
Section snippets
Our approach: topic-sensitive LexRank
We formulate the summarization problem as sentence extraction, that is, the output of our system is simply a set of sentences retrieved from the documents to be summarized. To determine the sentences that are most relevant to the user’s query, we use a probabilistic model to rank them. After briefly reviewing the original LexRank method of Erkan and Radev (2004) in Section 2.1, we present in Section 2.2 an adapted, topic-sensitive (i.e. “biased”)
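The biased ranking described here can be sketched concretely. The following is a minimal illustration, not the paper's implementation: it scores sentences by power iteration on a mixture of a query-relevance prior and a row-normalised inter-sentence similarity matrix, with a single mixing parameter `lam` playing the role of the generic-vs-query-biased knob mentioned in the Introduction. For brevity, plain term-frequency cosine stands in for the idf-modified cosine of Erkan and Radev (2004), and all function names are our own.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    # Plain term-frequency cosine similarity (a simplification of the
    # idf-modified cosine used in LexRank).
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def biased_lexrank(sentences, query, lam=0.2, iters=50):
    """Score sentences with a query-biased random walk (illustrative sketch)."""
    bags = [Counter(s.lower().split()) for s in sentences]
    qbag = Counter(query.lower().split())
    n = len(sentences)
    # Query-relevance prior, normalised to sum to 1.
    rel = [cosine(b, qbag) for b in bags]
    z = sum(rel) or 1.0
    prior = [r / z for r in rel]
    # Inter-sentence similarity matrix; rows are normalised during iteration.
    sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    rows = [sum(row) or 1.0 for row in sim]
    # Power iteration: p = lam * prior + (1 - lam) * p^T * (row-normalised sim).
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [lam * prior[j]
             + (1 - lam) * sum(p[i] * sim[i][j] / rows[i] for i in range(n))
             for j in range(n)]
    return p
```

With `lam = 1` the ranking reduces to pure query similarity; with `lam = 0` it reduces to the original, query-independent LexRank, so the single parameter interpolates between the two regimes.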
A question answering example
A problem closely related to focused summarization is question answering (QA). Essentially, the only difference between QA and focused summarization is that QA addresses questions that require very specific and short answers that are usually only a few words long, whereas summarization involves questions that can be answered by composing a short document of a few sentences. A crucial first step for most QA systems is to retrieve the sentences that potentially contain the answer to the question (
Application to focused summarization
The Document Understanding Conferences summarization evaluations in 2005 and 2006 included a focused summarization task. Given a topic and a set of 25 relevant documents, the participants were required “to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement.” An example topic statement and related questions are shown in Example 1. In this section, we explain how we formulated the summarization tasks of DUC 2005 and 2006 based
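Once sentences are scored, a summary must be assembled under the 250-word limit. As a simple illustration (the exact selection procedure is not shown in this snippet), a greedy baseline takes sentences in decreasing score order until the word budget is exhausted, then restores document order for readability:

```python
def extract_summary(sentences, scores, budget=250):
    """Greedy word-budget sentence selection (illustrative baseline)."""
    chosen, used = [], 0
    # Visit sentences from highest to lowest score.
    for s, _ in sorted(zip(sentences, scores), key=lambda x: -x[1]):
        words = len(s.split())
        if used + words > budget:
            continue  # skip sentences that would overflow the budget
        chosen.append(s)
        used += words
    # Emit selected sentences in their original document order.
    return [s for s in sentences if s in chosen]
```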
Application to passage retrieval for question answering
In this section, we show how Biased LexRank can be applied effectively to the problem of passage retrieval for question answering. As noted by Gaizauskas et al. (2004), while passage retrieval is the crucial first step for question answering, QA research has typically not emphasized it. As explained in Section 3, we formulate passage retrieval at the sentence level, that is, we aim to extract sentences from a set of documents in response to a question.
We demonstrate that Biased LexRank
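Sentence-level retrieval of this kind rests on a similarity measure between the question and each candidate sentence. The graph edges in LexRank are weighted by the idf-modified cosine of Erkan and Radev (2004), which can be sketched as follows; the helper assumes pre-tokenised input and a precomputed idf table:

```python
import math
from collections import Counter


def idf_modified_cosine(x, y, idf):
    """idf-modified cosine between token lists x and y.

    idf maps each term to its inverse document frequency; unseen
    terms default to zero weight.
    """
    tx, ty = Counter(x), Counter(y)
    num = sum(tx[w] * ty[w] * idf.get(w, 0.0) ** 2 for w in tx)
    nx = math.sqrt(sum((tx[w] * idf.get(w, 0.0)) ** 2 for w in tx))
    ny = math.sqrt(sum((ty[w] * idf.get(w, 0.0)) ** 2 for w in ty))
    return num / (nx * ny) if nx and ny else 0.0
```

Because rare terms receive large idf weights, a question and a sentence sharing one distinctive content word can score higher than a pair sharing several function words, which matters for short, specific QA queries.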
Conclusion
We have presented a generic method for passage retrieval that is based on random walks on graphs. Unlike most ranking methods on graphs, LexRank can be tuned to be biased, such that the ranking of the nodes (sentences) in the graph is dependent on a given query. The method, Biased LexRank, has only one parameter to be trained, namely, the topic or query bias.
In the current paper, we have also demonstrated the effectiveness of our method as applied to two classical IR problems, extractive text
Acknowledgements
This paper is based upon work supported by the National Science Foundation under Grant No. 0534323, “BlogoCenter: Infrastructure for Collecting, Mining and Accessing Blogs” and Grant No. 0329043, “Probabilistic and Link-based Methods for Exploiting Very Large Textual Repositories”.
Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.
We would also like to thank the
References (23)
- Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems.
- Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management.
- Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In Proceedings of SIGIR 2003.
- Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics.
- Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.
- Gaizauskas, R., Hepple, M., & Greenwood, M. (2004). Information retrieval for question answering: A SIGIR 2004...
- et al. Pattern recognition in practice.
- Kurland, O., & Lee, L. (2005). PageRank without hyperlinks: Structural re-ranking using links induced by language...
- Lin, C.-Y., & Hovy, E. (2003). Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings...
- Llopis, F., Vicedo, L. V., & Ferrendez, A. (2002). Passage selection to improve question answering. In Proceedings of...
- Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.