Biased LexRank: Passage retrieval using random walks with question-based priors

https://doi.org/10.1016/j.ipm.2008.06.004

Abstract

We present Biased LexRank, a method for semi-supervised passage retrieval in the context of question answering. We represent a text as a graph of passages linked based on their pairwise lexical similarity. We use traditional passage retrieval techniques to identify passages that are likely to be relevant to a user’s natural language question. We then perform a random walk on the lexical similarity graph in order to recursively retrieve additional passages that are similar to other relevant passages. We present results on several benchmarks that show the applicability of our work to question answering and topic-focused text summarization.

Introduction

Text summarization is one of the hardest problems in information retrieval, mainly because it is not very well-defined. There are various definitions of text summarization resulting from different approaches to solving the problem. Furthermore, there is often no agreement as to what a good summary is even when we are dealing with a particular definition of the problem. In this paper, we focus on the query-based or focused summarization problem where we seek to generate a summary of a set of related documents given a specific aspect of their common topic formulated as a natural language query. This is in contrast to generic summarization, where a set of related documents is summarized without a query, with the aim of covering as much salient information in the original documents as possible.

The motivation behind focused summarization is that readers often prefer to see specific information about a topic in a summary rather than a generic summary (e.g. Tombros & Sanderson, 1998). An example summarization problem from the Document Understanding Conferences (DUC) 2006 is as follows:

Example 1

  • Topic: International adoption.

  • Focus: What are the laws, problems, and issues surrounding international adoption by American families?

Given a set of documents about a topic (e.g. “international adoption”), the systems are required to produce a summary that focuses on the given, specific aspect of that topic. In more general terms, this task is known as passage retrieval in information retrieval. Passage retrieval also arises in question answering as a preliminary step: given a question that typically requires a short answer of one or a few words, most question answering systems first try to retrieve passages (sentences) that are relevant to the question and thus potentially contain the answer. This is quite similar to summarization with the key difference being that the summarization queries typically look for longer answers that are several sentences long.

In the current work, we propose a unified method for passage retrieval with applications to multi-document text summarization and passage retrieval for question answering. Our method is a query-based extension of the LexRank summarization method introduced by Erkan and Radev (2004). LexRank is a random walk-based method that was proposed for generic summarization. Our contribution in this paper is to derive a graph-based sentence ranking method by incorporating the query information into the original LexRank algorithm, which is query-independent. The result is a very robust method that can generate passages from a set of documents given a query of interest.

An important advantage of the method is that it has only a single parameter to tune that effectively determines how much the resultant passage should be generic (query-independent) or query-based. Therefore, in comparison to supervised learning approaches, it does not require a lot of training data. In addition, it does not make any assumptions about the structure of the language in which the documents are written and does not require the use of any particular linguistic resources (as in Tiedemann, 2005, Woods et al., 2000) and, therefore, its potential applications are quite broad. Finally, in contrast to methods for sentence selection that primarily consider the similarity of the candidate sentences to the query (e.g. Allan et al., 2003, Llopis et al., 2002, Turpin et al., 2007), Biased LexRank exploits the information gleaned from intra-sentence similarities as well. We previously presented this method in Otterbacher, Erkan, and Radev (2005). Here, we extend our experiments to include the summarization problem, and show that our approach is very general with promising results for more than one information retrieval problem.

Section snippets

Our approach: topic-sensitive LexRank

We formulate the summarization problem as sentence extraction: the output of our system is simply a set of sentences retrieved from the documents to be summarized. To determine the sentences that are most relevant to the user’s query, we use a probabilistic model to rank them. After briefly describing, in Section 2.1, the original version of the LexRank method introduced by Erkan and Radev (2004), we present in Section 2.2 an adapted, topic-sensitive (i.e. “biased”)
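The snippet above is cut off by the publisher's preview, but the biased formulation it introduces can be outlined: each sentence's score mixes a question-relevance prior with the usual similarity-graph recursion. A sketch in illustrative notation (the symbols and the mixing parameter λ are mine, not necessarily the paper's exact notation):

```latex
p(s) \;=\; \lambda \,\frac{\mathrm{rel}(s \mid q)}{\sum_{v} \mathrm{rel}(v \mid q)}
\;+\; (1-\lambda) \sum_{v} \frac{\mathrm{sim}(s, v)}{\sum_{u} \mathrm{sim}(v, u)}\, p(v)
```

Here q is the question, rel(· | q) is a relevance measure such as cosine similarity to the question, sim(·, ·) is the inter-sentence similarity defining the graph edges, and λ ∈ [0, 1] is the single tuning parameter: λ = 1 ranks purely by question relevance, while λ → 0 recovers query-independent LexRank.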

A question answering example

A problem closely related to focused summarization is question answering (QA). Essentially, the only difference between QA and focused summarization is that QA addresses questions that require very specific and short answers that are usually only a few words long, whereas summarization involves questions that can be answered by composing a short document of a few sentences. A crucial first step for most QA systems is to retrieve the sentences that potentially contain the answer to the question (

Application to focused summarization

The Document Understanding Conferences summarization evaluations in 2005 and 2006 included a focused summarization task. Given a topic and a set of 25 relevant documents, the participants were required “to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement.” An example topic statement and related questions are shown in Example 1. In this section, we explain how we formulated the summarization tasks of DUC 2005 and 2006 based

Application to passage retrieval for question answering

In this section, we show how Biased LexRank can be applied effectively to the problem of passage retrieval for question answering. As Gaizauskas et al. (2004) note, although passage retrieval is a crucial first step for question answering, QA research has typically not emphasized it. As explained in Section 3, we formulate passage retrieval at the sentence level; that is, we aim to extract sentences from a set of documents in response to a question.

We demonstrate that Biased LexRank
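The implementation details are elided in these snippets, so the following is a minimal, self-contained sketch of question-biased sentence ranking in the spirit of Biased LexRank, not the authors' code: bag-of-words cosine similarity defines both the graph edges and the question prior, and a damped power iteration computes the stationary scores. The function name, the toy sentences, and the default parameter values are all illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def biased_lexrank(sentences, question, lam=0.2, iters=200):
    """Rank sentences with a question-biased random walk.

    lam mixes the question-relevance prior (lam = 1 ignores the graph)
    with the inter-sentence similarity recursion (lam = 0 is plain,
    query-independent LexRank-style ranking).
    """
    bows = [Counter(s.lower().split()) for s in sentences]
    q = Counter(question.lower().split())
    n = len(sentences)

    # Question-relevance prior, normalized to sum to 1.
    rel = [cosine(b, q) for b in bows]
    z = sum(rel) or 1.0
    prior = [r / z for r in rel]

    # Row-stochastic inter-sentence similarity matrix (the graph).
    sim = [[cosine(bows[i], bows[j]) for j in range(n)] for i in range(n)]
    for row in sim:
        z = sum(row) or 1.0
        row[:] = [v / z for v in row]

    # Power iteration: p <- lam * prior + (1 - lam) * p @ sim.
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [lam * prior[j] +
             (1 - lam) * sum(p[i] * sim[i][j] for i in range(n))
             for j in range(n)]
    return p

sentences = [
    "international adoption laws vary by country",
    "american families face legal issues in international adoption",
    "the weather was sunny in paris",
]
question = "what laws affect international adoption by american families"
scores = biased_lexrank(sentences, question)
print(scores)  # the off-topic third sentence ranks last
```

Raising lam pushes the ranking toward pure question similarity; lowering it lets mutually similar on-topic sentences reinforce one another, which is exactly the intra-sentence-similarity effect the paper exploits.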

Conclusion

We have presented a generic method for passage retrieval that is based on random walks on graphs. Unlike most ranking methods on graphs, LexRank can be tuned to be biased, such that the ranking of the nodes (sentences) in the graph is dependent on a given query. The method, Biased LexRank, has only one parameter to be trained, namely, the topic or query bias.

In the current paper, we have also demonstrated the effectiveness of our method as applied to two classical IR problems, extractive text

Acknowledgements

This paper is based upon work supported by the National Science Foundation under Grant No. 0534323, “BlogoCenter: Infrastructure for Collecting, Mining and Accessing Blogs” and Grant No. 0329043, “Probabilistic and Link-based Methods for Exploiting Very Large Textual Repositories”.

Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We would also like to thank the

References (23)

  • S. Brin et al. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems (1998)
  • G. Salton et al. Term-weighting approaches in automatic text retrieval. Information Processing and Management (1988)
  • J. Allan et al. Retrieval and novelty detection at the sentence level
  • J. Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics (1996)
  • G. Erkan et al. LexRank: Graph-based lexical centrality as salience in text. Journal of Artificial Intelligence Research (2004)
  • Gaizauskas, R., Hepple, M., & Greenwood, M. (2004). In Information retrieval for question answering: a SIGIR 2004...
  • F. Jelinek et al. Pattern recognition in practice
  • Kurland, O., & Lee, L. (2005). PageRank without hyperlinks: Structural re-ranking using links induced by language...
  • Lin, C.-Y., & Hovy, E. (2003). Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings...
  • Llopis, F., Vicedo, L. V., & Ferrendez, A. (2002). Passage selection to improve question answering. In Proceedings of...
  • R. Mihalcea et al. TextRank: Bringing order into texts
