1 Introduction

Text summarization is the process of automatically creating a compressed version of a given text that provides useful information to users. Multi-document summarization aims to produce a summary that delivers the majority of the information content from a set of documents about an explicit or implicit main topic. Generally speaking, multi-document summarization can be either generic or topic-focused. Topic-focused summarization differs from generic summarization in that, given a specified topic description (e.g. a user profile or user query), it creates from the documents a summary that either answers the information need expressed in the topic or explains the topic.

Automatic multi-document summarization has drawn much attention in recent years and has been widely used in many web applications. A multi-document summary can concisely describe the information contained in a cluster of documents and help users understand the document cluster. For example, a number of web news services, such as Google News and NewsBlaster, have been developed to group news articles into news topics and then produce a short summary for each news topic. Users can easily understand a topic they are interested in by taking a look at its short summary. A topic-focused summary can be used to provide personalized services once user profiles are provided manually or automatically. The above news services can be personalized by collecting users' profiles and delivering both the related news articles and a news summary biased to the profile of the specific user. Other examples include question answering systems, in which a question-focused summary is usually required to answer the information need in the issued question.

A particular challenge for multi-document summarization is that the document set might contain much information unrelated to the main topic. Hence we need effective summarization methods to analyze the information stored in different documents and extract the important information related to the main topic. In other words, a good summary is expected to preserve the globally important information contained in the documents as much as possible, and at the same time keep the information as novel as possible.

Topic-focused multi-document summarization poses two challenges. The first is shared with generic multi-document summarization: the globally important information needs to be extracted and merged. The second is particular to topic-focused summarization: the information in the summary must be biased to the given topic, so effective summarization methods need to take this topic-biasing characteristic into account during the summarization process. In brief, a good topic-focused summary is expected to preserve the information contained in the documents as much as possible and keep that information as novel as possible, while ensuring the information is biased to the given topic.

Both generic and topic-focused multi-document summarization have been widely explored in the natural language processing and information retrieval communities. A series of workshops and conferences on automatic text summarization (e.g. NTCIR, DUC), along with special topic sessions in ACL, COLING, and SIGIR, have advanced the technology and produced a number of experimental online systems.

In recent years, graph-ranking based methods (Mani and Bloedorn 1999; Erkan and Radev 2004a, b; Mihalcea and Tarau 2005) have been proposed for generic multi-document summarization based on sentence relationships. All these methods make use of the relationships between sentences and select sentences according to the "votes" or "recommendations" from their neighboring sentences, in a manner similar to PageRank (Page et al. 1998) and HITS (Kleinberg 1999). However, none of these methods differentiates between the kinds of relationships sentences can have, i.e. cross-document relationships and within-document relationships. They all assume that the two kinds of sentence relationships are equally important, which is in fact inappropriate. In this study, we investigate the relative importance of the cross-document and within-document relationships between sentences in an extended graph-ranking based approach. The approach extends previous works by treating each kind of sentence relationship as a separate "modality" and computing the information richness of sentences based on each "modality". The approach also applies a diversity penalty process to keep the summary novel. Experiments on DUC 2002 and DUC 2004 show that the cross-document relationships between sentences are very important for multi-document summarization: the system based only on the cross-document relationships always performs better than, or at least as well as, the systems based on both the cross-document and within-document relationships between sentences.

Furthermore, we extend the graph-ranking based method to topic-focused summarization by integrating the relevance of the sentences to the specified topic, which can be considered a topic-sensitive version of the random walk model. Likewise, the relative importance of the cross-document and within-document relationships between sentences is investigated. Experiments on DUC 2003 and DUC 2005 show that the proposed graph-ranking based approach using only the cross-document relationships between sentences outperforms the top performing summarization approaches and baseline approaches, including the graph-ranking based approach using both the cross-document and within-document relationships. The cross-document relationships between sentences are thus confirmed to be very important for topic-focused multi-document summarization as well.

The rest of this paper is organized as follows: Section 2 introduces related works. The proposed graph-ranking based approaches for generic multi-document summarization and topic-focused multi-document summarization are presented in Sect. 3. The experiments and results are given in Sect. 4. Lastly, we conclude our paper in Sect. 5.

2 Related works

2.1 Generic multi-document summarization

A variety of summarization methods have been developed recently. Generally speaking, the methods can be extractive, abstractive, or hybrid. Extractive summarization is a simple but robust method for text summarization: it assigns saliency scores to units of the documents (e.g. sentences, paragraphs) and then extracts the units with the highest scores. Abstractive summarization (e.g. NewsBlaster) usually needs information fusion (Barzilay et al. 1999), sentence compression (Knight and Marcu 2002) and reformulation (McKeown et al. 1999). In this study, we focus on extractive summarization.

The centroid-based method (Radev et al. 2004) is one of the most popular extractive summarization methods. MEAD (Radev et al. 2003) is an implementation of the centroid-based method that assigns scores to sentences based on sentence-level and inter-sentence features, including cluster centroids, position, TF*IDF, etc. Based on MEAD, an online system, NewsInEssence, has been developed to summarize online news articles.

NeATS (Lin and Hovy 2002) is a multi-document summarization project at ISI based on the single-document summarizer SUMMARIST. Sentence position, term frequency, topic signature and term clustering are used to select important content. Stigma word filters and time stamps are used to improve cohesion and coherence.

XDoX (Hardy et al. 2002) is a cross-document summarizer designed specifically to summarize large document sets. It identifies the most salient themes within the set by passage clustering and then composes an extraction summary reflecting these main themes. The passages are clustered based on n-gram matching. Many other works have also explored finding topic themes in the documents for summarization; e.g. Harabagiu and Lacatusu (2005) investigate five different topic representations and introduce a novel representation of topics based on topic themes. Nenkova et al. (2006) thoroughly study the contribution to summarization of three factors related to frequency: content word frequency, composition functions for estimating sentence importance from word frequency, and adjustment of frequency weights based on context; they show that a frequency-based summarizer can achieve performance comparable to that of state-of-the-art systems, but only with a good composition function. Zhang et al. (2002) propose to use Cross-document Structure Theory (CST) to enhance an arbitrary multi-document extract by replacing low-salience sentences with other sentences that increase the total number of CST relationships included in the summary. Sentence ordering for multi-document summarization has also been explored to make summaries coherent and readable (Bollegala et al. 2006; Ji and Pulman 2006).

Graph-ranking based methods have recently been proposed to rank sentences or passages. An early summarization method (Salton et al. 1997) generates intra-document links between passages of a document, characterizes the structure of the document based on the intra-document linkage pattern, and then applies this knowledge of text structure to passage extraction. Websumm (Mani and Bloedorn 1999) uses a graph-connectivity model and operates under the assumption that nodes which are connected to many other nodes are likely to carry salient information. LexRank (Erkan and Radev 2004a, b) computes sentence importance based on the concept of eigenvector centrality: it constructs a sentence connectivity matrix and computes sentence importance with an algorithm similar to PageRank (Page et al. 1998). Mihalcea and Tarau (2005) also propose similar algorithms based on PageRank and HITS (Kleinberg 1999) to compute sentence importance for single-document summarization; for multi-document summarization, they use a meta-summarization process to summarize the meta-document produced by assembling the single-document summaries. Instead of ranking sentences, Li et al. (2006) propose a novel approach that derives event relevance from the documents, where an event is defined as one or more event terms along with the associated named entities, and then applies the PageRank algorithm to estimate the significance of an event for inclusion in a summary.

The above graph-based methods make uniform use of all kinds of sentence relationships under the inaccurate assumption that the cross-document and within-document relationships between sentences are equally important for multi-document summarization. We extend these graph-based works by differentiating the two kinds of relationships between sentences and thoroughly investigating their relative importance for generic multi-document summarization in this study.

2.2 Topic-focused document summarization

Most topic-focused document summarization methods integrate the information of the given topic or query into generic summarizers and extract sentences suiting the user's declared information need. In Saggion et al. (2003), a simple query-based scorer, which computes the similarity value between each sentence and the query, is incorporated into a generic summarizer to produce the query-based summary. The query words and named entities in the topic description are investigated in Ge et al. (2003) and CLASSY (Conroy and Schlesinger 2005) for event-focused/query-based multi-document summarization. In Hovy et al. (2005), the important sentences are selected based on the scores of basic elements (BE). CATS (Farzindar et al. 2005) is a topic-oriented multi-document summarizer which first performs a thematic analysis of the documents and then matches these themes with the ones identified in the topic. BAYESUM (Daumé and Marcu 2006) extracts sentences by comparing query models against sentence models using language modeling techniques from the IR framework. Maximal marginal relevance (MMR) (Carbonell and Goldstein 1998) is a method for combining query-relevance with information-novelty in the context of text retrieval and summarization: it greedily strives to reduce redundancy while maintaining query relevance when re-ranking retrieved documents and selecting appropriate passages for text summarization, and it has been widely used to remove redundancy in both generic and topic-focused summaries.

In this study, we extend the above graph-ranking based methods for topic-focused multi-document summarization by (1) integrating the relevance of the sentences to the topic into the algorithm using topic-sensitive PageRank (Haveliwala 2002), similar to the work on question-focused sentence retrieval (Otterbacher et al. 2005); and (2) differentiating the two kinds of relationships between sentences and using only the cross-document relationships in the graph-ranking based algorithm.

3 The proposed approach

3.1 Overview

The proposed approaches to generic and topic-focused multi-document summarization share the same general framework, which extends previous graph-ranking based summarization methods (Mani and Bloedorn 1999; Erkan and Radev 2004a, b; Mihalcea and Tarau 2005). The framework consists of the following three steps: (1) different affinity graphs are built to reflect the different kinds of relationships between the sentences in the document set; for topic-focused summarization, the relevance values of the sentences to the topic are also computed; (2) the information richness of the sentences is computed based on each affinity graph, and the final information richness of a sentence is either one of the computed scores or a linear combination of them; for topic-focused summarization, the biased information richness of the sentences is computed based on each affinity graph; (3) based on the whole affinity graph and the information richness scores, a diversity penalty is imposed on the sentences and the affinity rank score of each sentence is obtained to reflect both the information richness and the information novelty of the sentence. The sentences with the highest affinity rank scores are chosen to produce the summary.

The formal definitions of (biased) information richness and information novelty are given as follows:

  • Information richness Given a sentence collection \( S = \{ s_i \mid 1 \le i \le n \} \), the information richness InfoRich(s_i) is used to denote the information degree of the sentence s_i, i.e. the richness of information contained in the sentence s_i with respect to the entire collection S.

  • Biased information richness Given a sentence collection \( \chi = \{ x_i \mid 1 \le i \le n \} \) and a topic T, the biased information richness of a sentence x_i is used to denote the information degree of the sentence x_i with respect to both the sentence collection and T, i.e. the richness of information contained in the sentence x_i biased towards T.

  • Information novelty Given a set of sentences in the summary \( R = \{ s_i \mid 1 \le i \le m \} \), the information novelty InfoNov(s_i) is used to measure the novelty degree of the information contained in the sentence s_i with respect to all other sentences in the set R.

The aim of the proposed approaches is to include sentences with both high information richness and high information novelty in the generic summary, and sentences with both high biased information richness and high information novelty in the topic-focused summary.

Figure 1 shows the framework of the summarization approach. Note that the modules in the dashed box (at top right corner) are only for topic-focused document summarization.

Fig. 1 The framework of the summarization approach

3.2 Affinity graph building

3.2.1 Generic multi-document summarization

Given a sentence collection \( S = \{ s_i \mid 1 \le i \le n \} \), where n is the size of the sentence collection, the affinity weight aff(s_i, s_j) between a sentence pair s_i and s_j is calculated using the standard Cosine measure (Baeza-Yates and Ribeiro-Neto 1999) as follows:

$$ aff(s_i, s_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{\|\vec{s}_i\| \times \|\vec{s}_j\|} $$
(1)

where \( \vec{s}_i \) and \( \vec{s}_j \) are the corresponding term vectors of s_i and s_j. The weight associated with term t is calculated with the tf_t*isf_t formula, where tf_t is the frequency of term t in the corresponding sentence and isf_t is the inverse sentence frequency of term t, i.e. \( 1 + \log(N/n_t) \), where N is the total number of sentences and n_t is the number of sentences containing term t.
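To make the weighting concrete, here is a minimal Python sketch of the tf_t*isf_t vectors and the Cosine affinity of Eq. 1. It assumes sentences are already tokenized (with stop word removal and stemming done beforehand, as in our preprocessing); the function names are ours, not part of any released system.

```python
import math
from collections import Counter

def tf_isf_vectors(sentences):
    """Build tf*isf term vectors (the Eq. 1 weights) for tokenized sentences."""
    n = len(sentences)
    # n_t: number of sentences containing term t
    sent_freq = Counter(t for s in sentences for t in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        # isf_t = 1 + log(N / n_t), as defined above
        vectors.append({t: tf[t] * (1 + math.log(n / sent_freq[t])) for t in tf})
    return vectors

def affinity(v1, v2):
    """Cosine similarity between two sparse term vectors (Eq. 1)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sents = [["gunman", "kill", "police"], ["police", "kill", "gunman"], ["wildfire", "spread"]]
vecs = tf_isf_vectors(sents)
print(affinity(vecs[0], vecs[1]))  # > 0, so a link would be created between these sentences
print(affinity(vecs[0], vecs[2]))  # 0.0: no shared terms, so no link
```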

If sentences are considered as nodes, the sentence collection can be modeled as an undirected graph by generating a link between two sentences whenever their affinity weight exceeds 0, i.e. an undirected link between s_i and s_j (i ≠ j) with affinity weight aff(s_i, s_j) is constructed if aff(s_i, s_j) > 0; otherwise no link is constructed. Thus, we construct an undirected graph G reflecting the semantic relationship between sentences by their content similarity. The graph G contains all kinds of links between sentences and is called the Whole Affinity Graph. We use an adjacency (affinity) matrix M to describe the whole affinity graph G, with each entry corresponding to the weight of a link in the graph. \( M = (M_{i,j})_{n \times n} \) is defined as follows:

$$ M_{i,j} = \begin{cases} aff(s_i, s_j), & \text{if } i \ne j \\ 0, & \text{otherwise} \end{cases} $$
(2)

Then M is normalized to \( \tilde{M} \) as follows to make the sum of each row equal to 1:

$$ \tilde{M}_{i,j} = \begin{cases} M_{i,j} \Big/ \sum_{k=1}^{n} M_{i,k}, & \text{if } \sum_{k=1}^{n} M_{i,k} \ne 0 \\ 0, & \text{otherwise} \end{cases} $$
(3)

Similar to the above process, two other affinity graphs G_intra and G_inter are also built: the within-document affinity graph G_intra includes only within-document links between sentences (the entries of cross-document links are set to 0), and the cross-document affinity graph G_inter includes only cross-document links between sentences (the entries of within-document links are set to 0). Note that given a sentence pair s_i and s_j, if s_i and s_j belong to different documents, the link between s_i and s_j is a cross-document link (relationship); otherwise, the link is a within-document link (relationship). The corresponding adjacency (affinity) matrices of G_intra and G_inter are denoted by M_intra and M_inter respectively. In fact, M_intra and M_inter can be extracted from M and we have M = M_intra + M_inter. Similar to Eq. 3, M_intra and M_inter are respectively normalized to \( \tilde{M}_{\text{intra}} \) and \( \tilde{M}_{\text{inter}} \) to make the sum of each row equal to 1.
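The graph construction can be sketched as follows, reusing the affinity function from the sketch above and assuming a doc_ids sequence that records which document each sentence comes from (both names are our own):

```python
import numpy as np

def row_normalize(M):
    """Eq. 3: divide each row by its row sum; all-zero rows stay zero."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums != 0)

def build_affinity_matrices(vectors, doc_ids):
    """Build the whole (M), within-document (M_intra) and cross-document
    (M_inter) affinity matrices of Eq. 2, then row-normalize each (Eq. 3).
    doc_ids[i] identifies the document that sentence i comes from."""
    n = len(vectors)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = affinity(vectors[i], vectors[j])  # Eq. 1, sketch above
    same_doc = np.equal.outer(doc_ids, doc_ids)
    M_intra = np.where(same_doc, M, 0.0)   # keep only within-document links
    M_inter = np.where(~same_doc, M, 0.0)  # keep only cross-document links
    # note that M == M_intra + M_inter, as stated in the text
    return row_normalize(M), row_normalize(M_intra), row_normalize(M_inter)
```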

3.2.2 Topic-focused multi-document summarization

Similar to generic multi-document summarization, the whole affinity graph G, the within-document affinity graph G_intra and the cross-document affinity graph G_inter are built to reflect the different relationships between sentences. The corresponding normalized affinity matrices are denoted by \( \tilde{M} \), \( \tilde{M}_{\text{intra}} \) and \( \tilde{M}_{\text{inter}} \), respectively.

In order to compute the biased information richness of sentences, the relevance values of the sentences to the topic need to be computed. Given a topic description q and a sentence collection \( S = \{ s_i \mid 1 \le i \le n \} \) for a document set, we first compute the relevance of a sentence s_i to the topic q using the standard Cosine measure as follows:

$$ rel(s_i|q) = \frac{\vec{s}_i \cdot \vec{q}}{\|\vec{s}_i\| \cdot \|\vec{q}\|} $$
(4)

where \( \vec{s}_i \) and \( \vec{q} \) are the corresponding term vectors of s_i and q. The weight associated with term t is calculated with the tf_t*isf_t formula.

The relevance of sentence s_i to the topic q is then normalized as follows to make the sum of all relevance values of the sentences equal to 1:

$$ rel'(s_i|q) = \frac{rel(s_i|q)}{\sum_{k=1}^{n} rel(s_k|q)} $$
(5)

The relevance value of each sentence can be considered a weight attached to the corresponding node in the affinity graph, reflecting the degree of the node's topic-biasing.

3.3 Information richness computation

3.3.1 Generic multi-document summarization

The computation of the information richness scores of the sentences is based on the following three intuitions: (1) The more neighbors a sentence has, the more informative it is; (2) The more informative a sentence’s neighbors are, the more informative it is; (3) The more heavily a sentence is linked to by other informative sentences, the more informative it is. In brief, a sentence heavily linked to by many sentences with high information richness will also have high information richness. Based on the above intuitions, we apply a graph-ranking based algorithm to compute the information richness score for each node in the obtained affinity graph, similar to PageRank.

Based on the whole affinity graph G, the information richness score InfoRich_all(s_i) for sentence s_i can be deduced from those of all other sentences linked with it, and it can be formulated in a recursive form as follows:

$$ InfoRich_{all}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{all}(s_j) \cdot \tilde{M}_{j,i} + \frac{(1-d)}{n} $$
(6)

And the matrix form is:

$$ \vec{\lambda} = d\tilde{M}^{T}\vec{\lambda} + \frac{(1-d)}{n}\vec{e} $$
(7)

where n is the number of sentences in S; \( \vec{\lambda} = [InfoRich_{all}(s_i)]_{n \times 1} \) is the vector containing the information richness scores of all the sentences; \( \vec{e} \) is a unit vector with all elements equal to 1; d is the damping factor, set to 0.85.

The above process can be considered a Markov chain by taking the sentences as the states, with the corresponding transition matrix given by \( d\tilde{M}^{T} + (1-d)U \), where \( U = [\frac{1}{n}]_{n \times n} \). Based on the transition matrix, we can construct a random walk model. The stationary probability distribution of each state is obtained from the principal eigenvector of the transition matrix.

For implementation, the initial information richness scores of all sentences are set to 1 and the iterative algorithm in Eq. 6 is used to compute the new information richness scores of the sentences. Convergence of the iterative algorithm is usually achieved when the difference between the information richness scores computed at two successive iterations for any sentence falls below a given threshold (0.0001 in this study).
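A minimal sketch of this power iteration, assuming \( \tilde{M} \) is available as a dense NumPy array (for instance from the matrix-building sketch above):

```python
import numpy as np

def info_richness(M_norm, d=0.85, tol=1e-4, max_iter=100):
    """Power iteration for Eqs. 6/7:
    lambda <- d * M~^T lambda + (1 - d)/n, starting from all-ones scores
    and stopping when no score changes by more than tol (0.0001 here)."""
    n = M_norm.shape[0]
    scores = np.ones(n)
    for _ in range(max_iter):
        new_scores = d * (M_norm.T @ scores) + (1 - d) / n
        if np.max(np.abs(new_scores - scores)) < tol:
            return new_scores
        scores = new_scores
    return scores
```

Passing \( \tilde{M}_{\text{intra}} \) or \( \tilde{M}_{\text{inter}} \) instead of \( \tilde{M} \) yields the within-document and cross-document scores defined below.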

Similarly, the information richness score for sentence s_i can be deduced based on either the within-document affinity graph G_intra or the cross-document affinity graph G_inter as follows:

$$ InfoRich_{intra}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{intra}(s_j) \cdot (\tilde{M}_{intra})_{j,i} + \frac{(1-d)}{n} $$
(8)
$$ InfoRich_{inter}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{inter}(s_j) \cdot (\tilde{M}_{inter})_{j,i} + \frac{(1-d)}{n} $$
(9)

The model constructed in Eq. 8 is called the within-document random walk and the model constructed in Eq. 9 is called the cross-document random walk.

The final information richness InfoRich(s_i) of a sentence s_i can be either InfoRich_all(s_i), InfoRich_intra(s_i) or InfoRich_inter(s_i). Note that all previous graph-ranking based summarization methods have InfoRich(s_i) = InfoRich_all(s_i).

3.3.2 Topic-focused multi-document summarization

In order to compute the biased information richness scores of the sentences, the topic-sensitive PageRank is employed to incorporate the relevance of the sentences to the topic.

Based on the whole affinity graph G, the biased information richness score InfoRich_all(s_i) for sentence s_i can be deduced from those of all other sentences linked with it and from the relevance value of the sentence to the topic. It can be formulated in a recursive form as follows:

$$ InfoRich_{all}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{all}(s_j) \cdot \tilde{M}_{j,i} + (1-d) \cdot rel'(s_i|q) $$
(10)

And the matrix form is:

$$ \vec{\lambda} = d\tilde{M}^{T}\vec{\lambda} + (1-d)\vec{\alpha} $$
(11)

where \( \vec{\lambda} = [InfoRich_{all}(s_i)]_{n \times 1} \) is the vector containing the information richness scores of all the sentences; \( \vec{\alpha} = [rel'(s_i|q)]_{n \times 1} \) is the vector containing the relevance scores of all the sentences to the topic; d is the damping factor, simply set to 0.85 as in PageRank.

Likewise, the above process can be considered a Markov chain by taking the sentences as the states, with the corresponding transition matrix given by \( d\tilde{M}^{T} + (1-d)\vec{\alpha}\vec{e}^{T} \), where \( \vec{e} \) is an n × 1 unit vector with all elements equal to 1. Based on the transition matrix, we can construct a random walk model. The stationary probability distribution of each state is obtained from the principal eigenvector of the transition matrix.
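The topic-sensitive iteration differs from the generic sketch above only in its teleportation term, which uses the normalized relevance vector of Eq. 5 instead of the uniform 1/n. A sketch under the same assumptions (dense NumPy matrix; function name ours):

```python
import numpy as np

def biased_info_richness(M_norm, rel, d=0.85, tol=1e-4, max_iter=100):
    """Topic-sensitive power iteration for Eqs. 10/11:
    lambda <- d * M~^T lambda + (1 - d) * alpha, where alpha holds the
    normalized sentence-to-topic relevances rel'(s_i|q) of Eq. 5."""
    alpha = np.asarray(rel, dtype=float)
    alpha = alpha / alpha.sum()  # Eq. 5; assumes some sentence has nonzero relevance
    scores = np.ones(len(alpha))
    for _ in range(max_iter):
        new_scores = d * (M_norm.T @ scores) + (1 - d) * alpha
        if np.max(np.abs(new_scores - scores)) < tol:
            return new_scores
        scores = new_scores
    return scores
```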

The biased information richness score for sentence s_i can be deduced based on either the within-document affinity graph G_intra or the cross-document affinity graph G_inter as follows:

$$ InfoRich_{intra}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{intra}(s_j) \cdot (\tilde{M}_{intra})_{j,i} + (1-d) \cdot rel'(s_i|q) $$
(12)
$$ InfoRich_{inter}(s_i) = d \cdot \sum_{\text{all } j \ne i} InfoRich_{inter}(s_j) \cdot (\tilde{M}_{inter})_{j,i} + (1-d) \cdot rel'(s_i|q) $$
(13)

Similarly, the final biased information richness score InfoRich(s_i) of a sentence s_i can be either InfoRich_all(s_i), InfoRich_intra(s_i) or InfoRich_inter(s_i).

3.4 Diversity penalty imposition

This step aims to remove redundant information from the summary by penalizing sentences that highly overlap with other informative sentences. Based on the whole affinity graph G and the obtained final information richness scores, a greedy algorithm (Zhang et al. 2005) is applied to impose the diversity penalty and compute the final affinity rank scores of the sentences as follows:

  1. Initialize two sets \( A = \emptyset \) and \( B = \{ s_i \mid i = 1,2,\ldots,n \} \), and initialize each sentence's affinity rank score to its information richness score, i.e. ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., n.

  2. Sort the sentences in B by their current affinity rank scores in descending order.

  3. Suppose s_i is the highest ranked sentence, i.e. the first sentence in the ranked list. Move sentence s_i from B to A, and then impose the diversity penalty on the affinity rank score of each sentence linked with s_i: for each sentence s_j in B,

     $$ ARScore(s_j) = ARScore(s_j) - \omega \cdot \tilde{M}_{j,i} \cdot InfoRich(s_i) $$
     (14)

     where ω > 0 is the penalty degree factor. The larger ω is, the greater the penalty imposed on the affinity rank score. If ω = 0, no diversity penalty is imposed at all.

  4. Go to step 2 and iterate until \( B = \emptyset \) or the iteration count reaches a predefined maximum number.

In the above algorithm, the third step is the crucial one: its basic idea is to decrease the affinity rank scores of less informative sentences by the portion of information they share with the most informative one. Finally, the sentences with the highest affinity rank scores are chosen to produce the summary, subject to the summary length limit.
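A compact sketch of this greedy loop under the same assumptions as the earlier sketches (dense normalized matrix, precomputed information richness scores); the default ω = 8 below is the value tuned for the generic task in Sect. 4, while 10 is used there for the topic-focused task:

```python
import numpy as np

def diversity_penalty_rank(M_norm, info_rich, omega=8.0):
    """Greedy re-ranking of Sect. 3.4: repeatedly move the highest-scored
    sentence from B to A and penalize every remaining sentence j by
    omega * M~_{j,i} * InfoRich(s_i) (Eq. 14). Returns sentence indices in
    selection order; take sentences from the front until the length limit."""
    ar_score = np.array(info_rich, dtype=float)  # step 1: ARScore = InfoRich
    remaining = set(range(len(ar_score)))        # the set B; A is `selected`
    selected = []
    while remaining:
        i = max(remaining, key=lambda k: ar_score[k])  # steps 2-3: best sentence
        selected.append(i)
        remaining.discard(i)
        for j in remaining:                      # Eq. 14: penalize overlap with s_i
            ar_score[j] -= omega * M_norm[j, i] * info_rich[i]
    return selected
```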

4 Experiments and results

4.1 Experimental setup

4.1.1 Data set

Generic multi-document summarization was one of the fundamental tasks in DUC 2001, DUC 2002 and DUC 2004 (i.e. task 2 in each of DUC 2001, DUC 2002 and DUC 2004); we used the DUC 2001 data as the training set and the DUC 2002 and DUC 2004 data as the test sets in our experiments. In DUC 2002, 59 document sets of approximately 10 documents each were provided, and generic summaries of each document set with lengths of approximately 100 words or less were required. In DUC 2004, 50 TDT (Topic Detection and Tracking (Allan et al. 1998)) document clusters were provided, and a short summary with a length of 665 bytes or less was required for each cluster. Note that the TDT topic was not input to the summarizer. Table 1 gives a summary of the datasets used in the experiments.

Table 1 Summary of datasets for generic multi-document summarization

Topic-focused multi-document summarization was evaluated in tasks 2 and 3 of DUC 2003 and the only task of DUC 2005, each task having a gold standard data set consisting of document clusters and reference summaries. In our experiments, task 2 of DUC 2003 was used for training and parameter tuning, and the other two tasks were used for testing. The topic representations of the three topic-focused summarization tasks differ: task 2 of DUC 2003 produces summaries focused by events; task 3 of DUC 2003 produces summaries focused by viewpoints; the task of DUC 2005 produces summaries focused by DUC Topics. In the experiments, we do not differentiate the three kinds of topic representations and make use of them in the same way. Table 2 gives a short summary of the datasets.

Table 2 Summary of datasets for topic-focused multi-document summarization

As a preprocessing step, the sentences were extracted using the sentence-breaker tool provided by DUC and the dialog sentences (sentences in quotation marks) were removed. In the process of affinity graph building, the stop words were removed and Porter’s stemmer (Porter 1980) was used for word stemming.

4.1.2 Evaluation metric

For evaluation, we use the ROUGE (Lin and Hovy 2003) evaluation toolkit, which is adopted by DUC for automatic summarization evaluation. It measures summary quality by counting overlapping units, such as n-grams, word sequences and word pairs, between the candidate summary and the reference summaries. ROUGE-N is an n-gram recall measure computed as follows:

$$ ROUGE\text{-}N = \frac{\sum_{S \in \{RefSum\}} \sum_{n\text{-}gram \in S} Count_{match}(n\text{-}gram)}{\sum_{S \in \{RefSum\}} \sum_{n\text{-}gram \in S} Count(n\text{-}gram)} $$
(15)

where n stands for the length of the n-gram, Count_match(n-gram) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries, and Count(n-gram) is the number of n-grams in the reference summaries.
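For intuition only, here is a bare-bones sketch of the n-gram recall in Eq. 15 with clipped match counts; the actual ROUGE toolkit additionally handles stemming, stopwords and the length truncation discussed below, so this is not a substitute for it:

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """Eq. 15: n-gram recall of a tokenized candidate summary against
    tokenized reference summaries, with clipped match counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        total += sum(ref_counts.values())          # Count(n-gram) over references
        matched += sum(min(c, cand.get(g, 0))      # Count_match: clipped co-occurrence
                       for g, c in ref_counts.items())
    return matched / total if total else 0.0
```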

The ROUGE toolkit reports separate scores for 1-, 2-, 3- and 4-grams, and also for longest common subsequence co-occurrences. Among these scores, the unigram-based ROUGE score (ROUGE-1) has been shown to agree most with human judgments (Lin and Hovy 2003). We report three of the ROUGE metrics in the experimental results: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W (based on weighted longest common subsequence, weight = 1.2).

In order to truncate summaries longer than the length limit, we used the "-l" or "-b" option of the ROUGE toolkit, and we also used the "-m" option for word stemming.

4.2 Experimental results

4.2.1 Generic multi-document summarization

The proposed approaches are compared with the top three performing systems and two baseline systems (i.e. the lead baseline and the coverage baseline) on task 2 of DUC 2002 and task 2 of DUC 2004 respectively. The top three systems are those with the highest ROUGE scores among the participating systems in the tasks of DUC 2002 and DUC 2004 respectively. The lead baseline and the coverage baseline are the two baselines employed in the multi-document summarization tasks at DUC. The lead baseline takes the first sentences one by one from the last document in the collection, where documents are assumed to be ordered chronologically, until the summary reaches the length limit. The coverage baseline takes the first sentence from the first document, the first sentence from the second document, the first sentence from the third document, and so on, until the summary reaches the length limit.

The following three approaches based on different sentence relationships are investigated in the experiments:

  • Uniform Link: The approach computes the information richness score of a sentence based on the whole affinity graph without differentiating the cross-document and within-document relationships, as in previous graph-ranking based summarization methods, i.e. InfoRich(s_i) = InfoRich_all(s_i);

  • Intra-Link: The approach computes the information richness score of a sentence based only on the within-document relationships between sentences, i.e. InfoRich(s_i) = InfoRich_intra(s_i);

  • Inter-Link: The approach computes the information richness score of a sentence based only on the cross-document relationships between sentences, i.e. InfoRich(s_i) = InfoRich_inter(s_i).

Tables 3 and 4 show the comparison results on task 2 of DUC 2002 and task 2 of DUC 2004, respectively, where S19-S104 are the system IDs of the top performing systems, whose details are described in DUC publications. The penalty degree factor ω for the above three approaches was tuned on task 2 of DUC 2001 and set to 8.

Table 3 Performance comparison on Task 2 of DUC 2002
Table 4 Performance comparison on Task 2 of DUC 2004

We can see from the above tables that the two graph-ranking based approaches using the cross-document relationships between sentences (i.e. "Inter-Link" and "Uniform Link") substantially outperform the top performing systems and the baseline systems (as well as "Intra-Link") on both tasks over all three metrics. Among the three graph-ranking based approaches, the approach based only on the cross-document relationships between sentences (i.e. "Inter-Link") performs best on both tasks, while the approach based only on the within-document relationships between sentences (i.e. "Intra-Link") performs worst. These observations demonstrate the great importance of the cross-document relationships between sentences for generic multi-document summarization.

Figures 2–7 show the comparison results of the graph-ranking based approaches under different values of the penalty degree factor ω. Figures 2–4 show the comparison results over ROUGE-1, ROUGE-2 and ROUGE-W on DUC 2002 respectively, and Figs. 5–7 show the comparison results over the three metrics on DUC 2004 respectively.

Fig. 2 ROUGE-1 results on task 2 of DUC 2002 vs. ω

Fig. 3 ROUGE-2 results on task 2 of DUC 2002 vs. ω

Fig. 4 ROUGE-W results on task 2 of DUC 2002 vs. ω

Fig. 5 ROUGE-1 results on task 2 of DUC 2004 vs. ω

Fig. 6 ROUGE-2 results on task 2 of DUC 2004 vs. ω

Fig. 7 ROUGE-W results on task 2 of DUC 2004 vs. ω

As seen from the above figures, when the penalty degree factor ω is increased, the ROUGE scores of almost all the approaches first increase and then decrease. The reason is that the larger ω is, the greater the penalty imposed on the affinity rank score, and when ω is large enough, the diversity penalty can be over-imposed, deteriorating the final performance. The approaches considering the cross-document relationships between sentences (i.e. "Inter-Link" and "Uniform Link") always outperform the approach based only on the within-document relationships between sentences (i.e. "Intra-Link") under different values of the penalty degree factor ω. Between the two approaches considering the cross-document relationships, the approach based only on the cross-document relationships (i.e. "Inter-Link") performs better than, or at least as well as, the approach based on both the cross-document and within-document relationships (i.e. "Uniform Link") under different values of ω. This result further validates the great importance of the cross-document relationships between sentences for generic multi-document summarization.

In order to investigate how the relative contributions of the cross-document and within-document relationships between sentences influence the summarization performance, we propose the following "Union-Link" method, which computes the final information richness score of a sentence s_i by linearly combining the information richness score InfoRich_intra(s_i) based on the within-document relationships and the information richness score InfoRich_inter(s_i) based on the cross-document relationships:

$$ InfoRich(s_i) = \lambda \cdot InfoRich_{intra}(s_i) + (1-\lambda) \cdot InfoRich_{inter}(s_i) $$
(16)

where \( \lambda \in [0,1] \) is a weighting parameter specifying the relative contributions of the cross-document and within-document relationships to the final information richness of sentences. If λ = 0, InfoRich(s_i) is equal to InfoRich_inter(s_i); if λ = 1, InfoRich(s_i) is equal to InfoRich_intra(s_i); and if λ = 0.5, the cross-document and within-document relationships are assumed to be equally important. Note that "Union-Link" is almost the same as "Uniform Link" when λ is set to 0.5.
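Eq. 16 amounts to a convex combination of the two score vectors computed earlier, e.g. (function name ours):

```python
def union_link(rich_intra, rich_inter, lam):
    """Eq. 16: lam = 0 reduces to Inter-Link, lam = 1 to Intra-Link."""
    return [lam * a + (1 - lam) * b for a, b in zip(rich_intra, rich_inter)]
```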

Figures 8–10 show the performance of the "Union-Link" approach under different values of the weighting parameter λ. The summarization performance without diversity penalty imposition (ω = 0) and with diversity penalty imposition (ω = 8) on task 2 of DUC 2002 and task 2 of DUC 2004 is shown in the figures.

Fig. 8 ROUGE-1 of generic "Union-Link" vs. λ

Fig. 9 ROUGE-2 of generic "Union-Link" vs. λ

Fig. 10 ROUGE-W of generic "Union-Link" vs. λ

As seen from Figs. 8–10, the performance values of the approach decrease as λ increases over all three metrics on both the DUC 2002 and DUC 2004 tasks, which demonstrates that the smaller the relative contribution given to the cross-document relationships between sentences, the worse the summarization performance. The cross-document relationships between sentences are thus much more important than the within-document relationships for generic multi-document summarization.

4.2.2 Topic-focused multi-document summarization

Similar to generic multi-document summarization, the proposed approaches were compared with the top 3 performing systems and two baseline systems (i.e. the lead baseline and the coverage baseline) on task 3 of DUC 2003 and the task of DUC 2005 respectively. The top three systems are the systems with the highest ROUGE scores, chosen from the performing systems in the tasks of DUC 2003 and DUC 2005 respectively.

The three approaches "Uniform Link", "Intra-Link" and "Inter-Link" were investigated in the experiments. The approaches are defined in the same way as for generic multi-document summarization, and they compute the biased information richness with Eqs. 10, 12 and 13 respectively.

Tables 5 and 6 show the ROUGE comparison results on task 3 of DUC 2003 and the only task of DUC 2005 respectively. In the tables, S4–S17 are the system IDs of the top performing systems. The penalty degree factor ω for the above approaches was tuned on task 2 of DUC 2003 and set to 10.

Table 5 Performance comparison on Task 3 of DUC 2003
Table 6 Performance comparison on the task of DUC 2005

We can see from the tables that the two graph-ranking based approaches using the cross-document relationships between sentences (i.e. "Inter-Link" and "Uniform Link") substantially outperform the top performing systems and the baseline systems (as well as "Intra-Link") on both tasks over the three metrics. Among the three graph-ranking based approaches, the approach based only on the cross-document relationships between sentences (i.e. "Inter-Link") performs best on both tasks, while the approach based only on the within-document relationships between sentences (i.e. "Intra-Link") performs worst. These observations demonstrate the great importance of the cross-document relationships between sentences for topic-focused multi-document summarization.

Figures 11–16 show the comparison results of the three graph-ranking based approaches under different values of the penalty degree factor ω. Figures 11–13 show the comparison results over the three metrics on task 3 of DUC 2003 respectively, and Figs. 14–16 show the comparison results over the three metrics on DUC 2005 respectively.

Fig. 11 ROUGE-1 results on task 3 of DUC 2003 vs. ω

Fig. 12 ROUGE-2 results on task 3 of DUC 2003 vs. ω

Fig. 13 ROUGE-W results on task 3 of DUC 2003 vs. ω

Fig. 14 ROUGE-1 results on DUC 2005 vs. ω

Fig. 15 ROUGE-2 results on DUC 2005 vs. ω

Fig. 16 ROUGE-W results on DUC 2005 vs. ω

As seen from the figures, the approaches based on the cross-document relationships between sentences (i.e. "Inter-Link" and "Uniform Link") almost always outperform the approach based only on the within-document relationships between sentences (i.e. "Intra-Link") when ω varies from 0 to 18. Between the two approaches using the cross-document relationships, "Inter-Link" almost always performs better than, or at least as well as, "Uniform Link" when ω varies. This further validates the great importance of the cross-document relationships between sentences for topic-focused multi-document summarization. We can also see that the ROUGE scores of almost all the approaches first increase and then decrease as ω increases, because the diversity penalty can be over-imposed and thus deteriorate the final performance when ω is large enough.

In order to investigate how the relative contributions of the cross-document and within-document relationships between sentences influence the topic-focused summarization performance, we again use the "Union-Link" approach to compute the final biased information richness of a sentence s_i, which linearly combines the biased information richness InfoRich_intra(s_i) based on the within-document random walk and the biased information richness InfoRich_inter(s_i) based on the cross-document random walk, as in Eq. 16.

Figures 17–19 show the performance of the "Union-Link" approach when λ varies from 0 to 1. The performance of the approach without diversity penalty imposition (ω = 0) and with diversity penalty imposition (ω = 10) on task 3 of DUC 2003 and the task of DUC 2005 is shown in the figures, respectively.

Fig. 17 ROUGE-1 of topic-focused "Union-Link" vs. λ

Fig. 18 ROUGE-2 of topic-focused "Union-Link" vs. λ

Fig. 19 ROUGE-W of topic-focused "Union-Link" vs. λ

As seen from Figs. 17–19, the ROUGE values of the approach decrease as λ increases over almost all three metrics on both tasks, which demonstrates that the smaller the relative contribution given to the cross-document relationships between sentences, the worse the system performance. The cross-document relationships between sentences are thus much more important than the within-document relationships for topic-focused multi-document summarization.

In order to demonstrate the value of integrating the relevance between the sentences and the given topic for topic-focused summarization, we compare the proposed "Uniform Link" method with a baseline method that treats topic-focused summarization as generic summarization by ignoring the sentence-to-topic relevance, i.e. the baseline makes use of only the sentence-to-sentence relationships and uses Eq. 6 to compute the information richness scores of the sentences, instead of using Eq. 10 to compute the biased information richness scores.

Figures 20–22 show the comparison results over the three ROUGE metrics on the two DUC tasks, respectively. In the figures, the ROUGE values under different values of ω are presented and compared, and we can see that on both tasks the proposed method ("w/ topic") almost always outperforms the baseline method ("w/o topic") across different ω and all three ROUGE metrics. The results validate that the relevance between sentences and the topic is important for topic-focused summarization, because the relevance helps compute the biased information richness of the sentences.

Fig. 20 ROUGE-1 comparison of "Uniform Link" w/ or w/o topic weighting

Fig. 21 ROUGE-2 comparison of "Uniform Link" w/ or w/o topic weighting

Fig. 22 ROUGE-W comparison of "Uniform Link" w/ or w/o topic weighting

4.3 Discussion

The experimental results demonstrate the great importance of the cross-document relationships between sentences for both generic and topic-focused multi-document summarization, which can be explained by the essence of multi-document summarization. The aim of multi-document summarization is to extract important information from the whole document set; in other words, the information in the summary should be globally important over the whole document set. The information contained in a globally informative sentence will therefore also be expressed in the sentences of other documents, and the votes or recommendations of neighbors in other documents are more important than the votes or recommendations of neighbors in the same document. The same idea has been applied in many other applications. For example, in web search, a web page's inlinks from other websites are usually more important than the inlinks from within the same website for evaluating the importance of the page; in digital libraries, the citation of a paper by papers from other organizations is more important than citation by papers from the same organization for evaluating the quality of the paper.

Next, we explore whether the cross-document random walk model is more likely to select sentences from different documents, and whether the summarization performance would be better if sentences were chosen from more documents. The number of different documents containing the top 5 and top 10 sentences with the largest affinity rank scores is recorded for each summarization process, and the document numbers are then averaged over each DUC task. Table 7 gives the averaged numbers. We can see from the table that the cross-document random walk model ("Inter-Link") does not always select sentences from more documents than the within-document random walk model ("Intra-Link"). For the generic summarization tasks, the within-document random walk model ("Intra-Link") selects sentences from more documents than the two models considering the cross-document relationships ("Inter-Link" and "Uniform Link"), yet it has the worst summarization performance. The comparison results demonstrate that the number of different documents containing summary sentences has no correlation with the summarization performance.

Table 7 Average numbers of different documents containing summary sentences (top 5 and 10 sentences)

Finally, it is noteworthy that using only the cross-document relationships for multi-document summarization largely reduces the number of edges in the affinity graph and improves the efficiency of the iterative algorithm.

5 Conclusion and future work

In this paper we differentiate two kinds of relationships between sentences for generic multi-document summarization, i.e. the cross-document relationships and the within-document relationships. Experimental results on DUC 2002 and DUC 2004 demonstrate the great importance of the cross-document relationships between sentences: the system achieves its best performance even when based only on the cross-document relationships. Furthermore, we integrate the relevance of the sentences to the specified topic into the graph-ranking based method for topic-focused multi-document summarization. The within-document and cross-document relationships are differentiated and two separate random walk models are constructed. Experimental results on DUC 2003 and DUC 2005 demonstrate the great importance of the cross-document relationships between sentences for topic-focused multi-document summarization as well: the proposed approach achieves its best performance even when based only on the cross-document relationships.

In future work, we will apply the graph-ranking based method to web page summarization and further investigate incorporating the link structure between web pages into the summarization process. The rich link information is unique to web page summarization, and the links usually serve different purposes. We will classify the links between web pages according to their purposes and then differentiate the link information in the summarization process to improve the summarization performance.