Neurocomputing

Volume 157, 1 June 2015, Pages 356-366
Unsupervised document summarization from data reconstruction perspective

https://doi.org/10.1016/j.neucom.2014.07.046

Abstract

Due to its wide applications in information retrieval, document summarization is attracting increasing attention in natural language processing. A large body of recent literature has implemented document summarization by extracting sentences that cover the main topics of a document with minimum redundancy. In this paper, we take a different perspective, that of data reconstruction, and propose a novel unsupervised framework named Document Summarization based on Data Reconstruction (DSDR). Specifically, our approach generates a summary consisting of those sentences that can best reconstruct the original document. To model the relationship among sentences, we first introduce linear reconstruction, which approximates the document by linear combinations of the selected sentences. We then extend it to non-negative reconstruction, which allows only additive, not subtractive, linear combinations. To handle nonlinear cases and respect the geometric structure of the sentence space, we further extend linear reconstruction to the manifold adaptive kernel space, which incorporates the manifold structure via the graph Laplacian. Extensive experiments on summarization benchmark data sets demonstrate that our proposed framework outperforms the state of the art.

Introduction

With the explosion of textual information on the World Wide Web, people are overwhelmed by innumerable accessible documents. We are thus in great need of technologies like document summarization that can help users digest the information on the Web. Summarization techniques address this problem by condensing a document into a short piece of text covering its main topics. For example, search engines can provide users with snippets as previews of document contents and help them find the desired document. News sites usually describe hot news topics in concise headlines to facilitate browsing. Both snippets and headlines are specific forms of document summary in real applications. Especially in microblogging services such as Twitter, Weibo and Tumblr, a hot topic can yield millions of short messages containing much noise and redundancy. A possible solution is to summarize the massive tweets into a set of short text pieces covering the main topics [1].

Document summarization can be categorized into abstractive and extractive approaches. Given a document, an abstractive summary is generated through complex natural language processing such as information fusion, sentence compression and reformulation. Obviously, it is a difficult task for a computer to automatically generate a satisfactory summary by abstraction. So the common practice is to perform extractive summarization, in which a subset of existing sentences is selected to form the final summary. Most existing generic summarization approaches use a ranking model to select sentences from a candidate set [2], [3], [4]. But these methods suffer from the redundancy problem, in that top-ranked sentences usually share much information in common. Although some methods [5], [6], [7] try to reduce the redundancy, selecting sentences with both good information coverage and minimum redundancy is a non-trivial task.

The motivation of our work is that traditional methods usually treat document summarization as a natural language problem rather than a data reconstruction problem, although the latter has been explored extensively in the machine learning literature, for example in dimension reduction and feature selection. In this paper, we therefore propose a novel unsupervised summarization framework from the perspective of data reconstruction. To the best of our knowledge, our work is the first to treat document summarization as a data reconstruction problem. We argue that a good summary should consist of those sentences that can best reconstruct the original document; the reconstruction error thus becomes a natural criterion for measuring the quality of a summary. The new framework, named Document Summarization based on Data Reconstruction (DSDR), finds the summary sentences by minimizing the reconstruction error. DSDR learns a reconstruction function for each candidate sentence of an input document and then formulates an objective function that minimizes the error to obtain an optimal summary. The geometric interpretation is that DSDR tends to select sentences that span the intrinsic subspace of the candidate sentence space, so that it is able to cover the core information of the document.
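As a minimal illustration of the reconstruction-error criterion, the sketch below (our own Python rendering under stated assumptions, not the paper's exact formulation; the sentence vectors, e.g. TF-IDF rows of `V`, are an assumption) scores a candidate set by how well least-squares combinations of the selected sentences approximate every sentence of the document, and greedily adds the sentence that reduces the error most.

```python
import numpy as np

def reconstruction_error(V, idx):
    """Squared Frobenius error of approximating every row of V (n x d
    sentence vectors) by linear combinations of the selected rows V[idx]."""
    S = V[idx]                                    # k x d selected sentences
    # Least-squares weights W (n x k) minimising ||V - W S||_F^2.
    Wt, *_ = np.linalg.lstsq(S.T, V.T, rcond=None)
    return float(np.linalg.norm(V - Wt.T @ S) ** 2)

def greedy_summary(V, k):
    """Greedily pick k sentences, each time adding the candidate whose
    inclusion yields the smallest reconstruction error."""
    chosen = []
    for _ in range(k):
        rest = [i for i in range(len(V)) if i not in chosen]
        best = min(rest, key=lambda i: reconstruction_error(V, chosen + [i]))
        chosen.append(best)
    return chosen
```

Once the selected sentences span the intrinsic subspace of the candidate sentence space, the error drops to zero, matching the geometric interpretation above.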

We first introduce linear reconstruction to model the relationship between the document and the summary. Linear reconstruction aims to approximate the document by linear combinations of the selected summary sentences. Further, inspired by previous studies indicating psychological and physiological evidence for parts-based representation in the human brain [8], [9], [10], we assume that a document summary should consist of parts of sentences, and introduce non-negative constraints into the DSDR framework. With the non-negative constraints, our method leads to a parts-based representation in which no redundant information needs to be subtracted from the combination. Still another issue to be addressed in document summarization is the nonlinearity of the sentence space, as recent research [11] shows that raw sentences tend to be highly nonlinear in distribution. Linear functions therefore lead to a suboptimal fit, in that neither linear reconstruction nor non-negative linear reconstruction respects the nonlinear manifold structure of the sentence space. So we propose a novel nonlinear reconstruction performed in the manifold adaptive kernel space by using the graph Laplacian [11], [12], [13]. By extracting sentences which can reconstruct the document in the kernel space, we are able to produce a better summary than the classical methods.
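To make the non-negative variant concrete, one can replace the unconstrained least-squares fit with a non-negative one, so that each sentence is reconstructed by adding, never subtracting, parts of the selected sentences. The sketch below uses SciPy's `nnls` solver purely for illustration; it is not the paper's own iterative update scheme.

```python
import numpy as np
from scipy.optimize import nnls

def nonneg_reconstruction_error(V, idx):
    """Reconstruct each sentence vector from the selected ones using
    non-negative weights (a parts-based, additive-only combination)."""
    S = V[idx]                       # k x d selected sentences
    error = 0.0
    for v in V:                      # solve min_{w >= 0} ||S.T @ w - v||
        _, residual = nnls(S.T, v)
        error += residual ** 2
    return error
```

Compared with the unconstrained fit, the non-negative weights are typically sparse, so each sentence is explained by a few summary sentences rather than by cancelling combinations of many.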

It is worthwhile to highlight the following three contributions of our proposed DSDR framework in this paper:

  • We propose a novel unsupervised summarization framework from the perspective of data reconstruction; to the best of our knowledge, this is the first work to treat document summarization from such a perspective.

  • We first introduce linear reconstruction together with a greedy optimization method that solves the problem efficiently and effectively. Further, we propose non-negative reconstruction and a corresponding iterative method to obtain a global optimum. To handle nonlinearity, we finally propose nonlinear reconstruction based on the manifold adaptive kernel.

  • The proposed framework is not restricted to the three types of reconstruction discussed in this paper; it is suitable for any other data reconstruction type. Since DSDR is unsupervised and language independent, it can easily be extended to summarize non-English and even multilingual documents.

This work is an extended and improved follow-up to our earlier work [14]. In comparison, we add a substantial theoretical analysis of extending DSDR into the manifold adaptive kernel space. For both linear reconstruction and non-negative linear reconstruction, the details of the mathematical derivations are additionally introduced. We also extend the experiments, for example by implementing DSDR in the manifold adaptive kernel space and comparing it with existing approaches.

Our paper is organized as follows. We briefly review the related work in Section 2. In Section 3, we introduce the details of the Document Summarization based on Data Reconstruction (DSDR) including the optimization algorithms. Finally, we experimentally demonstrate the effectiveness of our proposed approaches in Section 4 and conclude in Section 5.

Related work

Recently, many extractive document summarization methods have been studied. Most of them assign salient scores to sentences or paragraphs of the original document and compose the resulting summary from the top units with the highest scores. The computation rules for salient scores can be categorized into three groups [15]: feature based measurements, lexical chain based measurements and graph based measurements [4]. Salient scores in feature based measurements are usually related with

The proposed framework

Suppose we have a document and its summary as shown in Fig. 1. A good summary should satisfy two conditions. First, the selected sentences should cover most of the information in all sentences, so that they can represent the original document; we call this covering process "reconstruction". Second, the reconstruction from these sentences should be concise, so that the summary keeps minimum redundancy. So we believe that a good summary should contain those

Experiments

In this study, we use the standard summarization benchmark data sets DUC 2006 and DUC 2007 for evaluation. DUC 2006 and DUC 2007 contain 50 and 45 document sets respectively, with 25 news articles in each set. The sentences in each article have been separated by NIST. Every sentence is either used in its entirety or not at all when constructing a summary. The length of a result summary is limited to 250 tokens (whitespace delimited).
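The 250-token budget with whole-sentence selection can be enforced with a helper like the one below (our own illustration, not NIST's tooling): sentences are appended in order until the next one would overflow the whitespace-token limit.

```python
def truncate_summary(sentences, limit=250):
    """Keep whole sentences (each used entirely or not at all) while the
    running count of whitespace-delimited tokens stays within the limit."""
    kept, used = [], 0
    for sent in sentences:
        n = len(sent.split())        # whitespace-delimited token count
        if used + n > limit:
            break                    # next sentence would exceed the budget
        kept.append(sent)
        used += n
    return " ".join(kept)
```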

Conclusion

In this paper, we propose a novel unsupervised summarization framework called Document Summarization based on Data Reconstruction (DSDR), which selects the most representative sentences that can best reconstruct the entire document. We first introduce linear reconstruction and extend it in two different ways (non-negative and manifold adaptive kernel). The experimental results show that our DSDR framework outperforms other state-of-the-art summarization approaches. DSDR with linear

Acknowledgment

This work was supported in part by National Basic Research Program of China (973 Program) under Grant 2013CB336500, National Natural Science Foundation of China (Grant nos. 61125203, 61173185, 90920303, and 61173186), Zhejiang Provincial Natural Science Foundation of China (Grant nos. Y1101043 and LZ13F020001) and Foundation of Zhejiang Provincial Educational Department under Grant Y201018240.

Zhanying He received the BS degree in Software Engineering from Zhejiang University, China, in 2009. She is currently a candidate for a PhD degree in computer science at Zhejiang University. Her research interests include information retrieval, data mining and machine learning.

References (47)

  • S.E. Palmer, Hierarchical structure in perceptual representation, Cogn. Psychol. (1977)
  • P. Li et al., Clustering analysis using manifold kernel concept factorization, Neurocomputing (2012)
  • L. Shou, Z. Wang, K. Chen, G. Chen, Sumblr: continuous summarization of evolving tweet streams, in: Proceedings of the...
  • S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw....
  • J.M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM (1999)
  • X. Wan, J. Yang, Collabsum: exploiting multiple document clustering for collaborative single document summarizations,...
  • J.M. Conroy, D.P. O'Leary, Text summarization via hidden Markov models, in: Proceedings of the 24th Annual...
  • S. Park, J.-H. Lee, D.-H. Kim, C.-M. Ahn, Multi-document summarization based on cluster using non-negative matrix...
  • D. Shen, J. Sun, H. Li, Q. Yang, Z. Chen, Document summarization using conditional random fields, in: Proceedings of...
  • E. Wachsmuth et al., Recognition of objects and their component parts: responses of single units in the temporal cortex of the macaque, Cereb. Cortex (1994)
  • D. Cai et al., Graph regularized non-negative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • D. Cai et al., Manifold adaptive experimental design for text categorization, IEEE Trans. Knowl. Data Eng. (2012)
  • M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Adv. Neural Inf....
  • X. He, P. Niyogi, Locality preserving projections, in: Adv. Neural Inf. Process. Syst. 16...
  • Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, X. He, Document summarization based on data reconstruction, in:...
  • M. Hu, A. Sun, E.-P. Lim, Comments-oriented document summarization: understanding documents with readers' feedback, in:...
  • Y. Gong, X. Liu, Generic text summarization using relevance measure and latent semantic analysis, in: Proceedings of...
  • D. Wang, T. Li, S. Zhu, C. Ding, Multi-document summarization via sentence-level semantic analysis and symmetric matrix...
  • G.A. Miller, WordNet: a lexical database for English, Commun. ACM (1995)
  • Y. Choi, Tree pattern expression for extracting information from syntactically parsed text corpora, Data Min. Knowl. Discov. (2011)
  • A. Nenkova, L. Vanderwende, K. McKeown, A compositional context sensitive multi-document summarizer: exploring the...
  • M. Amini, P. Gallinari, The use of unlabeled data to improve supervised learning for text summarization, in:...
  • J. Kupiec, J. Pedersen, F. Chen, A trainable document summarizer, in: Proceedings of the 18th Annual International ACM...

    Chun Chen received the BS degree in Mathematics from Xiamen University, China, in 1981, and his MS and PhD degrees in Computer Science from Zhejiang University, China, in 1984 and 1990, respectively. He is a professor in the College of computer science, Zhejiang University. His research interests include information retrieval, data mining, computer vision, computer graphics and embedded technology.

    Jiajun Bu received the BS and PhD degrees in computer science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in the College of Computer Science, Zhejiang University. His research interests include embedded system, data mining, information retrieval and mobile database.

    Can Wang received the BS degree in economics, MS and PhD degrees in computer science from Zhejiang University, China, in 1995, 2003 and 2009, respectively. He is currently a faculty member in the College of Computer Science at Zhejiang University. His research interests include information retrieval, data mining and machine learning.

    Lijun Zhang received the BS and PhD degrees in computer science from Zhejiang University, China, in 2007 and 2012, respectively. He worked as a postdoc in Michigan State University, USA, from 2012 to 2014. He is currently an associate professor in computer science at Nanjing University, China. His research interests include machine learning, information retrieval, and data mining.

Deng Cai is a professor in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the PhD degree in computer science from University of Illinois at Urbana Champaign in 2009. Before that, he received his Bachelor's degree and a Master's degree from Tsinghua University in 2000 and 2003, respectively, both in automation. His research interests include machine learning, data mining and information retrieval.

    Xiaofei He received the BS degree in computer science from Zhejiang University, China, in 2000 and the PhD degree in computer science from the University of Chicago, in 2005. He is a professor in the State Key Lab of CAD&CG at Zhejiang University, China. Prior to joining Zhejiang University in 2007, he was a research scientist at Yahoo! Research Labs, Burbank, CA. His research interests include machine learning, information retrieval, and computer vision. He is a senior member of IEEE.
