Elsevier

Expert Systems with Applications

Volume 42, Issue 21, 30 November 2015, Pages 8146-8155
Query-oriented unsupervised multi-document summarization via deep learning model

https://doi.org/10.1016/j.eswa.2015.05.034

Highlights

  • First attempt to apply deep learning to query-oriented multi-document summarization.

  • A novel algorithm pushes out important concepts layer by layer effectively.

  • Excellent extraction ability is confirmed under an unsupervised learning framework.

Abstract

Capturing the compositional process from words to documents is a key challenge in natural language processing and information retrieval. Extractive query-oriented multi-document summarization generates a summary by extracting a proper set of sentences from multiple documents based on a pre-given query. This paper proposes a novel document summarization framework based on a deep learning model, which has shown outstanding extraction ability in many real-world applications. The framework consists of three parts: concept extraction, summary generation, and reconstruction validation. A new query-oriented extraction technique is proposed to extract information distributed across multiple documents. The whole deep architecture is then fine-tuned by minimizing the information loss in reconstruction validation. Based on the concepts extracted layer by layer from the deep architecture, dynamic programming is used to seek the most informative set of sentences for the summary. Experiments on three benchmark datasets (DUC 2005, 2006, and 2007) assess and confirm the effectiveness of the proposed framework and algorithms. The results show that the proposed method outperforms state-of-the-art extractive summarization approaches. Moreover, we also provide a statistical analysis of query words based on Amazon’s Mechanical Turk (MTurk) crowdsourcing platform, which suggests that there exist underlying relationships between topic words and content that can contribute to the summarization task.

Introduction

Automatically generating summaries from large text corpora has long attracted research attention from both the information retrieval and natural language processing communities; the earliest studies can be traced back to the 1950s and 1960s (Baxendale, 1958, Edmundson, 1969, Luhn, 1958). Automatic summarization creates shortened versions of texts to help users catch the important information in the original text at bearable time cost (Khanpour, 2009). Currently, the creation of summaries is a task best handled by humans; however, with the explosion of textual data, especially in the big data era, it is no longer financially possible, or even feasible, to produce all types of summaries by hand. Earlier studies on text summarization aimed at summarizing pre-given documents without further requirements, which is usually referred to as generic summarization (Berger & Mittal, 2000). With the development of information retrieval, the query-oriented summarization task, which requires summarizing a set of documents to answer a pre-given query, has attracted more and more attention (Tang, Yao, & Chen, 2009). According to the size of the input, text summarization tasks can be grouped into single-document and multi-document summarization (Shen, Sun, Li, Yang, & Chen, 2007). Based on the writing style of the output summary, summarization techniques can be divided into extractive and abstractive approaches (Song et al., 2011, Wong et al., 2008). Owing to the limitations of current natural language generation techniques, extractive approaches are the mainstream in the field. An extractive approach selects a number of indicative text fragments from the input documents to form a summary, instead of re-writing an abstract (Chen, Yang, Zha, Zhang, & Zhang, 2008), under a budget constraint. A budget constraint is natural in the summarization task, as the length of the summary is often restricted (Lin & Bilmes, 2010).
In this paper, we adopt the extractive style to develop techniques for query-oriented multi-document summarization.

Almost all extractive summarization methods face two key problems: how to rank textual units, and how to select a subset of those ranked units (Jin, Huang, & Zhu, 2010). The first, on ranking, requires systems to model the relevance of a textual unit to a topic or query. The second, on selection, requires systems to improve diversity or remove redundancy so that more relevant information can be covered by the summary within a limited length.

Approaches to sentence ranking vary. Some are based on surface features (Luhn, 1958, Radev et al., 2004), some on graphs (Wan, 2009, Wan and Xiao, 2009, Wei et al., 2010), and some on supervised learning (Cao et al., 2007, Ouyang et al., 2011). After obtaining a list of ranked sentences, it is then important to select a subset of them to form a good summary that covers diverse information within a length limit. Goldstein, Mittal, Carbonell, and Kantrowitz (2000) were among the first to propose global models using the maximum marginal relevance (MMR) criterion, which scores each candidate sentence as a weighted combination of its relevance minus its redundancy with the sentences already in the summary. Greedy MMR-style algorithms are currently the standard in document summarization. McDonald (2007) proposed replacing the greedy search of MMR with a globally optimal formulation: the basic MMR framework can be expressed as a knapsack packing problem, and an integer linear programming (ILP) solver can be used to maximize the resulting objective function.
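The greedy MMR criterion described above can be sketched in a few lines. The following is a minimal illustration, not any cited system's implementation: the similarity functions, the trade-off weight `lam`, and the word-count budget are all illustrative assumptions.

```python
def mmr_select(sentences, sim_to_query, sim, budget, lam=0.7):
    """Greedy MMR: repeatedly pick the sentence that best balances
    relevance to the query against redundancy with the summary so far,
    skipping sentences that would exceed the word budget."""
    summary, remaining, length = [], list(range(len(sentences))), 0
    while remaining:
        def score(i):
            # Redundancy = max similarity to any already-selected sentence.
            redundancy = max((sim(i, j) for j in summary), default=0.0)
            return lam * sim_to_query(i) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        words = len(sentences[best].split())
        if length + words <= budget:
            summary.append(best)
            length += words
    return [sentences[i] for i in summary]
```

A simple word-overlap similarity is enough to exercise the selection logic; real systems would use stronger relevance models.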

This paper presents a new extractive method for summarizing documents using deep learning techniques. Deep learning models a learning task with deep architectures composed of multiple layers of parameterized nonlinear modules, and such models have proved outstanding at feature extraction from visual data. To our knowledge, this is the first attempt to utilize deep learning for the query-oriented multi-document summarization task. Unlike existing methods, we neither directly rank textual units by their relevance to the topic or query, nor directly improve diversity or remove redundancy. The proposed deep learning algorithm proceeds in three stages: concept extraction, reconstruction validation, and summary generation. In the concept extraction stage, hidden layers abstract the documents layer by layer using a greedy layer-wise extraction algorithm. The reconstruction validation stage then reconstructs the data distribution by fine-tuning the whole deep architecture globally. Finally, dynamic programming (DP) is utilized to maximize the importance of the summary under the length constraint. A novel framework with several new algorithms is presented in the following sections.
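The final DP stage amounts to a 0/1 knapsack problem: maximize the total importance of selected sentences subject to a length budget. A hedged sketch of that idea follows; the per-sentence importance scores and word lengths are assumed inputs, and this is an illustration of the general technique rather than the paper's exact algorithm.

```python
def dp_summary(importance, lengths, budget):
    """0/1 knapsack by dynamic programming: choose a subset of sentences
    maximizing total importance, with total length at most `budget` words.
    Returns (best score, indices of chosen sentences)."""
    # best[w] = (best score, chosen indices) using at most w words.
    best = [(0.0, []) for _ in range(budget + 1)]
    for i in range(len(importance)):
        # Iterate budgets downward so each sentence is used at most once.
        for w in range(budget, lengths[i] - 1, -1):
            cand = best[w - lengths[i]][0] + importance[i]
            if cand > best[w][0]:
                best[w] = (cand, best[w - lengths[i]][1] + [i])
    return best[budget]
```

The table has O(n × budget) entries, so the optimization is exact and still cheap for summary-sized budgets, unlike a general ILP solve.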

Section snippets

Related work on deep learning

Unlike shallow learning models, deep learning learns multiple levels of representation and abstraction so as to extract richer meaning from data. Besides evidence from neuroscience, theoretical analyses in machine learning have also confirmed that deep models are more compact and expressive than shallow models in representing most learning functions, especially highly variable ones. Many empirical validations have also been reported supporting that deep architectures are promising in solving

Basic idea of proposed model

Humans have little difficulty summarizing documents based on given queries. Query-oriented multi-document summarization, however, has remained a well-known challenge in natural language processing despite fifteen years of extensive research. In the evaluations of the summarization tasks at the Document Understanding Conference (DUC), the summaries created by human peers are much better than those extracted automatically. Motivated by this fact, we aimed at designing a proper deep

Evaluation setup

In this section, we evaluate the proposed method on the multi-document summarization tasks of the Document Understanding Conference (DUC), using three open benchmark datasets: DUC 2005, DUC 2006, and DUC 2007. There are 50 topics in DUC 2005, 50 in DUC 2006, and 45 in DUC 2007. Each DUC topic consists of a topic description and a relevant document set. The components of the topics are almost the same, except for the source and number of documents. Each DUC 2005 topic includes 25–50

Conclusions

In this paper, we proposed a novel deep learning model for query-oriented multi-document summarization. The main contributions of this work are as follows: (1) it is the first attempt to apply deep learning methods to the query-oriented multi-document summarization task; (2) by inheriting the outstanding abstraction ability of deep learning methods, a novel framework is proposed to push out important concepts layer by layer effectively; (3) under an unsupervised learning framework, the

Acknowledgment

This research was supported by the National Natural Science Foundation of China (NSFC) under Grant 61373122.

References (48)

  • Z. Cao et al. Learning to rank: from pairwise approach to listwise approach.
  • Z. Cao et al. Ranking with recursive neural networks and its application to multi-document summarization.
  • L. Cao et al. Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression.
  • E.K. Chen et al. Learning object classes from image thumbnails through deep neural networks.
  • G. Dahl et al. Generating more realistic images using gated MRF’s.
  • H.T. Dang (2005). Overview of DUC 2005. In Proceedings of DUC 2005.
  • M. Denil, A. Demiraj, & N. de Freitas (2014). Extraction of salient sentences from labelled documents. arXiv preprint.
  • H.P. Edmundson (1969). New methods in automatic extracting. Journal of the ACM.
  • D.J. Felleman et al. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex.
  • E. Filatova & V. Hatzivassiloglou (2004). A formal model for information selection in multisentence text extraction.
  • J. Goldstein et al. Multi-document summarization by sentence extraction.
  • G.E. Hinton (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.
  • G.E. Hinton (2010). A practical guide to training restricted Boltzmann machines. Technical report, UTML TR.
  • G.E. Hinton et al. (2006). A fast learning algorithm for deep belief nets. Neural Computation.