Query-oriented unsupervised multi-document summarization via deep learning model
Introduction
Automatically generating summaries from large text corpora has long attracted research attention in both information retrieval and natural language processing, with studies dating back to the 1950s and 1960s (Baxendale, 1958, Edmundson, 1969, Luhn, 1958). Automatic summarization creates shortened versions of texts that help users grasp the important information in the original text at an acceptable time cost (Khanpour, 2009). Today, the creation of summaries is still a task best handled by humans; however, with the explosion of textual data in the big data era, it is no longer financially feasible to produce all types of summaries by hand.
Earlier studies on text summarization aimed at summarizing pre-given documents without further requirements, which is usually referred to as generic summarization (Berger & Mittal, 2000). With the development of information retrieval, the query-oriented summarization task, which requires summarizing a set of documents to answer a pre-given query, has attracted increasing attention (Tang, Yao, & Chen, 2009). According to the size of the input, text summarization tasks can be grouped into single-document and multi-document summarization (Shen, Sun, Li, Yang, & Chen, 2007). Based on the writing style of the output summary, summarization techniques can be divided into extractive and abstractive approaches (Song et al., 2011, Wong et al., 2008). Owing to the limitations of current natural language generation techniques, extractive approaches are the mainstream in the field. Rather than re-writing an abstract, an extractive approach selects a number of indicative text fragments from the input documents to form a summary (Chen, Yang, Zha, Zhang, & Zhang, 2008) under a budget constraint; such a constraint is natural in summarization, as the length of the summary is often restricted (Lin & Bilmes, 2010).
In this paper, we adopt the extractive style to develop techniques for query-oriented multi-document summarization.
Almost all extractive summarization methods face two key problems: how to rank textual units, and how to select a subset of those ranked units (Jin, Huang, & Zhu, 2010). The first, ranking, requires a system to model the relevance of a textual unit to a topic or query. The second, selection, requires a system to improve diversity or remove redundancy so that the summary covers more relevant information within a limited length.
Approaches to sentence ranking vary: some are based on surface features (Luhn, 1958, Radev et al., 2004), some on graphs (Wan, 2009, Wan and Xiao, 2009, Wei et al., 2010), and some on supervised learning (Cao et al., 2007, Ouyang et al., 2011). After a list of ranked sentences is obtained, a subset must be selected to form a good summary that covers diverse information within a length limit. Goldstein, Mittal, Carbonell, and Kantrowitz (2000) were among the first to propose global models using the maximum marginal relevance (MMR) criterion, which scores each candidate sentence as a weighted combination of its relevance and its redundancy with sentences already in the summary. Greedy MMR-style algorithms remain the standard in document summarization. McDonald (2007) proposed replacing the greedy search of MMR with a globally optimal formulation, in which the basic MMR framework is expressed as a knapsack packing problem and an integer linear program (ILP) solver maximizes the resulting objective function.
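As a concrete illustration of the greedy MMR criterion described above, the sketch below selects sentences under a word budget. The relevance scores, the similarity function, and the trade-off parameter `lam` are illustrative assumptions, not values from any cited system.

```python
# Minimal sketch of greedy MMR selection under a length budget.
# All names (mmr_select, relevance, similarity) are illustrative.

def mmr_select(sentences, relevance, similarity, budget, lam=0.7):
    """Greedily pick sentences that maximize
    lam * relevance - (1 - lam) * (max similarity to the summary so far),
    subject to a total word budget."""
    summary, used = [], 0
    candidates = list(range(len(sentences)))
    while candidates:
        best, best_score = None, float("-inf")
        for i in candidates:
            if used + len(sentences[i].split()) > budget:
                continue  # would exceed the length constraint
            redundancy = max((similarity(sentences[i], sentences[j])
                              for j in summary), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        if best is None:  # nothing fits anymore
            break
        summary.append(best)
        used += len(sentences[best].split())
        candidates.remove(best)
    return [sentences[i] for i in summary]
```

The redundancy term is what distinguishes MMR from plain relevance ranking: a highly relevant sentence is skipped if it mostly repeats what the summary already says.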
This paper presents a new extractive method for summarizing documents using deep learning techniques. Deep learning addresses a learning task with deep architectures composed of multiple layers of parameterized nonlinear modules; such models have proved outstanding at feature extraction from visual data. To our knowledge, this is the first attempt to apply deep learning to the query-oriented multi-document summarization task. Unlike existing methods, we neither directly rank the textual units by relevance to the topic or query, nor directly improve diversity or remove redundancy. The proposed algorithm proceeds in three stages: concept extraction, reconstruction validation, and summary generation. In the concept extraction stage, hidden layers abstract the documents layer by layer with a greedy layer-wise extraction algorithm. The reconstruction validation stage fine-tunes the whole deep architecture globally so that it can reconstruct the data distribution. Finally, dynamic programming (DP) maximizes the importance of the summary under the length constraint. A novel framework with several new algorithms is developed in the remainder of this paper.
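The summary-generation stage described above can be sketched as a 0/1-knapsack dynamic program that maximizes total importance under a word budget. This is a generic sketch, not the paper's exact formulation: the importance scores stand in for the concept-based scores the deep model would produce, and the function name `dp_select` is an assumption.

```python
# Generic 0/1-knapsack DP for summary selection: maximize summed
# sentence importance subject to a total-length budget.

def dp_select(lengths, importance, budget):
    """Return indices of sentences maximizing summed importance
    with total length <= budget."""
    n = len(lengths)
    # best[b] = (best score, chosen indices) achievable with budget b
    best = [(0.0, []) for _ in range(budget + 1)]
    for i in range(n):
        # iterate budgets downward so each sentence is used at most once
        for b in range(budget, lengths[i] - 1, -1):
            cand = best[b - lengths[i]][0] + importance[i]
            if cand > best[b][0]:
                best[b] = (cand, best[b - lengths[i]][1] + [i])
    return best[budget][1]
```

Unlike the greedy MMR search, this DP is globally optimal for its additive objective, which is the same observation that motivates the ILP formulation of McDonald (2007).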
Section snippets
Related work on deep learning
Different from shallow learning models, deep learning learns multiple levels of representation and abstraction so as to extract richer meaning from data. Besides evidence from neuroscience, theoretical analyses in machine learning have also confirmed that deep models are more compact and expressive than shallow models in representing most learning functions, especially highly variable ones. Many empirical validations also support the claim that deep architectures are promising in solving
Basic idea of proposed model
Humans have little difficulty summarizing documents in response to given queries. Query-oriented multi-document summarization, however, has remained a well-known challenge in natural language processing despite fifteen years of extensive research. In the evaluations of the summarization tasks at the Document Understanding Conference (DUC), the summaries created by human peers are much better than those extracted automatically. Motivated by this fact, we aimed at designing a proper deep
Evaluation setup
In this section, we conduct several experiments for the multi-document summarization task of the Document Understanding Conference (DUC) on three open benchmark datasets: DUC 2005, DUC 2006, and DUC 2007. There are 50 topics in DUC 2005, 50 in DUC 2006, and 45 in DUC 2007. Each DUC topic consists of a topic description and a relevant document set; the topics are almost identical in structure except for the source and number of documents. Each DUC 2005 topic includes 25–50
Conclusions
In this paper, we proposed a novel deep learning model for query-oriented multi-document summarization. The main contributions of this work are as follows: (1) our work is the first attempt to apply deep learning methods to the query-oriented multi-document summarization task; (2) by inheriting the outstanding abstraction ability of deep learning methods, a novel framework is proposed that extracts important concepts layer by layer effectively; (3) under an unsupervised learning framework, the
Acknowledgment
This research was supported by the National Natural Science Foundation of China (NSFC) under Grant 61373122.
References (48)
Learning multiple layers of representation. Trends in Cognitive Sciences (2007).
Applying regression models to query-focused multi-document summarization. Information Processing and Management (2011).
Centroid-based summarization of multiple documents. Information Processing and Management (2004).
Fuzzy evolutionary optimization modeling and its applications to unsupervised categorization and extractive summarization. Expert Systems with Applications (2011).
Deep networks for audio event classification in soccer videos.
Neocortex size and behavioural ecology in primates. Proceedings of the Royal Society of London B: Biological Sciences (1996).
A document rating system for preference judgements.
Machine-made index for technical literature—An experiment. IBM Journal of Research and Development (1958).
Query-relevant summarization using FAQs.
Crowdsourcing as a model for problem solving: An introduction and cases. Convergence (2008).
Learning to rank: From pairwise approach to listwise approach.
Ranking with recursive neural networks and its application to multi-document summarization.
Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression.
Learning object classes from image thumbnails through deep neural networks.
Generating more realistic images using gated MRF's.
New methods in automatic extracting. Journal of the ACM.
Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex.
Multi-document summarization by sentence extraction.
Training products of experts by minimizing contrastive divergence. Neural Computation.
A fast learning algorithm for deep belief nets. Neural Computation.