Abstractive summarization of long texts by representing multiple compositionalities with temporal hierarchical pointer generator network
Introduction
Summarization occupies a central role in information processing of various kinds. It has been an important challenge in a number of domains, including the natural language processing (NLP) and information retrieval communities. There are two main approaches to summarization: extraction and abstraction (Hahn & Mani, 2000). Extractive summarization selects words, phrases, or sentences from the text and concatenates them into a summary, whereas abstractive summarization generates novel sentences from information extracted from the text (Carenini & Cheung, 2008). Extractive summarization is largely a selection problem and, in most cases, is computationally simpler than abstractive summarization, which requires both a knowledge modeling process and a language generation process (Radev et al., 2003) that does not rely on extraction from the source. Owing to this complexity, conventional abstractive summarization research has mostly been limited to short texts, and summarization of longer texts is still far from satisfactory.
Recent developments in deep neural networks (Schmidhuber, 2015) have enabled major advances in computer vision and a range of other artificial intelligence applications. Convolutional neural networks (CNNs) have outperformed humans in object recognition tasks (He, Zhang, Ren, & Sun, 2015). In recent years, NLP has also seen a number of advances in fully data-driven approaches, such as neural language models (Mikolov, Karafiát, Burget, Cernocký, & Khudanpur, 2010) and distributed representation models such as word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). These may offer fresh perspectives on how knowledge can be modeled computationally. In particular, recurrent neural network (RNN) encoder–decoders with attention have been widely adopted for sequence-to-sequence natural language tasks. Tasks such as machine translation, speech recognition, and text classification have seen great success with the help of CNNs, RNNs, and sequence-to-sequence models. These tasks need to learn alignment and efficient feature extraction but require less reasoning, understanding of compositionality, and inference capability. The recently introduced Transformer model (Vaswani et al., 2017) has demonstrated its capabilities in machine translation without using RNNs. Bidirectional Encoder Representations from Transformers (BERT) (Devlin, Chang, Lee, & Toutanova, 2019) has improved the efficiency of feature extraction, giving impressive results on tasks such as text classification. However, it remains unclear how best to utilize pre-trained language models for generation tasks such as abstractive text summarization. Rush, Chopra, and Weston (2015) addressed sentence compression using an RNN encoder–decoder with attention, which consists of two RNNs that handle knowledge modeling of the text and language generation, respectively. RNN encoder–decoders have been widely used for longer-sequence problems.
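The attention mechanism underlying these encoder–decoder models can be sketched minimally as follows (a generic dot-product scoring variant with toy vectors, not any specific paper's formulation): the decoder state is scored against every encoder state, and the softmax-normalized scores weight a context vector.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Dot-product attention over encoder hidden states.

    scores_i = s . h_i,  a = softmax(scores),  context = sum_i a_i * h_i
    """
    scores = encoder_states @ decoder_state   # one score per source position, shape (T,)
    scores -= scores.max()                    # shift for numerical stability
    a = np.exp(scores) / np.exp(scores).sum() # attention weights, sum to 1
    context = a @ encoder_states              # weighted sum of encoder states, shape (d,)
    return a, context

# Toy example: three encoder states of dimension 2 and one decoder state.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
s = np.array([2.0, 0.0])
a, ctx = attention_context(s, H)
```

Positions whose encoder state aligns with the decoder state receive proportionally more weight, which is what lets the decoder "look back" at the relevant part of the source at each generation step.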
These models have since been extended for summarization (Nallapati et al., 2016; See et al., 2017) with additional mechanisms to handle known issues such as the rare-word problem, but abstractive summarization of long text remains a far-from-resolved topic of research.
We argue that, in order to solve tasks like abstractive summarization, a biologically inspired model is essential. While there are differing views on whether deep neural network models can be said to model brain function, or are only based on it in a loose, non-technical sense, it is nevertheless a notable trend that the biggest successes in the field so far have come from algorithms that bear a strong relation to the brain. Work on bridging the gap between deep neural networks and biological neural networks is only beginning but may lead to important and principled insights about both in the future. Even though artificial neural networks are inspired by the brain, current models are designed with specific engineering goals rather than with the intent to model brain computations (Kriegeskorte, 2015). Nevertheless, initial studies comparing the internal representations of these models with those of the brain find surprisingly similar representational spaces. We argue that a multilayer recurrent neural network likewise exhibits biologically plausible representations in a hierarchical order, and we posit that this representation can be further improved if the network is designed to take further inspiration from the architecture of the brain's hierarchy.
In this paper, we explore ways of enhancing recent summarization models by approaching abstractive summarization as a knowledge modeling problem inspired by the human brain. We explore the organization of multilayer gated recurrent neural network architectures in NLP tasks and show that their pattern of decreasing-timescale hierarchical organization strongly resembles recent significant findings from neuroscience (Ding, Melloni, Zhang, Tian, & Poeppel, 2016) on the temporal-hierarchical properties of biological neural networks in encoding language. Our work strongly parallels this groundbreaking study, which found strong evidence for neural tracking of hierarchical linguistic structures in the brain. First, we adopt a temporal hierarchy concept with the multiple timescale gated recurrent unit (MTGRU) by implementing multiple timescales in different layers of the RNN (Kim et al., 2016; Singh & Lee, 2017; Singh et al., 2017). In addition to our previous works, Zhong, Cangelosi, and Ogata (2017) have independently verified the enhanced ability of the MTGRU to handle long-term dependencies on different tasks. The authors analyzed the fast and slow networks in the MTGRU and concluded that the temporal hierarchy is necessary for learning long-term dependencies in large-dimension multi-modal datasets. However, optimally setting the timescale value for each layer can be difficult. Therefore, second, we develop a temporal hierarchical structure in which the timescales are learned automatically during training, adapting each layer to a level of composition in language. This idea is also supported by the concept of temporal hierarchy found in the human brain (Botvinick, 2007; Meunier et al., 2009) for handling multiple levels of compositionality. We demonstrate that our model is capable of understanding the semantics of multi-sentence, longer source texts and of identifying what is important in them.
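As an illustration of the multiple-timescale idea, the following is a minimal NumPy sketch (the class name, weight initialization, and fixed-τ setting are our illustrative assumptions, not the authors' implementation): each layer's hidden state is a leaky integrator whose timescale τ interpolates between the previous state and the standard GRU update, so a larger τ makes the layer evolve more slowly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MTGRUCell:
    """GRU cell with a per-layer timescale tau (leaky integration).

    h_t = (1 - 1/tau) * h_{t-1} + (1/tau) * h_gru,
    where h_gru is the usual GRU update. tau = 1 recovers a plain GRU;
    larger tau makes the hidden state change more slowly, suiting higher
    layers that track sentence- or paragraph-level structure.
    """

    def __init__(self, input_size, hidden_size, tau=1.0, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_size, input_size + hidden_size)
        self.Wz = rng.normal(0.0, 0.1, shape)  # update-gate weights
        self.Wr = rng.normal(0.0, 0.1, shape)  # reset-gate weights
        self.Wh = rng.normal(0.0, 0.1, shape)  # candidate weights
        self.tau = tau

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h])) # candidate state
        h_gru = (1.0 - z) * h + z * h_cand                     # standard GRU update
        # Leaky integration with timescale tau (multiple-timescale part).
        return (1.0 - 1.0 / self.tau) * h + (1.0 / self.tau) * h_gru

# Two cells identical except for tau, stepped once from a zero state:
fast = MTGRUCell(3, 4, tau=1.0)
slow = MTGRUCell(3, 4, tau=4.0)
slow.Wz, slow.Wr, slow.Wh = fast.Wz, fast.Wr, fast.Wh  # share weights
h0 = np.zeros(4)
x = np.ones(3)
h_fast = fast.step(x, h0)
h_slow = slow.step(x, h0)  # moves only 1/4 as far from the zero state
```

Stacking such cells with increasing τ per layer (e.g., 1, 2, 4) yields the slow-layers-on-top temporal hierarchy described above; in the adaptive variant, 1/τ would instead be a trainable parameter of each layer.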
This is the first necessary step towards abstractive summarization. We build our hierarchical model with timescale adaptation on top of the pointer–generator–coverage network (See et al., 2017), which alleviates the rare-word and repetition problems in RNNs. We evaluate our model on an Introduction-Abstract pair summarization dataset built from scientific articles, as well as on the CNN/Daily Mail summarization dataset (Hermann et al., 2015).
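The pointer–generator mechanism referenced above mixes a vocabulary distribution with a copy distribution derived from attention over the source tokens. A minimal sketch of the final-distribution computation (toy vocabulary sizes and hand-set probabilities; illustrative only, not the trained model):

```python
import numpy as np

def pointer_generator_dist(p_vocab, attention, src_ids, p_gen, extended_size):
    """Final output distribution of a pointer-generator network.

    p_final(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: src_i = w} a_i
    Source-only (out-of-vocabulary) words get ids >= len(p_vocab) in an
    extended vocabulary, so they can still be produced by copying.
    """
    p_final = np.zeros(extended_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab   # generation path
    for a_i, w in zip(attention, src_ids):
        p_final[w] += (1.0 - p_gen) * a_i       # copy path: attention mass
    return p_final

# Toy example: vocabulary of 4 words; the source contains word 1 twice
# and one OOV word assigned extended id 4 (copy-only).
p_vocab = np.array([0.1, 0.4, 0.3, 0.2])
attention = np.array([0.5, 0.3, 0.2])  # over 3 source positions
src_ids = [1, 4, 1]
p = pointer_generator_dist(p_vocab, attention, src_ids,
                           p_gen=0.8, extended_size=5)
```

Because the OOV word receives probability only through the copy path, the model can emit rare source words that a fixed-vocabulary decoder would be forced to replace with an unknown token.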
Section snippets
Related works
In this section, we discuss in detail the existing research in the fields of automatic summarization, distributed representation techniques for modeling knowledge, and biologically inspired temporal hierarchy.
Data acquisition and preprocessing
The WMT datasets, which consist of tens of millions of translation pairs, played a crucial role in making fully data-driven machine translation possible in Cho et al. (2014). However, automatic summarization lacks such a large dataset. The main datasets for automatic summarization are the DUC datasets (Jones, 2007). Other recent papers have used twitter-like summary to title dataset (Hu et al., 2015) or a news dataset consisting of short news and headlines (Rush et al., 2015). Because
The proposed model
In this section, we elaborate on the proposed models for long text summarization. The architecture of the proposed automatic summarization system is shown in Fig. 1.
Experiments and results
We evaluated our model using two summarization corpora. The first experiment is performed in order to demonstrate that our model can handle longer text for summarization using the Introduction-Abstract summary pair described in Section 3, and to analyze the representation of compositionalities in the multilayer recurrent neural network with and without temporal hierarchies and report the merits of the proposed multiple timescales approach. The second experiment is performed to compare the
Discussion
In summarization, due to the very long text inputs, which may consist of multiple paragraphs, we need to handle the vanishing information issue carefully. The input length of our long text summarization Introduction-Abstract data consists of 922 tokens on average. The recently introduced BERT model is currently fixed to a maximum input sequence length of 512 tokens, hence making it impossible to be used for this particular task. The conventional recurrent networks including LSTM and GRU can
Conclusion
In this paper, we have demonstrated the capability of the proposed temporal hierarchical network, which uses the multiple timescales with adaptation, in the multi-sentence text abstractive summarization task. We confirmed the hypothesis that a temporal hierarchy in a deep network can represent compositionality better, thereby enhancing the ability to handle longer sequences of text for summarization. The temporal hierarchy, which is being trained to represent the compositions in each layer,
Acknowledgments
This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2016-0-00564, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding) (50%) and by the Technology Innovation Program: Industrial Strategic Technology Development Program (10073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (50%).
References (49)
- et al., How to improve text summarization and classification by mutual cooperation on an integrated framework, Expert Systems with Applications (2016)
- et al., Diversifying customer review rankings, Neural Networks (2015)
- et al., Motor primitive and sequence self-organization in a hierarchical recurrent neural network, Neural Networks (2004)
- Deep learning in neural networks: An overview, Neural Networks (2015)
- et al., A topic modeling based approach to novel document automatic summarization, Expert Systems with Applications (2017)
- et al., Text summarization using unsupervised deep learning, Expert Systems with Applications (2017)
- et al., Neural machine translation by jointly learning to align and translate, CoRR (2014)
- Multilevel structure in behaviour and in the brain: A model of Fuster's hierarchy, Philosophical Transactions of the Royal Society, Series B (Biological Sciences) (2007)
- et al., Extractive vs. NLG-based abstractive summarization of evaluative text: The effect of corpus controversiality
- Chen, Y.-C., & Bansal, M., Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings...
- On the properties of neural machine translation: Encoder–decoder approaches, CoRR
- Learning phrase representations using RNN encoder–decoder for statistical machine translation, CoRR
- Empirical evaluation of gated recurrent neural networks on sequence modeling, CoRR
- Document embedding with paragraph vectors, CoRR
- BERT: Pre-training of deep bidirectional transformers for language understanding
- Cortical tracking of hierarchical linguistic structures in connected speech, Nature Neuroscience
- Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research (JMLR)
- Overcoming the lack of parallel data in sentence compression
- The challenges of automatic summarization, Computer
- Adaptive learning of linguistic hierarchy in a multiple timescale recurrent neural network
- Teaching machines to read and comprehend
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies
- LCSTS: A large scale Chinese short text summarization dataset
Cited by (18)
- Long short-term memory with activation on gradient, 2023, Neural Networks
- Learning to Generate Tips from Song Reviews, 2023, Neural Networks
- Hierarchical and lateral multiple timescales gated recurrent units with pre-trained encoder for long text classification, 2021, Expert Systems with Applications. Citation excerpt: "Previous works have applied this hierarchical structure to RNNs in movement tracking (Paine & Tani, 2004), sensorimotor control systems (Yamashita & Tani, 2008) and speech recognition (Heinrich et al., 2012). The hierarchical MTGRU demonstrates the ability to capture multiple compositionalities similar to the findings of Ding et al. (2016) as shown by Moirangthem and Lee (2020). This better representation learning capability enhances the ability of the network to model longer sequences of text."
- DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding, 2023, Scientific Data
- Graph-Based Extractive Text Summarization Sentence Scoring Scheme for Big Data Applications, 2023, Information (Switzerland)
- A board game to improve freshmen on computer networks: Beyond layers abstraction, 2023, Education and Information Technologies