
Neural Networks

Volume 124, April 2020, Pages 1-11

Abstractive summarization of long texts by representing multiple compositionalities with temporal hierarchical pointer generator network

https://doi.org/10.1016/j.neunet.2019.12.022

Abstract

In order to tackle the problem of abstractive summarization of long multi-sentence texts, it is critical to construct an efficient model that can learn and represent multiple compositionalities well. In this paper, we introduce a temporal hierarchical pointer generator network that can represent multiple compositionalities in order to handle longer sequences of text with a deep structure. We demonstrate how a multilayer gated recurrent neural network organizes itself with the help of an adaptive timescale in order to represent the compositions. The temporal hierarchical network is implemented with a multiple timescale architecture in which the timescale of each layer is also learned during training through error backpropagation through time. We evaluate our proposed model on an Introduction-Abstract summarization dataset from scientific articles and on the CNN/Daily Mail summarization benchmark dataset. The results illustrate that the multiple timescale with adaptation concept yields a successful summary generation system for long texts. We also show that our proposed model improves summary generation on the benchmark dataset.

Introduction

Summarization occupies a central role in information processing of various kinds. It has been an important challenge in a number of domains, including the natural language processing (NLP) and information retrieval communities. There are two main approaches to the task of summarization: extraction and abstraction (Hahn & Mani, 2000). Extractive summarization involves selecting words, phrases, or sentences from the text and concatenating them into a summary, whereas abstractive summarization involves generating novel sentences from information extracted from the text (Carenini & Cheung, 2008). Extractive summarization is largely a selection problem and, in most cases, is computationally simpler than abstractive summarization, which requires both a knowledge modeling process and a language generation process (Radev et al., 2003) that does not rely on extracting from the source. Due to this complexity, conventional abstractive summarization research has mostly been limited to short texts, and summarization of longer texts is still far from satisfactory.

Recent developments in deep neural networks (Schmidhuber, 2015) have enabled major advances in computer vision and a range of other artificial intelligence applications. Convolutional neural networks (CNNs) have outperformed humans in object recognition tasks (He, Zhang, Ren, & Sun, 2015). In recent years, NLP has also seen a number of advances in fully data-driven approaches, such as neural language models (Mikolov, Karafiát, Burget, Cernocký, & Khudanpur, 2010) and distributed representation models such as word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). These may offer fresh perspectives on how knowledge can be modeled computationally. In particular, recurrent neural network (RNN) encoder–decoders with attention have been widely adopted for sequence-to-sequence natural language tasks. Tasks such as machine translation, speech recognition, and text classification have seen great success with the help of CNNs, RNNs, and sequence-to-sequence models. These tasks need to learn alignment and efficient feature extraction but require less reasoning, understanding of compositionality, and inference capability. The recently introduced Transformer model (Vaswani et al., 2017) has demonstrated its capabilities in machine translation without using RNNs. Bidirectional Encoder Representations from Transformers (BERT) (Devlin, Chang, Lee, & Toutanova, 2019) has improved the efficiency of feature extraction, giving impressive results on tasks such as text classification. However, it remains unclear how to best utilize pre-trained language models for generation tasks like abstractive text summarization. Rush, Chopra, and Weston (2015) addressed sentence compression using an RNN encoder–decoder with attention, which consists of two RNNs that handle knowledge modeling of the text and language generation, respectively. RNN encoder–decoders have been widely used for handling longer sequence problems. These models have since been extended for summarization (Nallapati et al., 2016, See et al., 2017) with additional mechanisms that handle known issues such as the rare word problem, but abstractive summarization of long text remains a far-from-resolved topic of research.
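To make the encoder–decoder-with-attention setup concrete, the following minimal sketch (our illustration, not the exact formulation of any of the cited models; the weight matrices W_h, W_s, the vector v, and all dimensions are placeholder assumptions) computes Bahdanau-style additive attention over the encoder states for a single decoder step, producing the context vector that the generation RNN conditions on:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(enc_states, dec_state, W_h, W_s, v):
    """Bahdanau-style additive attention for one decoder step (illustrative sketch).

    enc_states: (T, d_enc) encoder hidden states over the source text
    dec_state:  (d_dec,)   current decoder hidden state
    Returns attention weights over source positions and the context vector.
    """
    # Alignment scores e_i = v^T tanh(W_h h_i + W_s s_t)
    scores = np.tanh(enc_states @ W_h.T + dec_state @ W_s.T) @ v   # (T,)
    attn = softmax(scores)                                          # (T,)
    context = attn @ enc_states                                     # (d_enc,)
    return attn, context

# Toy usage with placeholder dimensions.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 6, 8, 8, 10
enc_states = rng.standard_normal((T, d_enc))
dec_state = rng.standard_normal(d_dec)
W_h = rng.standard_normal((d_att, d_enc))
W_s = rng.standard_normal((d_att, d_dec))
v = rng.standard_normal(d_att)
attn, context = additive_attention(enc_states, dec_state, W_h, W_s, v)
```

In a full summarizer, this context vector is combined with the decoder state at every step to predict the next summary token; pointer-generator variants additionally reuse the attention weights as a copy distribution over source tokens.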

We argue that, in order to solve tasks like abstractive summarization, a biologically inspired model is essential. While there are differing views on whether these deep neural network models can be said to model brain function, or are only based on it in a loose, non-technical sense, it is nevertheless a notable trend that the biggest successes in the field so far have come from algorithms that bear a strong relation to the brain. Work on bridging the gap between deep neural networks and biological neural networks is only beginning but may lead to important and principled insights about both in the future. Even though artificial neural networks are inspired by the brain, current models are designed with specific engineering goals and not with the intent of modeling brain computations (Kriegeskorte, 2015). Nevertheless, initial studies comparing the internal representations of these models with those of the brain find surprisingly similar representational spaces. We argue that a multilayer recurrent neural network also shares biologically plausible representations in a hierarchical order, and we posit that this representation can be further improved if the network is designed to take further inspiration from the architecture of the brain's hierarchy.

In this paper, we explore ways of enhancing recent summarization models by approaching abstractive summarization as a knowledge modeling problem with inspiration from the human brain. We explore the organization of multilayer gated recurrent neural network architectures in NLP tasks and show that the pattern of decreasing-timescale hierarchical organization bears a strong resemblance to recent significant findings from neuroscience (Ding, Melloni, Zhang, Tian, & Poeppel, 2016) about the temporal-hierarchical properties of biological neural networks in encoding language. Our work strongly parallels this groundbreaking study, which showed strong evidence for neural tracking of hierarchical linguistic structures in the brain. First, we adopt a temporal hierarchy concept with the multiple timescale gated recurrent unit (MTGRU) by implementing multiple timescales in different layers of the RNN (Kim et al., 2016, Singh and Lee, 2017, Singh et al., 2017). In addition to our previous work, Zhong, Cangelosi, and Ogata (2017) have independently verified the enhanced ability of the MTGRU to handle long-term dependencies on different tasks. The authors analyzed the fast and slow networks in the MTGRU and concluded that it is necessary for learning long-term dependencies in high-dimensional multi-modal datasets. However, optimally setting the timescale values for each layer can be difficult. Therefore, second, we develop a temporal hierarchical structure in which the timescales are learned automatically during training so that each layer adapts to a level of composition in language. This idea is also supported by the concept of temporal hierarchy found in the human brain (Botvinick, 2007, Meunier et al., 2009) for handling multiple levels of compositionality. We demonstrate that our model is capable of understanding the semantics of long, multi-sentence source texts and of knowing what is important about them, which is the first necessary step towards abstractive summarization. We build our hierarchical model with timescale adaptation on top of the pointer–generator–coverage network (See et al., 2017), which alleviates the rare word problem and the repetition problem in RNNs. We evaluate our model with an Introduction-Abstract pair summarization dataset from scientific articles as well as with the CNN/Daily Mail summarization dataset (Hermann et al., 2015).
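As a concrete illustration of the multiple timescale idea, the sketch below shows a single step of an MTGRU layer, assuming the common leaky-integrator formulation in which the ordinary GRU update is low-pass filtered by a per-layer timescale τ ≥ 1 (larger τ means slower state change). The weight names, dimensions, and fixed τ values are illustrative assumptions only; in the adaptive model proposed in this paper, the timescales are additionally learned through error backpropagation through time rather than fixed by hand.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mtgru_step(x, h_prev, params, tau):
    """One step of a multiple timescale GRU (MTGRU) layer.

    The ordinary GRU update is low-pass filtered by the timescale tau >= 1,
    so a layer with a larger tau updates its state more slowly and can hold
    a longer-term summary of its input (a sketch, not the paper's exact code).
    """
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    h_gru = (1.0 - z) * h_prev + z * h_cand        # standard GRU update
    # Leaky integration: only a 1/tau fraction of the new update is applied.
    return (1.0 / tau) * h_gru + (1.0 - 1.0 / tau) * h_prev

def make_params(d_in, d_hid, rng):
    W = lambda: 0.1 * rng.standard_normal((d_hid, d_in))
    U = lambda: 0.1 * rng.standard_normal((d_hid, d_hid))
    return (W(), U(), W(), U(), W(), U())

# Toy two-layer hierarchy: a fast lower layer (tau = 1 behaves like a plain
# GRU) feeding a slow upper layer (tau = 4) that tracks coarser structure.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 6
params_fast = make_params(d_in, d_hid, rng)
params_slow = make_params(d_hid, d_hid, rng)
h_fast, h_slow = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.standard_normal((10, d_in)):          # a toy input sequence
    h_fast = mtgru_step(x, h_fast, params_fast, tau=1.0)
    h_slow = mtgru_step(h_fast, h_slow, params_slow, tau=4.0)
```

The design choice here is that the hierarchy comes from the integration rate alone: both layers share the same gating equations, and only τ differs, which is what allows higher layers to represent slower, more abstract compositions of the text.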

Section snippets

Related works

In this section, we discuss in detail the existing research in the field of automatic summarization, distributed representation techniques for modeling knowledge, and biologically inspired temporal hierarchy.

Data acquisition and preprocessing

The WMT datasets, which consist of tens of millions of translation pairs, played a crucial role in making fully data-driven machine translation possible in Cho et al. (2014). However, automatic summarization lacks such a large dataset. The main datasets for automatic summarization are the DUC datasets (Jones, 2007). Other recent papers have used a Twitter-like summary-to-title dataset (Hu et al., 2015) or a news dataset consisting of short news articles and headlines (Rush et al., 2015). Because

The proposed model

In this section, we elaborate on the proposed models for long text summarization. The architecture of the proposed automatic summarization system is shown in Fig. 1.

Experiments and results

We evaluated our model using two summarization corpora. The first experiment is performed to demonstrate that our model can handle longer text for summarization, using the Introduction-Abstract summary pairs described in Section 3, and to analyze the representation of compositionalities in the multilayer recurrent neural network with and without temporal hierarchies, reporting the merits of the proposed multiple timescales approach. The second experiment is performed to compare the

Discussion

In summarization, because of the very long text inputs, which may consist of multiple paragraphs, we need to handle the vanishing information issue carefully. The inputs in our long text Introduction-Abstract summarization data contain 922 tokens on average. The recently introduced BERT model is currently fixed to a maximum input sequence length of 512 tokens, making it impossible to use for this particular task. The conventional recurrent networks including LSTM and GRU can

Conclusion

In this paper, we have demonstrated the capability of the proposed temporal hierarchical network, which uses multiple timescales with adaptation, in the multi-sentence abstractive text summarization task. We confirmed the hypothesis that a temporal hierarchy in a deep network can represent compositionality better, thereby enhancing the ability to handle longer sequences of text for summarization. The temporal hierarchy, which is trained to represent the compositions in each layer,

Acknowledgments

This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2016-0-00564, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding) (50%) and by the Technology Innovation Program: Industrial Strategic Technology Development Program (10073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (50%).

References (49)

  • Cho, K., et al. (2014). On the properties of neural machine translation: Encoder–decoder approaches. CoRR.
  • Cho, K., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. CoRR.
  • Chung, J., et al. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR.
  • Dai, A. M., et al. (2015). Document embedding with paragraph vectors. CoRR.
  • Devlin, J., et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Ding, N., et al. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience.
  • Duchi, J., et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR).
  • Filippova, K., et al. Overcoming the lack of parallel data in sentence compression.
  • Hahn, U., et al. (2000). The challenges of automatic summarization. Computer.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on...
  • Heinrich, S., et al. Adaptive learning of linguistic hierarchy in a multiple timescale recurrent neural network.
  • Hermann, K. M., et al. Teaching machines to read and comprehend.
  • Hochreiter, S., et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
  • Hu, B., et al. LCSTS: A large scale Chinese short text summarization dataset.
