Neurocomputing

Volume 371, 2 January 2020, Pages 177–187

Finding decision jumps in text classification

https://doi.org/10.1016/j.neucom.2019.08.082

Highlights

  • We propose Jumper, a novel framework that models text classification as a sequential decision process.

  • Experiments show that Jumper makes decisions as soon as the evidence is sufficient, thereby reducing total text reading by 30–40% and often finding the key rationale of the prediction.

  • Jumper achieves classification accuracy better than or comparable to state-of-the-art models on several benchmark and industrial datasets.

  • Jumper is able to make a decision at the theoretically optimal position.

Abstract

Text classification is one of the key problems in natural language processing (NLP); in its early years, it was usually accomplished by feature-based machine learning models. Recently, deep neural networks have become powerful learning machines, making it possible to work with raw text as input for classification. However, existing neural networks are typically end-to-end and lack explicit interpretation of their predictions. In this paper, we propose Jumper, a novel framework that models text classification as a sequential decision process. Generally, Jumper is a neural system that scans a piece of text sequentially and makes classification decisions whenever it wishes, inspired by the cognitive process of human text reading. In our framework, both the classification result and when to make the classification are part of the decision process, controlled by a policy network and trained with reinforcement learning. Experimental results on real-world applications demonstrate the following properties of a properly trained Jumper: (1) it tends to make decisions as soon as the evidence is sufficient, thereby reducing total text reading by 30–40% and often finding the key rationale of the prediction; and (2) it achieves classification accuracy better than or comparable to state-of-the-art models on several benchmark and industrial datasets. We further conduct a simulation experiment with mock data, which confirms that Jumper is able to make a decision at the theoretically optimal position.

Introduction

Natural language understanding plays an important role in various applications, including text classification [14], information extraction [44], and machine comprehension [9], [31]. Recently, neural networks have become a prevailing technique in natural language processing (NLP) and have achieved significant performance in these tasks.

However, previous work mainly focuses on the ultimate performance of a task (e.g., classification accuracy). For example, Kim [14] builds several variants of convolutional neural networks for sentiment classification. It is typically unclear where and how such a model makes its decision, which is in fact important in real industrial applications for debuggability and interpretability [21].

In our paper, we propose a novel framework, Jumper, that models text understanding as a sequential decision process, inspired by the cognitive process of humans. When people read text, they look for clues, perform reasoning, and obtain information from the text. Jumper mimics this process by reading the text in a sentence-by-sentence manner with a neural network. At each sentence, the model makes decisions (also known as actions) based on the input, and at the end of this process, it has gained some “understanding” of the text.

More specifically, our paper focuses on paragraph-level text classification, which may involve more semantic changes than sentence-level classification. When our neural network reads a paragraph, its prediction is assumed to take a default value “None” at the beginning. At each decision step, a sentence of the paragraph is fed to the neural network; the network then decides if it is confident enough to “jump” to a non-default value. We impose a constraint that each jump is a finalized decision, which cannot be updated in the future. Such a decision process is depicted in Fig. 1, and we call our model Jumper.
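
As a minimal sketch of this one-jump reading loop (the threshold rule and all names here are illustrative stand-ins for the trained policy network, not the paper's code), consider:

```python
def jumper_read(sentence_probs, threshold=0.9):
    """Scan per-sentence class distributions in reading order.

    sentence_probs: a list of dicts mapping class label -> confidence,
    as a policy network might output after each sentence. Returns
    (prediction, sentences_read); prediction stays None (the default
    value) if no jump is ever triggered.
    """
    prediction = None                       # default value before any jump
    for t, probs in enumerate(sentence_probs, start=1):
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:               # confident enough: jump
            prediction = label              # finalized; never revised later
            return prediction, t            # remaining sentences are skipped
    return prediction, len(sentence_probs)  # read everything, never jumped

# Example: the decisive evidence appears in the second sentence.
steps = [{"pos": 0.40, "neg": 0.60},
         {"pos": 0.05, "neg": 0.95},
         {"pos": 0.50, "neg": 0.50}]
print(jumper_read(steps))                   # -> ('neg', 2)
```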

Our model is trained by reinforcement learning with only weak supervision. That is to say, we assume our training labels only contain the ultimate results, and no supervision signal is given regarding the step at which the model should make a decision. This also conforms to human reading, as people are typically certain about reading comprehension results, but it is difficult to model how their beliefs change as they read.

An intriguing consequence of the one-jump constraint is that it forces our model to be serious about both when to predict and what to predict, because a paragraph contains no special symbol indicating its end. If our model defers its decision beyond the point at which an accurate enough prediction could have been made, it risks reaching the end of the paragraph without predicting at all. On the other hand, if the model predicts too early, it risks low accuracy. By optimizing the expected reward in reinforcement learning, the model learns to make decisions at an “optimal” time step.
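
As a rough illustration of this training signal, the snippet below computes a REINFORCE-style loss under the simplifying assumption of a +1/-1 terminal reward for a correct/incorrect final prediction; the paper's exact reward shaping may differ.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, correct: bool) -> torch.Tensor:
    """REINFORCE-style loss for one reading trajectory.

    log_probs: log-probabilities of the actions the policy actually took
    after each sentence (including the final jump). The terminal reward
    is spread over the whole trajectory.
    """
    reward = 1.0 if correct else -1.0
    # Maximizing expected reward == minimizing the negative
    # reward-weighted log-likelihood of the taken actions.
    return -(reward * log_probs).sum()
```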

Jumper has the following advantages, compared with traditional end-to-end text classification:

  • Jumper is able to locate the evidence for its classification (if it lies within one or a few sentences), which coincides with recent work on rationalizing neural predictions [15].

  • In those tasks where information is scattered more widely, Jumper learns to make a decision as soon as it is confident enough, making it possible to skip reading the remaining part of a paragraph without loss of accuracy.

To evaluate our approach, we first design a simulation experiment in which we generate mock data from manually specified distributions. The simulation shows that Jumper is able to find the theoretically optimal decision step in an online-prediction fashion.

Then, we apply our model to real-world tasks, including two benchmark datasets (the movie review dataset and the AG’s news corpus), as well as an industrial application of analyzing occupational injury. Experiments show that Jumper achieves comparable or higher classification accuracy compared with strong baselines. Moreover, it reduces the length of text reading by 30–40%, resulting in faster inference. For information extraction-style classification, where the evidence is concentrated in a single sentence, our model is able to find the key rationale without training labels of jumping positions.

Section snippets

Related work

Text classification aims to categorize a piece of text into predefined classes, and is a fundamental problem in natural language processing (NLP), with applications ranging from sentiment analysis [27], [47] to topic classification [36], [45].

The representation of words or sentences plays an important role in text classification. Traditional text classification usually adopts hand-crafted features or feature templates (e.g., bag-of-words features), based on which machine learning models are …

The proposed method

In our approach, a paragraph is segmented into sub-sentences, each of which can be thought of as a basic unit expressing some “proposition” and is fed to Jumper in order. Jumper takes an action after reading each sub-sentence. Fig. 2 provides an overview of Jumper, which has a hierarchical structure as follows.

  • A sentence encoder reads the words in a sentence and extracts the semantic features … (see the sketch below)
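
The following is a plausible PyTorch sketch of this hierarchy, assuming a convolutional sentence encoder feeding a recurrent controller with a policy head over the actions {None, class 1, ..., class K}; the layer types, sizes, and names are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class JumperSketch(nn.Module):
    """Hierarchical reader: sentence encoder -> controller -> policy head."""

    def __init__(self, vocab_size, emb_dim=100, hid_dim=128, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Sentence encoder: 1-D convolution + max-pooling over words.
        self.conv = nn.Conv1d(emb_dim, hid_dim, kernel_size=3, padding=1)
        # Controller: a GRU cell carrying the paragraph-level reading state.
        self.controller = nn.GRUCell(hid_dim, hid_dim)
        # Policy head: distribution over "None" plus the K real classes.
        self.policy = nn.Linear(hid_dim, n_classes + 1)

    def encode_sentence(self, word_ids):             # (batch, n_words)
        emb = self.embed(word_ids).transpose(1, 2)    # (batch, emb_dim, n_words)
        return torch.relu(self.conv(emb)).max(dim=2).values  # (batch, hid_dim)

    def step(self, word_ids, state=None):
        """Read one sub-sentence; return action logits and the new state."""
        state = self.controller(self.encode_sentence(word_ids), state)
        return self.policy(state), state
```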

Simulation experiment

In this section, we present a simulation experiment on mock data to investigate whether Jumper can learn to predict at the optimal time step. We conduct such simulation experiments because it is hard to quantitatively analyze optimality with real data.
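
To make the notion of a “theoretically optimal decision position” concrete, here is a toy calculation of our own construction (not the paper's simulation setup): if deciding after sentence t attains a known accuracy while each sentence read incurs a small cost, the optimal position is the argmax of expected reward.

```python
import numpy as np

# Accuracy attainable if the model decides right after sentence t.
acc = np.array([0.55, 0.70, 0.92, 0.93, 0.93])
cost = 0.02                                # assumed per-sentence reading cost
expected_reward = acc - cost * np.arange(1, len(acc) + 1)
print(int(expected_reward.argmax()))       # -> 2, i.e., the third sentence
```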

Real data experiments

In addition to the simulation experiment, we also evaluate Jumper on three real-world tasks, including two benchmark datasets and one industrial application.

Conclusion and future work

In this paper, we have proposed a novel framework, Jumper, that models text classification as a sequential decision process on a sentence-by-sentence basis when reading a paragraph. We train Jumper by reinforcement learning with a one-jump constraint. Experiments show that Jumper finds the optimal decision position on a synthetic dataset, that it achieves comparable or higher performance than baselines, that it reduces text reading to a large extent, and that it can find the key …

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the anonymous reviewers for their insightful suggestions. This work was supported in part by the National Natural Science Foundation of China (No. 61836004) and the Beijing Brain Science Special Project (No. Z181100001518006). Lili Mou is an Amii Fellow; he also thanks AltaML for support.

References (48)

  • Y. Zhang et al., Rationale-augmented convolutional neural networks for text classification, EMNLP (2016)
  • D. Bahdanau et al., Neural machine translation by jointly learning to align and translate, ICLR (2015)
  • R. Bellman, A Markovian decision process, J. Math. Mech. (1957)
  • Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Neural Networks: Tricks of the Trade (2012)
  • P. Blunsom et al., A convolutional neural network for modelling sentences, ACL (2014)
  • K. Cho et al., Learning phrase representations using RNN encoder–decoder for statistical machine translation, EMNLP (2014)
  • J. Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding, NAACL-HLT (2019)
  • K. Greff et al., LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. (2015)
  • K. He et al., Deep residual learning for image recognition, CVPR (2016)
  • X. He et al., Character-level question answering with attention, EMNLP (2016)
  • B. Hu et al., Convolutional neural network architectures for matching natural language sentences, NIPS (2014)
  • K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Documentation (1972)
  • A. Joulin et al., Bag of tricks for efficient text classification, EACL (2017)
  • I. Kanaris et al., Words versus character n-grams for anti-spam filtering, Int. J. Artif. Intell. Tools (2007)
  • Y. Kim, Convolutional neural networks for sentence classification, EMNLP (2014)
  • T. Lei et al., Rationalizing neural predictions, EMNLP (2016)
  • L. Li et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method, Bioinformatics (2001)
  • Z. Lin et al., A structured self-attentive sentence embedding, ICLR (2017)
  • X. Liu et al., Jumper: learning when to make classification decisions in reading, IJCAI (2018)
  • L.M. Manevitz et al., One-class SVMs for document classification, J. Mach. Learn. Res. (2001)
  • C. Manning et al., Introduction to Information Retrieval (2008)
  • G. Marcus, Deep learning: a critical appraisal, arXiv preprint arXiv:1801.00631
  • T. Mikolov et al., Efficient estimation of word representations in vector space, ICLR (2013)
  • T. Mikolov et al., Distributed representations of words and phrases and their compositionality, NIPS (2013)

Xianggen Liu is currently a Ph.D. student at Tsinghua University. He received his BS degree from the School of Computer Science and Engineering, University of Electronic Science and Technology of China. He mainly focuses on neural networks and machine learning, and their applications in natural language processing (NLP) problems such as text classification, tagging, parsing, and reasoning.

Lili Mou is an assistant professor at the Department of Computing Science, University of Alberta. He received his BS and Ph.D. degrees from the School of EECS, Peking University. After that, he worked as a postdoctoral fellow at the University of Waterloo and as a research scientist at Adeptmind (a startup in Toronto, Canada). His research interests include deep learning applied to natural language processing as well as programming language processing. He has publications at top conferences and journals such as AAAI, ACL, CIKM, COLING, EMNLP, ICASSP, ICML, IJCAI, INTERSPEECH, NAACL-HLT, and TACL.

Haotian Cui received his BS and Master's degrees in 2015 and 2018, respectively, from the Department of Biomedical Engineering, Tsinghua University. His research interests relate to NLP and data mining.

Zhengdong Lu is the founder and CTO of DeeplyCurious (a startup in Beijing, China). He received his Ph.D. degree from Oregon Health & Science University in 2008 and then worked as a postdoctoral researcher at the University of Texas at Austin for several years. Before founding DeeplyCurious, he worked as an associate researcher at MSRA and as a senior researcher at Noah's Ark Lab, Huawei. Dr. Lu has published more than 60 top conference and journal articles and has served as a reviewer for several international venues (NIPS, ICML, and IEEE Transactions on PAMI). His research interests include machine learning, deep learning, natural language understanding, and data mining.

Sen Song is a principal investigator at the Department of Biomedical Engineering, Tsinghua University. He received his Ph.D. degree from Brandeis University in the USA and worked for several years at the Massachusetts Institute of Technology as a postdoctoral fellow. Since joining Tsinghua, he has also been increasingly interested in brain-inspired neural networks and spiking neural networks.

This paper is an extension of Liu et al. [18]. The code of our work is available at: https://github.com/Liuxg16/jumper-codes
