Knowledge-Based Systems

Volume 212, 5 January 2021, 106548
ASRNN: A recurrent neural network with an attention model for sequence labeling

https://doi.org/10.1016/j.knosys.2020.106548

Abstract

Natural language processing (NLP) is useful for handling text and speech, and sequence labeling plays an important role by automatically analyzing a sequence (text) and assigning category labels to each of its parts. However, the performance of conventional sequence labeling models depends greatly on hand-crafted features and task-specific knowledge, which makes such systems time consuming to develop. Several conditional random fields (CRF)-based models for sequence labeling have been presented, but a major open question is how to use neural networks to extract useful representations for each unit or segment of the input sequence. In this paper, we propose an attention segmental recurrent neural network (ASRNN) that relies on a hierarchical attention neural semi-Markov conditional random fields (semi-CRF) model for the task of sequence labeling. Our model uses a hierarchical structure to incorporate character-level and word-level information and applies an attention mechanism at both levels, which enables it to distinguish more important information from less important information when constructing segmental representations. We evaluated our model on three sequence labeling tasks: named entity recognition (NER), chunking, and reference parsing. Experimental results show that the proposed model benefits from the hierarchical structure and achieves competitive and robust performance on all three tasks.

Introduction

Natural language processing (NLP) is useful for handling text and speech. Within NLP, sequence labeling is the important task of identifying and assigning category labels to each unit or subsequence of a given input. Due to its role in several downstream tasks, including relation extraction [1], [2], entity linking [3], and coreference resolution [4], it has received substantial attention for several decades. Some conventional sequence labeling models, including conditional random fields (CRF) [5] and maximum entropy models (MEM) [6], establish the conditional probability over the input sequence by analyzing individual input units (i.e., characters or words). Other approaches, known as segmentation models, including the semi-Markov conditional random fields (semi-CRF) model [7], analyze the input at the segment (i.e., subsequence) level. Compared with sequence labeling models, segmentation models capture more segment-level features (i.e., segment length, boundary words, etc.) without being limited by local label dependencies. However, the performance of these conventional models depends greatly on hand-crafted features and task-specific knowledge. In practice, it is time consuming to develop such systems, and the performance of a developed system often declines when it is applied to a new domain or task. Therefore, researchers have proposed neural network CRF-based models for sequence labeling [8], [9], [10], [11]. Instead of conventional hand-crafted binary features, these neural CRF methods provide continuous features, require no feature engineering, and offer stable performance across a variety of problems. Compared with conventional CRF models, neural networks are capable of modeling long-term contextual dependencies in the sequence and of learning distributed representations from unlabeled training data. A key problem in neural CRF models is how to use neural networks to extract useful representations for each unit or segment in the input sequence.
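To make the unit-level versus segment-level distinction concrete, a sequence labeling model assigns one label per token (e.g., BIO tags), whereas a segmentation model such as a semi-CRF reasons directly over labeled spans. The following minimal sketch is our own illustration, not code from the paper; it converts per-token BIO tags into the equivalent segment-level view:

```python
# Minimal illustration (not from the paper): a unit-level model assigns one
# BIO tag per token, whereas a segmentation model scores labeled segments
# (start, end, label) directly. This helper converts BIO tags to segments.
from typing import List, Tuple

def bio_to_segments(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-token BIO tags into (start, end, label) segments (end exclusive)."""
    segments = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and (label is None or tag[2:] != label)
        ):
            if label is not None:
                segments.append((start, i, label))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    if label is not None:
        segments.append((start, len(tags), label))
    return segments

# "Barack Obama visited Oslo" tagged at the unit level ...
print(bio_to_segments(["B-PER", "I-PER", "O", "B-LOC"]))
# ... corresponds to the segment-level view: [(0, 2, 'PER'), (3, 4, 'LOC')]
```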

In this paper, we propose a hierarchical attention neural semi-CRF model for sequence labeling. The proposed model uses a neural-based semi-CRF structure to handle varied sequence labeling tasks without hand-crafted features or task-specific knowledge. It can not only extract neural features automatically but also easily incorporate hand-crafted sparse CRF/semi-CRF features. In addition, the proposed model balances these two types of features by tuning a learnable parameter during the training phase. The proposed model also incorporates an attention mechanism that attends differentially to the input characters/words, which allows the model to automatically find the most informative characters/words in the input sequences. To capture features at different levels, the proposed model has three layers, from the bottom to the top: a character-level encoder, a word-level encoder, and a semi-CRF layer. The first two encoders extract the corresponding character-level and word-level features with the attention mechanism. The upper semi-CRF layer incorporates the features at different levels into one graphical model and jointly models the probability distribution over label sequences. An attention segmental recurrent neural network (ASRNN) is used in the lowest character-level encoder layer to extract character-level representations. The output of the character-level encoder is combined with pre-trained word embeddings and fed into the middle word-level encoder to extract segment representations. The middle encoder also uses an ASRNN, which is capable of attending differentially to each word when constructing segmental representations via the attention mechanism. The extracted segment representations are passed to the top semi-CRF layer to jointly decode the best label sequence. Our model requires no feature engineering or data processing, making it applicable to a wide variety of sequence labeling tasks. We conducted extensive experiments on three sequence labeling tasks: named entity recognition (NER), chunking, and reference parsing. Experimental results show that our model benefits from the hierarchical structure and offers robust performance over a variety of sequence labeling tasks. Our main contributions can be summarized as follows:

  • We propose an ASRNN that automatically constructs segment representations by using an attention mechanism to attend differentially to individual characters and words (a minimal sketch of this idea is given after this list).

  • We introduce an end-to-end hierarchical attention neural semi-CRF model, which can effectively extract word-level and segment-level neural network features. Moreover, to further enhance performance, sparse CRF and semi-CRF features can easily be utilized in the proposed model. We also utilize learnable weights in the objective function to automatically balance neural features and sparse features.

  • The designed ASRNN model needs no data processing or task-specific feature engineering, and it shows competitive and robust performance for several different sequence labeling tasks.
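As a rough sketch of the attention-weighted segment representations referred to above (the scoring function, dimensions, and parameter names below are illustrative assumptions, not the paper's exact formulation), a segment representation can be computed as a softmax-weighted sum of the word-level hidden states inside the segment:

```python
# Illustrative sketch only: build a segment representation as an
# attention-weighted sum of the encoder hidden states inside the segment.
# The scoring function and dimensions are assumptions, not the paper's equations.
import numpy as np

def attention_segment_repr(hidden: np.ndarray, start: int, end: int,
                           v: np.ndarray, W: np.ndarray) -> np.ndarray:
    """hidden: (seq_len, d) word-level hidden states; the segment spans [start, end)."""
    h_seg = hidden[start:end]                 # (k, d) states inside the segment
    scores = np.tanh(h_seg @ W.T) @ v         # (k,) unnormalized attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the segment's words
    return weights @ h_seg                    # (d,) weighted segment representation

rng = np.random.default_rng(0)
d = 8
hidden = rng.normal(size=(5, d))              # e.g. BiLSTM outputs for 5 words
v, W = rng.normal(size=d), rng.normal(size=(d, d))
print(attention_segment_repr(hidden, 1, 4, v, W).shape)   # (8,)
```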

Section snippets

HMM-based model

The hidden Markov model (HMM) and its extensions were proposed by Baum et al. [12], [13], [14], [15], [16]; an HMM can be represented as a dynamic Bayesian network. When applying an HMM to sequence labeling, the states (i.e., labels) are hidden, while the outputs (i.e., words or segments), which depend on the states, are observed. Each state has a probability distribution over the possible output tokens, called the emission probabilities. Each state also has a probability distribution over transitions to the next state.
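For a self-contained illustration of emission and transition probabilities in HMM-based labeling (the label set and probability values below are invented for the example and are not taken from the paper), Viterbi decoding recovers the most likely hidden label sequence:

```python
# Toy HMM for sequence labeling (illustrative numbers only):
# states are hidden labels, observations are words, and decoding uses Viterbi.
import numpy as np

states = ["O", "ENT"]                              # hypothetical label set
start = np.log(np.array([0.8, 0.2]))               # initial state probabilities
trans = np.log(np.array([[0.9, 0.1],               # P(next state | current state)
                         [0.5, 0.5]]))
emit = {                                           # P(word | state), toy values
    "the":  np.log(np.array([0.30, 0.01])),
    "acme": np.log(np.array([0.01, 0.40])),
    "corp": np.log(np.array([0.01, 0.40])),
}

def viterbi(words):
    """Return the most probable state sequence under the toy HMM."""
    dp = start + emit[words[0]]
    back = []
    for w in words[1:]:
        scores = dp[:, None] + trans + emit[w][None, :]   # indexed (from, to)
        back.append(scores.argmax(axis=0))
        dp = scores.max(axis=0)
    path = [int(dp.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi(["the", "acme", "corp"]))            # e.g. ['O', 'ENT', 'ENT']
```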

Semi-CRF

The semi-CRF models provide variable-length segmentations of the label sequence. Compared with the traditional CRF, a semi-CRF model is capable of capturing more segment-level information (i.e., boundary word features, segment length features, etc.). A semi-CRF models the conditional probability of a possible output segmentation s over the input sequence x as
$$p(s \mid x) = \frac{1}{Z(x)} \exp\{W \cdot G(x, s)\},$$
where G(x, s) is the feature function, W is the weight vector, and Z(x) is the normalization factor over all possible segmentations of x.
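As a simplified sketch of how the normalization factor Z(x) can be computed by dynamic programming over segments (our own illustration under a maximum-segment-length assumption, with a toy stand-in for the learned score W·G(x, s); it is not the paper's implementation):

```python
# Simplified sketch of the semi-CRF forward recursion: alpha[j] accumulates the
# (log-space) scores of all segmentations of the first j tokens, enumerating
# segments of length up to max_len. The segment scorer is a stand-in for
# W . G(x, s) restricted to one labeled segment.
import math
from itertools import product

def log_partition(n_tokens, labels, score_segment, max_len=4):
    """log Z(x): sum over all segmentations/labelings of a length-n input."""
    NEG_INF = float("-inf")
    alpha = [NEG_INF] * (n_tokens + 1)
    alpha[0] = 0.0
    for j in range(1, n_tokens + 1):
        terms = []
        for length, label in product(range(1, min(max_len, j) + 1), labels):
            i = j - length                        # the segment covers tokens [i, j)
            terms.append(alpha[i] + score_segment(i, j, label))
        m = max(terms)                            # log-sum-exp over all ways to end at j
        alpha[j] = m + math.log(sum(math.exp(t - m) for t in terms))
    return alpha[n_tokens]

def toy_score(i, j, label):
    # Made-up numbers standing in for the learned score of one labeled segment.
    return 0.5 * (j - i) if label == "ENT" else 0.2

print(log_partition(5, ["O", "ENT"], toy_score))
```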

Proposed models for sequence prediction

This section introduces the proposed hierarchical attention neural semi-Markov conditional random fields model. The proposed model is shown in Fig. 1; it has three layers, from the bottom to the top: a character-level encoder, a word-level encoder, and a semi-CRF layer. The bottom character-level encoder is used to extract character-level information for each word. The extracted character-level information (i.e., an n-dimensional vector, where n is a hyperparameter) is concatenated with the word embedding and passed to the word-level encoder.
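A minimal PyTorch-style sketch of this hierarchical layout is given below; the layer sizes, the choice of BiLSTM encoders, and the omission of the attention and semi-CRF components are our simplifying assumptions for illustration only.

```python
# Minimal sketch of the hierarchical layout (assumed sizes, BiLSTM encoders,
# and a placeholder for the attention/semi-CRF parts described in the paper).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=25, char_hidden=25,
                 word_dim=100, word_hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_rnn = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)          # e.g. pre-trained vectors
        self.word_rnn = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                bidirectional=True, batch_first=True)

    def forward(self, char_ids, word_ids):
        # char_ids: (num_words, max_word_len), word_ids: (num_words,)
        _, (h_n, _) = self.char_rnn(self.char_emb(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)          # (num_words, 2*char_hidden)
        word_input = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        word_states, _ = self.word_rnn(word_input.unsqueeze(0))  # (1, num_words, 2*word_hidden)
        return word_states.squeeze(0)   # fed to the attention / semi-CRF layers above

model = HierarchicalEncoder(n_chars=80, n_words=1000)
chars = torch.randint(0, 80, (6, 10))       # 6 words, up to 10 characters each
words = torch.randint(0, 1000, (6,))
print(model(chars, words).shape)            # torch.Size([6, 200])
```

The word-level states returned here would then feed the attention-based segment representations and the semi-CRF decoding described in the remainder of this section.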

Experimental evaluation

We evaluated our ASRNN model on three NLP sequence labeling tasks: named entity recognition (NER), chunking, and reference parsing.

Conclusion and future work

In this paper, we have addressed the problem of incorporating character-level information into a neural semi-CRF model and proposed an attention segmental recurrent neural network (ASRNN) based on a hierarchical attention neural semi-CRF model for the task of sequence labeling. In conventional CRF-based models, character-level information, such as prefix and suffix features, has been shown to be quite effective; the proposed ASRNN model extracts such features automatically. Empirical results on NER, chunking, and reference parsing show that the proposed model achieves competitive and robust performance across these sequence labeling tasks.

CRediT authorship contribution statement

Jerry Chun-Wei Lin: Development of conceptualization, Methodology, Formal analysis. Yinan Shao: Development of conceptualization, Methodology, Formal analysis. Youcef Djenouri: Experimental validation. Unil Yun: Formal review and editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (52)

  • D. Yu et al., Using continuous features in the maximum entropy model, Pattern Recognit. Lett. (2009)
  • P. Gupta, B. Andrassy, Table filling multi-task recurrent neural network for joint entity and relation extraction, in: ...
  • M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: ...
  • S. Guo, M.W. Chang, E. Kiciman, To link or not to link? A study on end-to-end Tweet entity linking, in: Annual ...
  • J. Lu, D. Venugopal, V. Gogate, V. Ng, Joint inference for event coreference resolution, in: International Committee on ...
  • J.D. Lafferty, A. McCallum, F.C.N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling ...
  • A.L. Berger et al., A maximum entropy approach to natural language processing, Comput. Linguist. (1996)
  • S. Sarawagi, W.W. Cohen, Semi-Markov conditional random fields for information extraction, in: The Annual Conference on ...
  • L. Kong, C. Dyer, N.A. Smith, Segmental recurrent neural networks, in: The International Conference on Learning ...
  • X. Ma et al., End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF (2016)
  • M. Rei, G.K.O. Crichton, S. Pyysalo, Attending to characters in neural sequence labeling models, arXiv:1611.04361, ...
  • J. Zhuo et al., Segment-level sequence modeling using gated recursive semi-Markov conditional random fields (2016)
  • L.E. Baum et al., Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Stat. (1966)
  • L.E. Baum et al., An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology, Bull. Amer. Math. Soc. (1967)
  • L.E. Baum et al., Growth transformations for functions on manifolds, Pacific J. Math. (1968)
  • L.E. Baum et al., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Stat. (1970)
  • L.E. Baum, An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, Inequalities (1972)
  • S. Fine et al., The hierarchical hidden Markov model: Analysis and applications (1998)
  • H.P. Zhang, Q. Liu, X.Q. Cheng, H. Zhang, H.K. Yu, Chinese lexical analysis using hierarchical hidden Markov model, The ...
  • D. Shen, J. Zhang, G. Zhou, J. Su, C.L. Tan, Effective adaptation of a hidden Markov model-based named entity ...
  • J.H. Lim, Y.S. Hwang, S.Y. Park, H.C. Rim, Semantic role labeling using maximum entropy model, in: The Conference on ...
  • W. Sun et al., The integration of dependency relation classification and semantic role labeling using bilayer maximum entropy Markov models (2008)
  • A. Ratnaparkhi, A maximum entropy model for part-of-speech tagging, in: The Conference on Empirical Methods in Natural ...
  • D.S. Rosenberg et al., Mixture-of-parents maximum entropy Markov models (2007)
  • A.O. Muis, W. Lu, Weak semi-Markov CRFs for noun phrase chunking in informal text, in: The Conference of the North ...
  • H. Zhao, C.N. Huang, M. Li, T. Kudo, An improved Chinese word segmentation system with conditional random field, in: ...