
1 Introduction

As a fundamental task of intelligent question answering systems, named entity recognition (NER) is a highly technical aspect of NLP and plays a significant role in accurately identifying nouns with special meanings in text, such as names of people, places, and organizations. The purpose of NER is to provide this kind of information for automatic question answering systems, intelligent customer service solutions, automatic summarization, and other NLP tasks. Initially, NER depended on expert rules, i.e., rules formulated by domain experts and scholars. This approach required avoiding conflicts between different rules and therefore demanded considerable time and effort to establish them; coupled with the difficulty of transferring rules across fields, it did not work well for complicated tasks. To address this problem, researchers began to use machine learning techniques in place of hand-crafted rules to perform NER.

The machine learning methods used for sequence labeling include conditional random fields (CRF), transfer learning (TL), and hidden Markov models (HMM). Researchers have recently treated NER as a sequence labeling task that determines the category of each input word or phrase, identifying entities through feature extraction. Xu et al. [1], for example, downloaded news articles from NASA's official website and extracted 36-dimensional features, including lexical, morphological, and contextual features. They adopted the BIO scheme to tag the beginning, inside, and outside of an entity and used a CRF model for entity recognition. Li et al. [2], based on the features of Word, Part of Speech (POS), Left bound + Right bound (LB + RB), Radical (Rad), and Numeral (Num), employed a CRF approach to recognize named entities for crops, diseases, and pesticides. They compared three schemes involving single features, non-contextual feature combinations, and contextual feature combinations, and concluded that contextual feature combinations produce the best results and have an advantage in certain research areas.

Nevertheless, given the diverse and complex nature of research work, collecting samples for novel fields may not be cost-effective. Wang et al. [3] applied TrAdaBoost to improve the transferability of algorithms, using a non-political news corpus for entity recognition in political news texts. Traditionally, tagging relies on manually selecting a set of task-oriented feature templates, whose quality determines the result of the labeling task. Researchers need intimate knowledge of both the relevant field and linguistics, which demands considerable time and effort. Zhang et al. [4], using the SBEIO scheme, took the combination of character and phrase vectors as the input of a three-layer neural network and then applied the Viterbi algorithm to the network output to obtain the best labeling result. However, they did not differentiate the effects of character vectors and phrase vectors on NER.

Recent years have witnessed a wide range of applications of attention mechanisms, especially in machine translation, image captioning, and speech recognition [5]. Wang et al. [6] presented an LSTM model with vectorized extracted features and, on the basis of this model, performed semantic relation extraction using attention-induced local phrase vectors.

Using word2vec to vectorize characters and phrases, this paper integrates an attention mechanism with a Bi-LSTM for Chinese NER. The approach captures both local and global features and can therefore substantially improve the effectiveness of Chinese NER.

2 Bi-LSTM Model Construction Based on an Attention Mechanism

Named entity recognition can be abstracted as a label prediction problem and, depending on the unit of sentence segmentation, can be divided into character label prediction and word label prediction. Character label prediction takes the character as the basic unit of sentence segmentation and labels it with the SBMEO scheme [7]: "S" means a single character forms a named entity, "B" marks the first character of a named entity, "M" a middle character, "E" the last character, and "O" indicates that the character does not belong to any named entity. Word label prediction takes the word as the basic unit and assigns an entity tag to each word after segmentation; for example, a non-named entity is labeled "O", a person's name "PERSON", and an organization "ORGANIZATION". The characters and words obtained from the two segmentation modes are vectorized to form the character vector matrix \( {\mathbf{A}}_{c} \) and the phrase vector matrix \( {\mathbf{A}}_{w} \) of the input text. The two annotation schemes are vectorized to obtain the character tagging matrix \( {\mathbf{P}}_{c} \) and the word tagging matrix \( {\mathbf{P}}_{w} \).
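As a small, hedged illustration of the two tagging granularities (the sentence, its segmentation, and its labels are invented for this example and are not taken from the paper's corpus), the schemes might look as follows:

```python
# Illustrative only: the two tagging granularities described above.
# Character-level tags use the SBMEO scheme; word-level tags use entity types.

sentence = "张三在北京大学工作"  # "Zhang San works at Peking University"

# Character label prediction: one SBMEO tag per character.
char_tags = [
    ("张", "B"), ("三", "E"),                            # two-character person name
    ("在", "O"),
    ("北", "B"), ("京", "M"), ("大", "M"), ("学", "E"),  # four-character organization
    ("工", "O"), ("作", "O"),
]

# Word label prediction: one entity tag per segmented word.
word_tags = [
    ("张三", "PERSON"),
    ("在", "O"),
    ("北京大学", "ORGANIZATION"),
    ("工作", "O"),
]
```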

2.1 Word Vector

Word vector models fall into one-hot representation and distributed representation. A one-hot vector has a length equal to the dictionary size; exactly one component is 1 and the rest are 0. Such vectors are high-dimensional and sparse and cannot represent the similarity between words. Distributed representation was proposed by Hinton in 1985. The idea is to abstract the dictionary into a vector space in which every word is a point expressed by a fixed-length vector; each dimension of the vector represents a potential feature of the word, capturing its grammatical and semantic properties [8].

Word2vec is a tool released by Google in 2013 for training word vectors. The framework includes CBOW and Skip-gram. CBOW uses context words to predict the current word and works better when the corpus is on the order of hundreds of megabytes; Skip-gram is the opposite, predicting the context from the current word, and is suitable for smaller corpora [9].
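As a minimal sketch of how such vectors can be trained (using the gensim library; the corpora, dimensions, and window size here are illustrative assumptions rather than the paper's exact settings):

```python
# Train character and word vectors with gensim's word2vec implementation (gensim 4.x API).
from gensim.models import Word2Vec

# Hypothetical toy corpora: each sentence is a list of tokens.
char_corpus = [list("北京大学位于北京"), list("张三在北京工作")]             # character-segmented
word_corpus = [["北京大学", "位于", "北京"], ["张三", "在", "北京", "工作"]]  # word-segmented

# CBOW (sg=0) predicts the current token from its context.
char_model = Word2Vec(char_corpus, vector_size=100, window=5, min_count=1, sg=0)

# Skip-gram (sg=1) predicts the context from the current token.
word_model = Word2Vec(word_corpus, vector_size=100, window=5, min_count=1, sg=1)

# The trained vector tables play the role of the dictionaries DC and DW used below.
print(char_model.wv["北"].shape)        # (100,)
print(word_model.wv["北京大学"].shape)  # (100,)
```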

We used word2vec to train the character dictionary DC and the word dictionary DW. Given a Chinese sentence S, it is composed of n characters and can be expressed as C[1:n], or of m words and represented as W[1:m]. The character vector matrix \( {\text{S}}_{{{\text{A}}_{c} }} \) and the phrase vector matrix \( {\text{S}}_{{{\text{A}}_{w} }} \) of S are shown in formulas (1) and (2).

$$ {\text{S}}_{{{\text{A}}_{c} }} = [DC_{{C_{1} }}^{T} , \cdots ,DC_{{C_{n} }}^{T} ] $$
(1)
$$ {\text{S}}_{{{\text{A}}_{w} }} = [DW_{{W_{1} }}^{T} , \cdots ,DW_{{W_{m} }}^{T} ] $$
(2)

Character tagging has i labels and word tagging has j labels. The character tagging matrix \( S_{{{\mathbf{P}}_{c} }} \) and the word tagging matrix \( S_{{{\mathbf{P}}_{w} }} \) of S can be obtained by one-hot encoding; their dimensions are \( {\mathbf{i}} \times {\mathbf{n}} \) and \( {\mathbf{j}} \times {\mathbf{m}} \), respectively.
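The construction of these one-hot tagging matrices can be sketched as follows (the label inventories are illustrative assumptions; the paper's exact label sets may differ):

```python
# Build the one-hot tagging matrices S_Pc (i x n) and S_Pw (j x m).
import numpy as np

def one_hot_matrix(tags, label_set):
    """Return a |label_set| x len(tags) matrix: labels as rows, positions as columns."""
    index = {label: row for row, label in enumerate(label_set)}
    matrix = np.zeros((len(label_set), len(tags)))
    for col, tag in enumerate(tags):
        matrix[index[tag], col] = 1.0
    return matrix

char_labels = ["S", "B", "M", "E", "O"]                    # i = 5 character labels
word_labels = ["O", "PERSON", "ORGANIZATION", "LOCATION"]  # j = 4 word labels (illustrative)

S_Pc = one_hot_matrix(["B", "E", "O", "B", "M", "M", "E", "O", "O"], char_labels)  # 5 x 9
S_Pw = one_hot_matrix(["PERSON", "O", "ORGANIZATION", "O"], word_labels)           # 4 x 4
```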

2.2 Bi-LSTM Model

The recurrent neural network (RNN) is a neural network model for sequence annotation. By adding a self-connected hidden layer across time steps, the model gains a certain memory ability. However, an RNN cannot effectively handle long-distance dependencies and suffers from vanishing and exploding gradients. The LSTM model improves on the traditional RNN by using memory cells in place of its hidden units. This improvement allows the LSTM to memorize context over a longer range than a traditional RNN and alleviates the vanishing and exploding gradient problems. A traditional recurrent network reads the input from one end of the sequence to the other, so at any time it holds only information about the current and past time steps. LSTM has been widely used in many areas of natural language processing [10,11,12]. In the LSTM model, long- and short-term memory is realized through the input gate, output gate, and forget gate; its structure is shown in Fig. 1.

Fig. 1. LSTM single node structure

The following formulas show how the memory cell is updated at time step t:

$$ i_{t} = \sigma (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i} ) $$
(3)
$$ \tilde{C}_{t} = \tanh (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c} ) $$
(4)
$$ f_{t} = \sigma (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f} ) $$
(5)
$$ C_{t} = i_{t} *\tilde{C}_{t} + f_{t} *C_{t - 1} $$
(6)
$$ o_{t} = \sigma (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o} ) $$
(7)
$$ h_{t} = o_{t} *\tanh (C_{t} ) $$
(8)

Here \( x_{t} \) is the input at time step t, \( W_{i} ,W_{c} ,W_{f} ,W_{o} ,U_{i} ,U_{c} ,U_{f} ,U_{o} \) are weight matrices, \( b_{i} ,b_{c} ,b_{f} ,b_{o} \) are bias vectors, and \( \sigma \) is the sigmoid function.
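To make the update concrete, the following NumPy sketch implements Eqs. (3)–(8) directly; the dimensions and random parameters are illustrative only, not the paper's trained model:

```python
# One LSTM time step following Eqs. (3)-(8).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    W_i, U_i, b_i = params["i"]  # input gate parameters
    W_c, U_c, b_c = params["c"]  # candidate cell-state parameters
    W_f, U_f, b_f = params["f"]  # forget gate parameters
    W_o, U_o, b_o = params["o"]  # output gate parameters

    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)      # Eq. (3)
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # Eq. (4)
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)      # Eq. (5)
    c_t = i_t * c_tilde + f_t * c_prev                 # Eq. (6)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)      # Eq. (7)
    h_t = o_t * np.tanh(c_t)                           # Eq. (8)
    return h_t, c_t

# Toy usage: input dimension 4, hidden dimension 3, random parameters.
rng = np.random.default_rng(0)
params = {g: (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)) for g in "icfo"}
h_t, c_t = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), params)
```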

2.3 Attention Mechanism

The soft attention model [12] computes an attention probability distribution over the inputs, which can highlight the importance of a particular word to the whole sentence and take more contextual semantic associations into account [6]. The attention mechanism takes the character vectors or phrase vectors as input and outputs a one-dimensional vector \( {\text{SAM}}_{c} = [{\text{k}}_{1} \cdots {\text{k}}_{\text{n}} ] \) or \( {\text{SAM}}_{w} = [{\text{k}}_{1} \cdots {\text{k}}_{\text{m}} ] \), in which \( {\text{k}}_{x} \) represents the attention probability of position x. Using the attention probabilities to update the character vector output of the Bi-LSTM model, we obtain \( {\text{S}}_{{{\text{A}}_{c} }} \), as shown in formula (9).

$$ {\text{S}}_{{{\text{A}}_{c} }} = \tilde{S}_{{{\text{A}}_{c} }}^{T} \times {\text{SAM}}_{c} $$
(9)

\( \tilde{S}_{{{\text{A}}_{c} }}^{T} \) denotes the transpose of the Bi-LSTM model's output; the updated phrase vector \( {\text{S}}_{{{\text{A}}_{w} }} \) can be obtained similarly.
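A hedged sketch of Eq. (9) is given below. The paper does not fully specify the shapes involved or how the attention scores are produced, so the Bi-LSTM output is assumed to be an n × d matrix (one row per position) and the scores are drawn at random before being normalized with softmax:

```python
# Attention-weighted combination of the Bi-LSTM output, one possible reading of Eq. (9).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n, d = 9, 6                           # sentence length and Bi-LSTM output size (illustrative)
rng = np.random.default_rng(1)

bilstm_out = rng.normal(size=(n, d))  # assumed shape of S~_Ac: one row per character position
scores = rng.normal(size=n)           # assumed per-position attention scores
SAM_c = softmax(scores)               # attention probabilities k_1 ... k_n

S_Ac = bilstm_out.T @ SAM_c           # Eq. (9): S_Ac = S~_Ac^T x SAM_c, shape (d,)
```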

2.4 An Attention-Based Bi-LSTM Model

In this paper, we first use word2vec to vectorize the text, after which the resulting numerical matrix is fed into an attention-based model and a Bi-LSTM model in parallel. We then multiply the outputs of the two models and finally apply softmax to classify the resulting vectors for Chinese NER.

The Bi-LSTM model is a three-layer neural network consisting of two recurrent neural networks and a fully connected layer. One recurrent network processes the sequence from front to back and the other from back to front; a fully connected layer then integrates their results for local feature extraction. The attention mechanism extracts global features through the output attention probability matrix. The model's network structure is shown in Fig. 2.

Fig. 2. Network structure
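A rough Keras sketch of this architecture is shown below, under several assumptions that the paper does not spell out: the attention branch is approximated with a position-wise Dense layer followed by a softmax over positions, and the layer sizes (100-dimensional input vectors, 250-dimensional Bi-LSTM output, 5 labels, sentences padded to length 50) are illustrative:

```python
# Parallel attention and Bi-LSTM branches whose outputs are multiplied and classified with softmax.
from tensorflow.keras import layers, Model

max_len, embed_dim, num_labels = 50, 100, 5

inputs = layers.Input(shape=(max_len, embed_dim))  # pre-trained word2vec vectors

# Bi-LSTM branch (local features): forward and backward LSTMs plus a fully connected layer.
bilstm = layers.Bidirectional(layers.LSTM(125, return_sequences=True))(inputs)
local = layers.Dense(250, activation="tanh")(bilstm)

# Attention branch (global features): one attention probability per position.
scores = layers.Dense(1)(inputs)       # shape (batch, max_len, 1)
attn = layers.Softmax(axis=1)(scores)  # attention probabilities over positions

# Multiply the two branches, then classify each position with softmax.
weighted = local * attn                # broadcasts over the feature axis
outputs = layers.Dense(num_labels, activation="softmax")(weighted)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```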

3 Development Experiments

3.1 Corpus


The vectorization model in this paper uses a corpus of about 398 MB together with the January 1998 issues of the People's Daily (RMRB-98-1), which amount to about 6.38 MB. The character vectors and phrase vectors are trained by the CBOW and Skip-gram models, respectively. Both corpora are manually proofread, so this paper ignores errors that may be caused by segmentation quality.

We use the January 1998 issues of the People's Daily (RMRB-98-1) as the development set. After the Stanford NER is applied to it, two postgraduate students are assigned to double-check the recognition results, and the incorrect and missing labels produced by the annotation tool are corrected. Error examples are shown in Table 1.

Table 1. Labeling error instances

The Stanford NER labels fall into 8 types: O for non-entities; LOCATION for geographic locations and directions; GPE for provinces and other geopolitical entities; FACILITY for factories and other facilities; ORGANIZATION for organizations; DEMONYM for government and similar terms; PERSON for person names; and MISC for other entities such as money, numbers, serial numbers, percentages, times, continuous sequences, and collections.

3.2 Evaluation Index

The evaluation indexes of this paper are precision, recall, and F-score, calculated as shown in formulas (10)-(12).

$$ {\mathbf{precision = }}\frac{{correct_{out} }}{{all_{out} }} $$
(10)
$$ {\mathbf{recall = }}\frac{{correct_{out} }}{{all_{test} }} $$
(11)
$$ {\mathbf{F}}_{{{\mathbf{score}}}} = \frac{2 \times precision \times recall}{precision + recall} $$
(12)

Here \( correct_{out} \) is the number of correctly predicted non-"O" tags, \( all_{out} \) is the total number of non-"O" tags output by the model, and \( all_{test} \) is the total number of non-"O" tags in the test set.
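The three metrics can be computed as in the following sketch, which follows the definitions above and counts only non-"O" tags (the tag sequences in the usage example are hypothetical):

```python
# Precision, recall and F-score over non-"O" tags, per Eqs. (10)-(12).
def evaluate(predicted, gold):
    """predicted, gold: equal-length lists of tags; 'O' marks non-entities."""
    all_out = sum(1 for p in predicted if p != "O")    # non-"O" tags output by the model
    all_test = sum(1 for g in gold if g != "O")        # non-"O" tags in the test set
    correct_out = sum(1 for p, g in zip(predicted, gold)
                      if p != "O" and p == g)          # correctly predicted non-"O" tags
    precision = correct_out / all_out if all_out else 0.0
    recall = correct_out / all_test if all_test else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Hypothetical usage.
print(evaluate(["B", "E", "O", "B", "O"], ["B", "E", "O", "O", "B"]))  # approx. (0.67, 0.67, 0.67)
```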

3.3 Experiment Planning

We used RMRB-98-1 as the data set to analyze the output dimension parameter of the Bi-LSTM model and compared the changes in the evaluation indexes under different dimensions. The experimental results are shown in Fig. 3.

Fig. 3. Output dimension settings

The two local maxima of the output dimension are 250 and 400; as the number of dimensions grows, the system runs more slowly while the experimental results do not improve significantly. Therefore, the output dimension is set to 250.

3.4 Experimental Results

We carried out four groups of experiments, mainly to compare the effects of character vectors versus phrase vectors as input and of the phrase vector training corpus on model recognition. In the first group, RMRB-98-1 was used as the basic corpus for vector training, with the character vectors as the input of the attention model and the phrase vectors as the input of the Bi-LSTM model. In the second group, the laboratory-segmented corpus was used as the basic corpus for vector training, with the phrase vectors as the input of the attention model and the character vectors as the input of the Bi-LSTM model. In the third group, the laboratory-segmented corpus was used as the basic corpus for vector training, with the character vectors as the input of the Bi-LSTM model and the character vectors as the input of the attention model. In the fourth group, RMRB-98-1 was used as the basic corpus for vector training, with the character vectors as the input of the Bi-LSTM model and the character vectors as the input of the attention model. A total of 13,633 samples, approximately 70% of the data, were selected as the training set, while 5,844 samples, approximately 30%, were used as the cross-validation set to prevent overfitting. Word2vec was used for unsupervised training of the data to generate the character vectors and phrase vectors.

Comparing the results of Experiments 1 and 2 and of Experiments 3 and 4 in Table 2 shows that the experiments using RMRB-98-1 as the vectorization data set score slightly higher than those using the laboratory-segmented corpus. This is because the lower domain relevance of the vectorization corpus reduces the relevance of the phrase vectors, leading to lower training accuracy. Comparing the results of Experiments 1 and 4 and of Experiments 2 and 3 shows that character vectors are more effective than phrase vectors for NER. This is because, owing to the incompleteness of the dictionary, many out-of-vocabulary words appear during recognition, which results in recognition errors.

Table 2. Experimental results

We compared our named entity recognition approach with the CRF algorithm used in [1], training on the Chinese corpus described above and on the English data set MUC-6. For MUC-6, our model directly vectorizes the input and feeds it into the network.

Table 3 shows that the Att-BLSTM model achieves better results on Chinese. The main reason is that the character vectors and phrase vectors in this paper are handled separately, whereas the word vectors typical of English do not benefit from this treatment. The model outperforms the traditional CRF algorithm overall: although its recall is lower than that of the CRF algorithm in the English domain, its F-score is higher in general.

Table 3. Experiment result comparisons

4 Conclusion

This paper proposes an attention-based Bi-LSTM approach to Chinese NER that effectively extracts local and global features. Through experiments, we compared the effects of character vectors, phrase vectors, and the underlying vectorization data sets on Chinese NER. The results empirically demonstrate that phrase vectors trained on a highly domain-relevant corpus produce consistently better performance. At present, little research has examined the choice of corpus for vector training and whether the approach can be transferred across domains. As a next step, we plan to add data from the financial domain to further observe the transferability of the system, and we expect our model to retain a distinct advantage over traditional CRF algorithms when analyzing stock-market texts.