Empirical study on character level neural network classifier for Chinese text

https://doi.org/10.1016/j.engappai.2019.01.009

Abstract

Character level models have been drawing increasing attention recently. A number of these models have been proposed and shown to be successful in Natural Language Processing tasks. Most of these models are evaluated mainly on English or other alphabetic languages, and a number of problems arise when they are applied to a non-alphabetic language such as Chinese. In this study, we investigate the problems encountered when transferring these models to Chinese and put forward some solutions. We propose a character level double embedding neural network model that consists of both a CNN and an RNN with two separate embeddings. The model is applied to a fundamental Natural Language Processing task, text classification. Experiments conducted on Chinese corpora demonstrate that our character level neural network model performs as well as or better than word level classification models. Our model reaches 95.9% accuracy on the Chinese Fudan news dataset, outperforming state-of-the-art models.

Introduction

Text classification, the task of assigning predefined labels to text documents, is essential in many natural language processing applications, such as news filtering, information retrieval and sentiment analysis (Aggarwal and Zhai, 2012). Text classification for Chinese is a challenging problem for many researchers because there is no natural delimiter in Chinese text (Chang et al., 2008). To improve the performance of Chinese text classification, efforts must be made on two major tasks: feature selection and classifier selection.

State-of-the-art feature selection is based on the bag-of-words (n-gram) model with discriminative feature selectors such as MI, pLSA and LDA. However, these feature selection methods suffer from data sparsity and are unsatisfactory in capturing the semantics of words, thus hurting classification accuracy (Lai et al., 2015). Word embedding (Mikolov et al., 2013) addresses the data sparsity problem by learning a low dimensional vector for each word. Mikolov et al. showed that word embedding is able to capture the semantics and syntax of words. For Chinese, there is usually a preliminary and important pre-processing step called word segmentation (word parsing), which may enhance the performance of NLP tasks (Chang et al., 2008, Chen et al., 2015, Liu et al., 2017, Tang et al., 2008). Yet the accuracy of word segmentation decreases when handling domain specific text or informally written text such as micro-blogs. The errors propagated to the model further degrade classification performance. An emerging set of models, namely character level classification models (Conneau et al., 2016, Zhang et al., 2015), avoid this problem because segmentation is eliminated. Moreover, the number of Chinese characters is smaller than the number of words, so the sparsity issue is less severe. Therefore, models using character level embeddings are more suitable for Chinese than word level classification models.
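
Because character level models skip segmentation entirely, a Chinese document can be fed to the model as a plain sequence of character ids. The following is a minimal illustrative sketch; the vocabulary, sentences and function names are our own examples, not taken from the paper:

```python
# Minimal sketch: character level indexing of Chinese text.
# No word segmentation is needed; each character is a token.
# The corpus and helper names below are illustrative only.

def build_char_vocab(corpus):
    """Map every distinct character to an integer id (0 reserved for padding)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in corpus:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, max_len=50):
    """Convert a raw string into a fixed-length sequence of character ids."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids

corpus = ["复旦大学发布新闻", "搜狗新闻分类数据"]
vocab = build_char_vocab(corpus)
print(encode("复旦新闻", vocab, max_len=8))
```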

The second task of text classification is selecting a machine learning model. Traditional models with strong baselines include Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), etc. (Wang and Manning, 2012). Recently, neural network models have brought new insights to various NLP tasks. Models applying neural networks to word embeddings without any syntactic or semantic knowledge are reported to be competitive with state-of-the-art models (Kim, 2014, Zhang et al., 2015). The convolutional neural network (CNN) and the recurrent neural network (RNN) are two mainstream neural network architectures. Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a popular choice of RNN. Combining the two models has resulted in better performance in some cases. The recurrent convolutional neural network (Lai et al., 2015) applies a recurrent network over words to learn representations and then a max-pooling layer to extract key features. C-LSTM (Zhou et al., 2015) first performs a convolution operation over word embeddings, followed by an LSTM layer. These models stack one type of network over another, capturing semantics in one layer and feeding them to the next. In other models, a static embedding obtained from an unsupervised neural language model transfers information extracted from a large text corpus to the task (Kim, 2014).

A comparative study of CNNs and RNNs for Natural Language Processing (Yin et al., 2017), from both theory and experiments, concluded that CNNs are hierarchical architectures while RNNs are sequential architectures. For example, given the embeddings e1, e2, e3, e4 of an input sequence, a CNN with a window of size 2 computes features f1 = CNN(e1, e2) and f2 = CNN(e2, e3). Following this pattern, higher order features can be computed. In Zhang et al. (2015), convolution was performed on the characters of an English word to compute a representation of the word. This can be done for Chinese too. The only difference is that, since Chinese has no natural delimiter, convolution is performed on every character combination within the window, capturing both word combinations and non-word combinations. On the other hand, an RNN performs its computation sequentially: at each step it takes the current input and the previous output to compute a new output. This kind of computation means the order of the sequence is taken into account (in language, order carries syntax). The CNN and the RNN therefore capture different types of structural information, the CNN hierarchical and the RNN sequential (Yin et al., 2017). In character level text classification for Chinese, hierarchical structures such as word and phrase combinations can be induced with a CNN, while syntactic information can be learned with an RNN. However, embedding these two kinds of knowledge into one vector is challenging because they are largely unrelated: an appropriate representation may not be induced, and updates to the vector may go back and forth during training.
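
To make the contrast concrete, the sketch below (our own illustration with toy dimensions and random weights, not code from the paper) shows a width-2 convolution producing f1, f2, f3 from four character embeddings, alongside a bare recurrence that consumes the same embeddings in order:

```python
# Illustrative sketch: hierarchical (CNN window) vs. sequential (RNN) computation
# over a sequence of four character embeddings e1..e4.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim, seq_len = 8, 6, 4
E = rng.normal(size=(seq_len, emb_dim))          # embeddings e1..e4

# Convolution: one filter spanning a window of 2 adjacent embeddings.
W_conv = rng.normal(size=(2 * emb_dim, hid_dim))
conv_feats = [np.tanh(np.concatenate([E[i], E[i + 1]]) @ W_conv)
              for i in range(seq_len - 1)]       # f1, f2, f3

# Recurrence: each step combines the current input with the previous state,
# so the order of the sequence matters.
W_in, W_rec = rng.normal(size=(emb_dim, hid_dim)), rng.normal(size=(hid_dim, hid_dim))
h = np.zeros(hid_dim)
for e in E:
    h = np.tanh(e @ W_in + h @ W_rec)

print(len(conv_feats), h.shape)
```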

Given the problems mentioned above, this paper proposes a model that does not require any word segmentation when processing Chinese text data. Two embeddings are used to capture hierarchical and sequential information separately. The model consists of CNN and RNN networks. Instead of stacking one type of network on another, we combine them in a concurrent manner, thus allowing the model to capture different types of information without interference. We conducted experiments using multiple character level models and traditional machine learning models on two Chinese datasets, Fudan News and Sogou News. We also compare our results with those reported in state-of-the-art research.

The main contributions of this study are summarized as follows:

  • We demonstrate that character level classification models are not affected by the propagated error problem because the segmentation step is eliminated. We also show that character level models have many advantages compared with word level models.

  • We present a Double Embedding Neural Network Classifier consisting of a CNN and an RNN. We explain how to combine the two networks so that they capture different types of information without interfering with each other.

  • We applied our Double Embedding Neural Network Classifier to the Chinese text classification task. Experiments show that this character level model outperforms state-of-the-art word level models.


Related work

The task of text classification has been widely studied. For Chinese text, most current models first perform word segmentation before feeding features to a model for classification (Lai et al., 2016, Lu et al., 2010). Unfortunately, inherent errors caused by word segmentation are propagated into the model. Recent works in Chinese text classification show that performance at word level can reach a high accuracy.

Character level models have also been proposed by some researchers. These models

A double embedding neural network classifier

The model for Chinese text classification, shown in Fig. 1, is a combination of a CNN + highway layer and an RNN layer. The input document is mapped to two different embeddings, which are fed to the two blocks of the model. The outputs are concatenated and fed to a softmax classifier. The model is modular: one embedding is fed to the CNN + highway network and the other is fed to the RNN network. As the depth of a network increases, training becomes more and more difficult. The highway
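
As a rough illustration of the layout described above, the following PyTorch-style sketch wires two independent character embeddings into a CNN + highway branch and an LSTM branch whose outputs are concatenated before the classifier. It reflects our reading of the description; the layer sizes, kernel width, pooling strategy and class count are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of a double embedding classifier: separate embeddings feed a
# CNN + highway branch and an LSTM branch; their outputs are concatenated and
# passed to a linear layer whose logits go to a softmax. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin, self.gate = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))                  # transform gate
        return t * F.relu(self.lin(x)) + (1 - t) * x     # gated mix with the input

class DoubleEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_classes=20):
        super().__init__()
        self.emb_cnn = nn.Embedding(vocab_size, emb_dim)   # embedding for the CNN branch
        self.emb_rnn = nn.Embedding(vocab_size, emb_dim)   # separate embedding for the RNN branch
        self.conv = nn.Conv1d(emb_dim, hid_dim, kernel_size=3, padding=1)
        self.highway = Highway(hid_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, char_ids):                            # char_ids: (batch, seq_len)
        c = self.conv(self.emb_cnn(char_ids).transpose(1, 2))   # (batch, hid, seq)
        c = self.highway(F.relu(c).max(dim=2).values)           # max-pool over time, then highway
        _, (h, _) = self.lstm(self.emb_rnn(char_ids))            # final LSTM hidden state
        return self.out(torch.cat([c, h[-1]], dim=1))            # logits for the softmax classifier

logits = DoubleEmbeddingClassifier(vocab_size=5000)(torch.randint(0, 5000, (2, 50)))
print(logits.shape)  # torch.Size([2, 20])
```

Keeping the two embeddings and branches independent until the final concatenation is what lets the CNN branch specialize in local character combinations and the RNN branch in sequence order, without one update interfering with the other.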

Experiments and results

Experiments and results are presented in this section. The performance of the model is evaluated by accuracy on the test set.

Conclusion

Text classification for Chinese is feasible at character level. This paper shows that preprocessing of Chinese, such as word segmentation, is not a prerequisite for Chinese text processing. Neural network models are able to outperform many traditional approaches. Combining different neural network models with multiple embeddings gives the best result. Experiments show that the character level model is able to compete with word level models. Finally, it is necessary to incorporate linguistics

Acknowledgments

This work is supported by China National High-Tech Project (863) under grant (No.2015AA015401). Beijing Key Lab of Networked Multimedia also supports our research work. The work is supported by 973 Program, China (No.2014CB340504), the State Key Program of National Natural Science of China (No.61533018), NSFC-ANR, China (No.61261130588), National Natural Science Foundation of China (No.61402220, No. 61502221), the State Scholarship Fund of CSC, China (No. 201608430240), the Philosophy and

References (26)

  • Liu, Y., et al., 2017. Ensemble method to joint inference for knowledge extraction. Expert Syst. Appl.
  • Lu, S.-H., et al., 2010. Chinese text classification by the Naïve Bayes classifier and the associative classifier with multiple confidence threshold values. Knowl.-Based Syst.
  • Aggarwal, C.C., et al., 2012. A survey of text classification algorithms.
  • Chang, P.-C., et al., 2008. Optimizing Chinese word segmentation for machine translation performance.
  • Chen, X., et al., 2015. Gated recursive neural network for Chinese word segmentation. ACL (1).
  • Conneau, A., Schwenk, H., Barrault, L., LeCun, Y., 2016. Very deep convolutional networks for natural language...
  • Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2012. Improving neural networks by...
  • Hochreiter, S., et al., 1997. Long short-term memory. Neural Comput.
  • Huang, W., Wang, J., 2016. Character-level Convolutional Network for Text Classification Applied to Chinese Corpus,...
  • Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar.
  • Kim, Y., et al., 2016. Character-aware neural language models. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA.
  • Lai, S., et al., 2016. How to generate a good word embedding. IEEE Intell. Syst.
  • Lai, S., et al., 2015. Recurrent convolutional neural networks for text classification.

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.01.009.
