Empirical study on character level neural network classifier for Chinese text

https://doi.org/10.1016/j.engappai.2019.01.009

Abstract

Character level models have been drawing increasing attention recently. A number of these models have been proposed and shown to be successful in Natural Language Processing tasks. Most of these models are evaluated mainly on English or other alphabetic languages, and a number of problems arise when they are applied to a non-alphabetic language such as Chinese. In this study, we investigate the problems encountered when transferring these models to Chinese and put forward some solutions. We propose a character level double embedding neural network model that consists of both a CNN and an RNN with two separate embeddings. The model is applied to a fundamental Natural Language Processing task, text classification. Experiments conducted on Chinese corpora demonstrate that our character level neural network model performs as well as or better than word level classification models. Our model reaches 95.9% accuracy on the Chinese Fudan news dataset, outperforming state-of-the-art models.

Introduction

Text classification, the task of assigning predefined labels to text documents, is essential in many natural language processing applications, such as news filtering, information retrieval and sentiment analysis (Aggarwal and Zhai, 2012). Text classification for Chinese is a challenging problem for many researchers because there is no natural delimiter in Chinese text (Chang et al., 2008). To improve the performance of Chinese text classification, efforts must be made on two major tasks: feature selection and classifier selection.

State-of-the-art feature selection is based on the bag-of-words (n-gram) model with discriminative feature selectors such as MI, pLSA and LDA. However, these feature selection methods suffer from data sparsity and are unsatisfactory in capturing the semantics of words, thus hurting classification accuracy (Lai et al., 2015). Word embedding (Mikolov et al., 2013) addresses the data sparsity problem by learning a low dimensional vector for each word. Mikolov et al. showed that word embedding is able to capture the semantics and syntax of words. For Chinese, there is usually a preliminary and important pre-processing step called word segmentation (word parsing), which may enhance the performance of NLP tasks (Chang et al., 2008, Chen et al., 2015, Liu et al., 2017, Tang et al., 2008). Yet the accuracy of word segmentation decreases when handling domain specific text or informally written text such as micro-blogs. The errors propagated to the model further degrade classification performance. An emerging set of models, namely character level classification models (Conneau et al., 2016, Zhang et al., 2015), avoid this problem because segmentation is eliminated. Moreover, the number of Chinese characters is smaller than the number of words, so the sparsity issue is less severe. Therefore, models using character level embeddings are more suitable for Chinese than word level classification models.
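
Because character level models skip segmentation entirely, a Chinese document can be fed to the model as a plain sequence of character ids. The following is a minimal illustrative sketch; the vocabulary, sentences and function names are our own examples, not taken from the paper:

```python
# Minimal sketch: character level indexing of Chinese text.
# No word segmentation is needed; each character is a token.
# The corpus and helper names below are illustrative only.

def build_char_vocab(corpus):
    """Map every distinct character to an integer id (0 reserved for padding)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in corpus:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab, max_len=50):
    """Convert a raw string into a fixed-length sequence of character ids."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids

corpus = ["复旦大学发布新闻", "搜狗新闻分类数据"]
vocab = build_char_vocab(corpus)
print(encode("复旦新闻", vocab, max_len=8))
```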

The second task of text classification is selecting a machine learning model. Traditional models with strong baselines include Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), etc. (Wang and Manning, 2012). Recently, neural network models have brought new insights to various NLP tasks. Models applying neural networks to word embeddings without any syntactic or semantic knowledge are reported to be competitive with state-of-the-art models (Kim, 2014, Zhang et al., 2015). The convolutional neural network (CNN) and the recurrent neural network (RNN) are two mainstream neural network architectures. Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a popular choice of RNN. Combining the two models has resulted in better performance in some cases. The recurrent convolutional neural network (Lai et al., 2015) applies a recurrent network over words to learn representations and then a max-pooling layer to extract key features. C-LSTM (Zhou et al., 2015) first performs a convolution operation over word embeddings, followed by an LSTM layer. These models stack one type of network over another, capturing semantics in one layer and feeding them to the next. In other models, a static embedding obtained from an unsupervised neural language model transfers information extracted from a large text corpus to the task (Kim, 2014).

A comparative study of CNNs and RNNs for Natural Language Processing (Yin et al., 2017), from both theory and experiments, concluded that CNNs are hierarchical architectures while RNNs are sequential architectures. For example, given the embeddings e1, e2, e3, e4 of an input sequence, a CNN with a window of size 2 computes features f1 = CNN(e1, e2) and f2 = CNN(e2, e3). Following this pattern, higher order features can be computed. In Zhang et al. (2015), convolution was performed on the characters of an English word to compute a representation of the word. This can be done for Chinese too. The only difference is that, since Chinese has no natural delimiter, convolution is performed on every character combination within the window, capturing both word combinations and non-word combinations. On the other hand, an RNN performs its computation sequentially: at each step it takes the current input and the previous output to compute a new output. This kind of computation means the order of the sequence is taken into account (in language, order carries syntax). The CNN and the RNN therefore capture different types of structural information, the CNN hierarchical and the RNN sequential (Yin et al., 2017). In character level text classification for Chinese, hierarchical structures such as word and phrase combinations can be induced with a CNN, while syntactic information can be learned with an RNN. However, embedding these two kinds of knowledge into one vector is challenging because they are largely unrelated: an appropriate representation may not be induced, and updates to the vector may go back and forth during training.
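
To make the contrast concrete, the sketch below (our own illustration with toy dimensions and random weights, not code from the paper) shows a width-2 convolution producing f1, f2, f3 from four character embeddings, alongside a bare recurrence that consumes the same embeddings in order:

```python
# Illustrative sketch: hierarchical (CNN window) vs. sequential (RNN) computation
# over a sequence of four character embeddings e1..e4.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim, seq_len = 8, 6, 4
E = rng.normal(size=(seq_len, emb_dim))          # embeddings e1..e4

# Convolution: one filter spanning a window of 2 adjacent embeddings.
W_conv = rng.normal(size=(2 * emb_dim, hid_dim))
conv_feats = [np.tanh(np.concatenate([E[i], E[i + 1]]) @ W_conv)
              for i in range(seq_len - 1)]       # f1, f2, f3

# Recurrence: each step combines the current input with the previous state,
# so the order of the sequence matters.
W_in, W_rec = rng.normal(size=(emb_dim, hid_dim)), rng.normal(size=(hid_dim, hid_dim))
h = np.zeros(hid_dim)
for e in E:
    h = np.tanh(e @ W_in + h @ W_rec)

print(len(conv_feats), h.shape)
```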

Given the problems mentioned above, this paper proposes a model that does not require any word segmentation when processing Chinese text data. Two embeddings are used to capture hierarchical and sequential information separately. The model consists of CNN and RNN networks. Instead of stacking one type of network on another, we combine them in a concurrent manner, thus allowing the model to capture different types of information without interference. We conducted experiments using multiple character level models and traditional machine learning models on two Chinese datasets, Fudan News and Sogou News. We also compare our results with those reported in state-of-the-art research.

The main contributions of this study are summarized as follows:

  • We demonstrate that character level classification models are not affected by the propagated error problem because the segmentation step is eliminated. We also show that character level models have many advantages compared with word level models.

  • We present a Double Embedding Neural Network Classifier consisting of a CNN and an RNN. We explain how to combine the two networks so that they capture different types of information without interfering with each other.

  • We applied our Double Embedding Neural Network Classifier to the Chinese text classification task. Experiments show that this character level model outperforms state-of-the-art word level models.


Related work

The task of text classification has been widely studied. For Chinese text, most current models first perform word segmentation before feeding features to a model for classification (Lai et al., 2016, Lu et al., 2010). Unfortunately, inherent errors caused by word segmentation are propagated into the model. Recent works in Chinese text classification show that performance at word level can reach a high accuracy.

Character level models have also been proposed by some researchers. These models

A double embedding neural network classifier

The model for Chinese text classification, shown in Fig. 1, is a combination of a CNN + highway layer and an RNN layer. The input document is mapped to two different embeddings, which are fed to the two blocks of the model. The outputs are concatenated and fed to a softmax classifier. The model is modular: one embedding is fed to the CNN + highway network and the other is fed to the RNN network. As the depth of a network increases, training becomes more and more difficult. The highway
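
As a rough illustration of the layout described above, the following PyTorch-style sketch wires two independent character embeddings into a CNN + highway branch and an LSTM branch whose outputs are concatenated before the classifier. It reflects our reading of the description; the layer sizes, kernel width, pooling strategy and class count are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of a double embedding classifier: separate embeddings feed a
# CNN + highway branch and an LSTM branch; their outputs are concatenated and
# passed to a linear layer whose logits go to a softmax. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin, self.gate = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))                  # transform gate
        return t * F.relu(self.lin(x)) + (1 - t) * x     # gated mix with the input

class DoubleEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_classes=20):
        super().__init__()
        self.emb_cnn = nn.Embedding(vocab_size, emb_dim)   # embedding for the CNN branch
        self.emb_rnn = nn.Embedding(vocab_size, emb_dim)   # separate embedding for the RNN branch
        self.conv = nn.Conv1d(emb_dim, hid_dim, kernel_size=3, padding=1)
        self.highway = Highway(hid_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, char_ids):                            # char_ids: (batch, seq_len)
        c = self.conv(self.emb_cnn(char_ids).transpose(1, 2))   # (batch, hid, seq)
        c = self.highway(F.relu(c).max(dim=2).values)           # max-pool over time, then highway
        _, (h, _) = self.lstm(self.emb_rnn(char_ids))            # final LSTM hidden state
        return self.out(torch.cat([c, h[-1]], dim=1))            # logits for the softmax classifier

logits = DoubleEmbeddingClassifier(vocab_size=5000)(torch.randint(0, 5000, (2, 50)))
print(logits.shape)  # torch.Size([2, 20])
```

Keeping the two embeddings and branches independent until the final concatenation is what lets the CNN branch specialize in local character combinations and the RNN branch in sequence order, without one update interfering with the other.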

Experiments and results

Experiments and results are presented in this section. The performance of the model is evaluated by accuracy on the test set.

Conclusion

Text classification for Chinese is feasible at character level. This paper shows that preprocessing of Chinese, such as word segmentation, is not a prerequisite for Chinese text processing. Neural network models are able to outperform many traditional approaches. Combining different neural network models with multiple embeddings gives the best result. Experiments show that the character level model is able to compete with word level models. Finally, it is necessary to incorporate linguistics

Acknowledgments

This work is supported by China National High-Tech Project (863) under grant (No.2015AA015401). Beijing Key Lab of Networked Multimedia also supports our research work. The work is supported by 973 Program, China (No.2014CB340504), the State Key Program of National Natural Science of China (No.61533018), NSFC-ANR, China (No.61261130588), National Natural Science Foundation of China (No.61402220, No. 61502221), the State Scholarship Fund of CSC, China (No. 201608430240), the Philosophy and

References (26)

  • Liu, Y., et al., 2017. Ensemble method to joint inference for knowledge extraction. Expert Syst. Appl.
  • Lu, S.-H., et al., 2010. Chinese text classification by the Naïve Bayes classifier and the associative classifier with multiple confidence threshold values. Knowl.-Based Syst.
  • Aggarwal, C.C., et al., 2012. A survey of text classification algorithms.
  • Chang, P.-C., et al., 2008. Optimizing Chinese word segmentation for machine translation performance.
  • Chen, X., et al., 2015. Gated recursive neural network for Chinese word segmentation. ACL (1).
  • Conneau, A., Schwenk, H., Barrault, L., LeCun, Y., 2016. Very deep convolutional networks for natural language...
  • Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2012. Improving neural networks by...
  • Hochreiter, S., et al., 1997. Long short-term memory. Neural Comput.
  • Huang, W., Wang, J., 2016. Character-level Convolutional Network for Text Classification Applied to Chinese Corpus,...
  • Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar.
  • Kim, Y., et al., 2016. Character-aware neural language models. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA.
  • Lai, S., et al., 2016. How to generate a good word embedding. IEEE Intell. Syst.
  • Lai, S., et al., 2015. Recurrent convolutional neural networks for text classification.

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.01.009.
