1 Introduction

Automatic document classification is applied in numerous electronic business (e-business) scenarios [1, 16]. For example, a medium-sized company may receive many emails daily that lack accurate routing information such as the recipient's name or department; these emails must be read by an assigned agent before their destinations can be determined. An automatic document classification system can therefore reduce human workload to a great extent.

More generally, given the rapid growth of digital documents on the web, it is often beyond one's ability to categorize information by reading through the entire pool of documents. Accurate and automatic text classification techniques are hence needed to classify incoming text documents into categories such as news, contracts, reports, etc. Users can then estimate the content and determine the priority of each document, maintaining a more organized working schedule.

A typical method of automatic text classification is as follows: given a training set of documents with known categorical labels and word dependency information, calculate the likelihood of each candidate label for each test document. The label with the highest likelihood corresponds to the predicted category of the test document. Classical machine learning (ML) algorithms such as the Bayesian classifier, decision tree, k-nearest neighbor, support vector machine and neural network have often been applied to text classification [11]. In recent years, deep learning algorithms have also been introduced for these tasks. One representative trial was the application of the convolutional neural network (CNN), a powerful network in computer vision [12]. The recurrent neural network, which can capture information that has been computed so far, was later introduced and became a popular method for handling sequence-formed information, with satisfactory classification performance [24].

However, most of the strategies mentioned above seldom view the classification problem from the perspective of semantic analysis. For example, the traditional Bayesian-based text classification method constructs a classification model based on the frequencies of some feature words in the corpus. Unfortunately, it does not consider polysemous words (a word that holds different meanings depending on the context) or synonymous words (different words that hold a similar meaning) for semantic analysis during the classification procedure. For example, the Chinese word “Xiaomi” can mean either an agricultural product or a high-tech company; hence documents including “Xiaomi” may be classified as either “agriculture” or “technology” when using the traditional Bayesian method. Similar problems also exist in the classification of English documents. For example, English documents containing the word “program” may refer to computer code and be classified as “computer”, but may also refer to a scheduled radio or television show and be classified as “entertainment”.

On the other hand, synonymous words can also cause mis-classification of documents. For example, the word “people” is synonymous with “mass” and “mob”, and these words may occur in documents on various topics (e.g., architecture, culture and history). Therefore, choosing these words as features of the classification model may cause classification errors. The same situations exist in the document classification tasks of word-embedding-based deep learning methods: during feature extraction, word dependence is calculated from a statistical analysis of the posterior probability of one word following another. However, a single embedding cannot represent multiple meanings, while similar embeddings may refer to different topic types.

More precisely, the two problems are described as follows:

  (1) Problem of polysemy: some words have multiple meanings, which may lead to mis-classification of documents;

  (2) Problem of synonymy: different words with similar meanings are often used in different scenarios, but when they appear in the same article, they may lead to mis-classification of documents.

In the later sections of this paper, we address these two research problems.

Khan et al. [11] suggested that semantic analysis could help enhance classification performance. In practice, semantic analysis is generally implemented by introducing an ontology that represents terms and concepts in a domain-wise manner, where the domains are pre-defined by expert knowledge bases [11]. Although a few attempts have been made, such as using ontological knowledge [6] and WordNet for word sense disambiguation (WSD) [13, 21], so far only limited progress has been achieved. This is mainly due to the domain constraints of ontologies or the ambiguity across different natural languages, which may lead to polysemy and synonymy issues [19] and finally result in uncertainty in document classification [7].

In this research, we report a novel semantic embedding and similarity computing approach to implement semantic document categorization. The first strategy aims to solve the polysemy problem by using a novel semantic similarity computing method (SSC), so that the most context-fitting meaning of a word in a sentence can be determined by referring to the meanings of similar sentences expressing this word in a common dictionary. In this paper, CoDic [8, 22] and Hownet [5] are used as the common dictionaries for meaning determination and term expansion. With their help, words with ambiguity are removed from the feature list, enabling more distinctive features to be selected. The second strategy aims to solve the synonym problem by adopting a strong correlation analysis method (SCM), in which synonyms unrelated to the classification task are deleted; otherwise, a specific meaning of one word in the synonym group is selected from the common dictionary and used to replace the other words in the same group.

2 Related Work

Automated document classification, also called document categorization, has a history dating back to the beginning of the 1960s. The incredible increase in online documents in recent decades intensified and renewed interest in automated document classification and data mining. In the beginning, document classification focused on heuristic methods, that is, solving the task by applying a group of rules based on expert knowledge. However, this approach proved inefficient, so in recent years more attention has turned to automatic learning and clustering approaches. These approaches can be divided into three categories based on the characteristics of their learning phases:

  (1) Supervised document classification: this method guides the whole learning process of a classifier model by providing a complete training dataset that contains both document content and category labels. The process of supervision is like training students using exercises with “correct” answers.

  (2) Semi-supervised document classification: a mixture of supervised and unsupervised document classification. Some of the documents have category labels while the others do not.

  (3) Unsupervised document classification: this method is executed without a priori knowledge of the document categories. The process of unsupervised learning is like that of students taking a final examination for which they have no standard answers for reference.

However, regardless of the learning method, most approaches require the conversion of unstructured text into numbers during the data pre-processing stage. The most traditional (and intuitive) algorithm is one-hot representation, which uses an N-dimensional binary vector to represent the vocabulary, with each dimension standing for one word [11]. However, this strategy easily incurs the curse of dimensionality when representing long texts, because a big vocabulary generates high-dimensional but extremely sparse vectors for long documents. Therefore, a dimensionality reduction operation that removes redundant and irrelevant features is needed [2]. This demand is satisfied by the methodology called feature extraction/selection.

The goal of feature extraction is to divide a sentence into meaningful clusters while removing insignificant components as much as possible. Typical tasks at the pre-processing stage include tokenization, filtering, lemmatization and stemming [20]. After that, feature selection aims to select useful features of a word for further analysis. Compared with one-hot representation, which generates high-dimensional, sparse vectors, an improved solution called TF-IDF produces more refined results. In this frequency-based algorithm, the “importance” of a word is represented by the product of term frequency (how frequently the word shows up in a document) and inverse document frequency (the log-inverse of the fraction of documents in the overall document base that contain the word) [14, 20]. These two algorithms, however, clearly suffer from limitations as a result of neglecting the grammar and word relations in documents.

More recently, distributed representations that capture dependencies between words have become more widely used, as they reflect the relationships of words within a document [15]. Currently, the most widely used strategy for learning vectorized words is to maximize the corpus likelihood (prediction-based), with the word2vec toolbox being one of the most popular tools. The implementation of this algorithm depends on training a representation neural network that takes words in the form of binary vectors generated by one-hot representation. The weights of the network keep being updated until convergence, which yields, for each input word, a vector listing the probability of each word following it in a document [11, 15].
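As an illustration of the two frequency-based representations discussed above, the following is a minimal sketch using scikit-learn; the toy corpus and all parameter choices are assumptions, not part of the original experiments.

```python
# A minimal sketch contrasting one-hot bag-of-words and TF-IDF on a
# hypothetical toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the program aired on national radio",
    "the program crashed with a syntax error",
    "the radio show was popular",
]

# One-hot style representation: one binary dimension per vocabulary word.
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(corpus).toarray())

# TF-IDF: weights each word by term frequency times inverse document
# frequency, down-weighting words common to many documents.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
print(tfidf.get_feature_names_out())
```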

3 Semantic Document Classification

This section proposes two novel strategies to resolve the research problems mentioned above.

3.1 Strategy to Resolve Polysemy Problem: SSC

The first strategy aims to solve the polysemy problem by using a novel semantic similarity computing method. As previously mentioned, the most context-fitting meaning of a word can be determined by referring to the semantics of related sentences in a common dictionary (e.g., CoDic for English and Hownet for Chinese).

In our method, we implement the semantic similarity computing method (SSC) to compare the similarity between two sentences. The SSC splits a text document into sentences. For each word (w) in a sentence (s), all of its concepts are extracted from the dictionary based on its part-of-speech (PoS) tag in the sentence. Then, each concept of w is semantically compared with s, and the concept with the highest similarity score is returned. Words without a determinative meaning are removed from the list of features, so that more distinctive terms are likely to remain and be selected as features. The pseudocode of the SSC algorithm is shown in Table 1.

Table 1. Semantic similarity computing (SSC)

The workflow of the SSC is quite simple. According to Table 1, the first step is to segment each sentence into words (word_tokenize) and tag them (pos_tag) with their parts of speech. Then, we get the synonym set (synset) for each tagged word in the sentence according to its PoS (tagged_to_synset). After that, we remove the None components from each synset list. Next, for each synset in the first sentence (sent1), we compute the similarity score of the most similar synset (path_similarity) in the second sentence (sent2). After computing the similarity scores of all synsets of sent1 against those of sent2, an average similarity value between the two sentences is returned. By using this method, we can acquire the similarity values between all test sentences (ss) and the target sentence (ts). In the end, the test sentence with the highest similarity value is chosen as the most semantically similar sentence.
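For concreteness, the following is a minimal runnable sketch of this workflow, using NLTK's WordNet in place of CoDic/Hownet; the helper names mirror Table 1, but the example sentences and implementation details are assumptions.

```python
# A sketch of the SSC workflow with NLTK's WordNet as the dictionary
# (the paper uses CoDic/Hownet); requires the nltk data packages
# punkt, averaged_perceptron_tagger and wordnet.
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def tagged_to_synset(word, tag):
    # Map a Penn Treebank tag to a WordNet PoS and return the first synset.
    wn_tag = {'N': wn.NOUN, 'V': wn.VERB, 'J': wn.ADJ, 'R': wn.ADV}.get(tag[0])
    synsets = wn.synsets(word, wn_tag) if wn_tag else []
    return synsets[0] if synsets else None

def sentence_similarity(sent1, sent2):
    synsets1 = [tagged_to_synset(w, t) for w, t in pos_tag(word_tokenize(sent1))]
    synsets2 = [tagged_to_synset(w, t) for w, t in pos_tag(word_tokenize(sent2))]
    synsets1 = [s for s in synsets1 if s]     # remove the None components
    synsets2 = [s for s in synsets2 if s]
    score, count = 0.0, 0
    for s1 in synsets1:
        # Score of the most similar synset in sent2 (path_similarity).
        sims = [s1.path_similarity(s2) for s2 in synsets2]
        sims = [x for x in sims if x is not None]
        if sims:
            score += max(sims)
            count += 1
    return score / count if count else 0.0    # average similarity value

# Choose the test sentence (ss) most similar to the target sentence (ts).
ts = "The program aired on the radio."
ss = ["She wrote a program in Python.", "The show was broadcast on television."]
print(max(ss, key=lambda s: sentence_similarity(ts, s)))
```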

Cosine Similarity (CS) is a commonly used method to compute the similarity score between two vectors (e.g., m for \(sentence_1\), n for \(sentence_2\)) by measuring the cosine of the angle \(\theta \) between them.

$$\begin{aligned} CS = \cos (\theta )= \frac{m \cdot n}{\left\| m \right\| \left\| n \right\| }=\frac{\sum \limits _{i=1}^N m_i n_i}{\sqrt{\sum \limits _{i=1}^N m_i^2}\sqrt{\sum \limits _{i=1}^N n_i^2}} \end{aligned}$$
(1)

Therefore, for CS, the most important step is to convert sentences into vectors. A common way is to use the bag-of-words model with TF (term frequency) or TF-IDF (term frequency-inverse document frequency) weights. Another method is to utilize Word2Vec or self-trained word embeddings to implement the mapping from words to vectors.
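A minimal sketch of formula (1) on TF-IDF sentence vectors follows; the two example sentences are assumptions.

```python
# Cosine similarity (formula (1)) between two TF-IDF sentence vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sents = ["the program aired on the radio",
         "the show was broadcast on the radio"]
m, n = TfidfVectorizer().fit_transform(sents).toarray()
cs = np.dot(m, n) / (np.linalg.norm(m) * np.linalg.norm(n))
print(round(float(cs), 3))
```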

3.2 Strategy to Resolve Synonym Problem: SCM

There may be many synonyms in a large text, but not all of them are suitable text features. As is well known, selecting effective text features can reduce the dimension of the feature space, enhance the generalization ability of the model and reduce overfitting, so as to improve the effectiveness and efficiency of classification and clustering [3]. Therefore, effective feature selection is particularly important. In this paper, we refine the synonym problem into a sub-problem: how to determine the degree of relevance between a feature and the classification task, and then remove the feature words in the synonym group that are less relevant or irrelevant to the classification task.

In this paper, a novel correlation analysis algorithm, named SCM, is proposed to obtain effective feature sets. The idea of the SCM contains two important considerations:

  (1) The feature words with strong category discrimination ability are extracted by using the category discrimination method (CDM), and then the correlation between the other feature words and the categories is measured by the feature correlation analysis (FCA). That is, the selected features are first guaranteed to be the most relevant to the category, and then the degree of correlation between the other features and the selected features is calculated.

  (2) When a feature showing a strong correlation with an already selected feature is found, the SCM will not include it in the feature candidate set even if the feature has a strong correlation with the category. This is because, compared with the existing feature candidate set, the new undetermined feature cannot provide additional category-related information. The mathematical foundation of this idea is that linearly dependent vectors cannot construct the basis of a vector space, but orthogonal vectors can.

This paper adopts TF-IDF (Term frequency-Inverse Document Frequency) as the implementation of CDM. By applying TF-IDF to the synonym group in undetermined features, we can get a feature candidate set composed of a number of features with strong category discrimination ability. In TF-IDF, the importance of a word is represented by the product of the word frequency (i.e., the frequency with which the word appears in the document) and the inverse document frequency (i.e., dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). The formulas of TF-IDF are as follows.

$$\begin{aligned} tf_{i,j}=\frac{n_{{i,j}}}{\sum _{k}n_{k,j}} \end{aligned}$$
(2)
$$\begin{aligned} idf_{i} =\lg {\frac{|D|}{|\{D_j:t_{i}\in d_{j}\}|+1}} \end{aligned}$$
(3)
$$\begin{aligned} tf{-}idf_{i,j}=tf_{i,j}\times idf_{i} \end{aligned}$$
(4)

where (2) refers to the importance of a term \(t_i\) in a particular document \(d_j\). The numerator \(n_{i,j}\) is the number of occurrences of \(t_i\) in \(d_j\), and the denominator is the total number of occurrences of all words in \(d_j\). Formula (3) is a measurement of the general importance of a word across all documents. Its numerator represents the total number of documents in the corpus, and its denominator represents the number of documents containing the word \(t_i\), plus one to avoid division by zero. Formula (4) is the product of “term (word) frequency (TF)” and “inverse document frequency (IDF)”. The more important a word is to a certain category of texts, the higher its tf-idf value will be, and vice versa. Therefore, TF-IDF tends to filter out common words and retain words important to certain categories of texts.
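A minimal sketch of formulas (2)-(4) follows, interpreting lg as log base 10; the tokenized toy corpus is an assumption.

```python
# Direct implementation of formulas (2)-(4) on a hypothetical corpus.
import math
from collections import Counter

docs = [["xiaomi", "phone", "market", "phone"],
        ["xiaomi", "crop", "harvest"],
        ["market", "stock", "finance"]]

def tf(term, doc):                            # formula (2)
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term, docs):                          # formula (3), lg = log10
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / (df + 1))

def tf_idf(term, doc, docs):                  # formula (4)
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("phone", docs[0], docs), 4))   # ~0.088
```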

The SCM proceeds to calculate how strongly each feature (in each synonym group) in the feature candidate set is related to the category (C). The formulas are as follows,

$$\begin{aligned} H(X)= \sum _{i=1}^n p_i \lg \frac{1}{p_i} \end{aligned}$$
(5)
$$\begin{aligned} H(X|Y)= \sum _{j} p(Y_j) \sum _{i} p(X_i|Y_j) lg \frac{1}{p(X_i|Y_j)} \end{aligned}$$
(6)
$$\begin{aligned} I(X|Y)=H(X) - H(X|Y) \end{aligned}$$
(7)
$$\begin{aligned} Corr(X,Y)= \frac{I(X|Y)+I(Y|X)}{H(X)+H(Y)} \end{aligned}$$
(8)

where X is an n-dimensional random variable and Y is a certain class (or category). Formula (5) represents the entropy of X, that is, the uncertainty of X. Formula (6) gives the uncertainty of X given the occurrence of Y. Formula (7) represents the information gain between H(X) and H(X|Y). Formula (8) is used to measure the degree of correlation between a feature (X) and a category (Y).

According to the degree of correlation, the features in each synonym group are arranged in descending order, and the ordered feature sequences are sent back into the feature candidate set. The first feature in the sequence, i.e., the feature with the strongest correlation with the category (C), is then selected, removed from the feature candidate set and put into the feature result set.
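The following is a minimal sketch of formulas (5)-(8) for discrete (0/1) feature vectors; the toy data and the base-10 logarithm (matching the lg notation above) are assumptions.

```python
# Entropy, conditional entropy, information gain and Corr (formulas (5)-(8)).
import math
from collections import Counter

def H(xs):                                    # formula (5)
    n = len(xs)
    return sum(c / n * math.log10(n / c) for c in Counter(xs).values())

def H_cond(xs, ys):                           # formula (6): H(X|Y)
    n = len(ys)
    return sum(cy / n * H([x for x, y in zip(xs, ys) if y == yv])
               for yv, cy in Counter(ys).items())

def I(xs, ys):                                # formula (7): information gain
    return H(xs) - H_cond(xs, ys)

def corr(xs, ys):                             # formula (8)
    return (I(xs, ys) + I(ys, xs)) / (H(xs) + H(ys))

# Occurrence (0/1) of one feature per document vs. document category.
x = [1, 1, 0, 0, 1, 0]
y = ["tech", "tech", "agri", "agri", "tech", "agri"]
print(round(corr(x, y), 3))                   # 1.0: perfectly correlated
```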

In order to eliminate redundant features, it is necessary to calculate the degree of mutual independence between any two features (within a synonym group). Thus, this section proposes a novel feature correlation analysis method, called FCA, to exclude unnecessary features in the synonym groups of the feature candidate set. The idea of the FCA is simple: if a remaining feature in the candidate set is a strongly category-correlated feature, and its mutual independence from the selected feature is greater than or equal to a threshold \(\alpha \), this indicates that the candidate feature is independent of the selected feature, and it is included in the feature result set. Otherwise, the feature is considered redundant and is deleted. This process is repeated until the feature candidate set is empty. The formulas are as follows:

$$\begin{aligned} IDP(X_i,X_j)= \frac{I(X_i;Y|X_j)+I(X_j;Y|X_i)}{2H(Y)} \end{aligned}$$
(9)
$$\begin{aligned} I(X;Y|Z)= \lg \frac{p(X|Y,Z)}{p(X|Z)} \end{aligned}$$
(10)

where (9) is used to measure the degree of mutual independence between features \(X_i\) and \(X_j\) when the category (Y) is known. Formula (10) describes the mutual information between feature X and feature Y given condition Z.
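Under these definitions, the FCA loop can be sketched as follows. The sketch reuses H and H_cond from the previous code block, computes I(X;Y|Z) through the standard identity H(X|Z) − H(X|Y,Z) rather than the pointwise form of (10), and the threshold value for \(\alpha \) is hypothetical.

```python
# A sketch of the FCA selection loop; H and H_cond come from the sketch above.
def cond_info(xs, ys, zs):
    # I(X;Y|Z) via the identity H(X|Z) - H(X|Y,Z).
    return H_cond(xs, zs) - H_cond(xs, list(zip(ys, zs)))

def idp(xi, xj, y):                           # formula (9)
    return (cond_info(xi, y, xj) + cond_info(xj, y, xi)) / (2 * H(y))

def fca_select(candidates, columns, y, alpha=0.5):
    # candidates: feature names sorted by corr(feature, y), descending.
    # columns: feature name -> 0/1 occurrence vector over the documents.
    result = []
    while candidates:
        best = candidates.pop(0)              # strongest remaining feature
        result.append(best)
        # Keep only features sufficiently independent of the selected one;
        # the rest are considered redundant and deleted.
        candidates = [f for f in candidates
                      if idp(columns[f], columns[best], y) >= alpha]
    return result
```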

4 Experiments

This section designs experiments for a comparison between classical document classification algorithms and our improved ones.

4.1 Datasets Description

To test the reliability and robustness of our strategy, we use:

Dataset 1: a movie review dataset from Rotten Tomatoes [17, 25]. This dataset contains 10662 review sentences, 50% positive and the remaining negative. The vocabulary size of the dataset is 18758. Since the dataset does not come with an official train/test split, we simply extract \(10\%\) of the shuffled data as an evaluation (dev) set to control the complexity of the model. In the next research stage, we will use 10-fold cross-validation on the dataset.

Dataset 2: a Chinese news dataset with 56821 articles. It is available on PaddlePaddle, an open source deep learning platform launched by Baidu. It contains 10 categories: international (4354), culture (5110), entertainment (6043), sports (4818), finance (7432), automobile (7469), education (8066), technology (6017), stock (3654) and real estate (3858).

4.2 Experiment on Neural Network (NN)

In this experiment, the baseline CNN is taken as an example to compare the document classification performance of a classical NN with that of one improved by our proposed strategy. We set the same hyper-parameters to compare CNN with our method (\(Sem_{CNN}\)) (see Table 2). As Table 2 shows, both trained models are evaluated on the dev dataset every 100 global steps and then stored in checkpoints before training resumes. After multiple training epochs, the models stored in the checkpoints can be recovered and used for testing on a new dataset. The detailed neural network structure can be found in the open source code. The experimental procedure is described as follows.

Table 2. Hyper-parameters used in CNN and \(Sem_{CNN}\)
  (1) Each document in the corpus is first transformed into a semantic document (i.e., a document with semantics embedded) [23] by extending each polysemous word and each category-correlated synonymous word with its context-fitting concepts from the common dictionary (i.e., CoDic for English and Hownet for Chinese), with the help of the SSC and SCM strategies; this aims at accurate semantic interpretation and term expansion. CoDic is a semantic collaboration dictionary constructed under our CONEX project [8, 22, 23]. In CoDic, each concept is identified by a unique internal identifier (iid). The reason for this design is to guarantee the semantic consistency and interoperability of documents transferred across heterogeneous contexts. For example, from Figs. 1 and 2, it is clear that in CoDic, the word “program” with the meaning of “a scheduled radio or television show” is uniquely labeled by the iid “0x5107df021015”, while its other meaning, “a set of coded instructions for insertion into a machine ...”, has another unique iid, “0x5107df02101c”. Currently, CoDic is implemented in XML, where each concept is represented as an entry with a unique iid (see Fig. 3). It is convenient to extract all the different meanings of any given word for later semantic analysis by using existing packages (e.g., xml.etree.cElementTree for Python and javax.xml.parsers for Java). Hownet is used similarly as the common dictionary for Chinese documents.

  (2) Build a \(Sem_{CNN}\) network. The first layer embeds words and their extracted accurate concepts into low-dimensional vectors. The second layer performs convolutions over the semantic-embedded document tensors using filters of different sizes (e.g., \(filter\_size\) = [3, 4, 5]); filters of different sizes create feature maps (i.e., tensors) of different shapes. Third, max-pooling merges the results of the convolution layer into one long feature vector. Next, dropout regularization is applied to the result of max-pooling to trade off between the complexity of the model being trained and its generalization on the evaluation dataset. The last layer classifies the result using a softmax strategy. (A sketch of this architecture is given after this list.)

  (3) Calculate loss and accuracy. The general loss function for classification problems is the cross-entropy loss, which takes the prediction and the true value as input. Accuracy is another useful metric tracked during the training and testing processes. It can be used to prevent overfitting by interrupting the training process at the turning point where the classification accuracy on the evaluation dataset starts decreasing even though the error on the training dataset keeps declining. The parameters taken at this critical point are then used as the model training results.

  (4) Record the summaries/checkpoints during training and evaluation. After an object of the CNN/\(Sem_{CNN}\) class is declared, batches of data are generated and fed into it to train a reliable classification model. While the loss and accuracy are recorded to keep track of their evolution over iterations, some important parameters (e.g., the embedding of each word, the weights of the convolution layers) are also saved for later usage (e.g., testing on new datasets).

  (5) Test the classification model. The test data are loaded and their true labels extracted for measuring the prediction performance. Then, the classification model is restored from the checkpoints and executed on the test dataset, producing a prediction for each semantic document. After that, the prediction results are compared with the true labels to obtain the testing accuracy of the classification model.
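As referenced in step (2), the following is a minimal PyTorch sketch of the \(Sem_{CNN}\) architecture; the hyper-parameter values are common TextCNN defaults rather than those of Table 2, and the semantically embedded input is assumed to be a single sequence of token ids.

```python
# A sketch of the Sem_CNN layers from step (2): embedding, multi-size
# convolutions, max-pooling, dropout and a softmax classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2,
                 filter_sizes=(3, 4, 5), num_filters=128, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)       # layer 1
        self.convs = nn.ModuleList(                                # layer 2
            [nn.Conv2d(1, num_filters, (fs, embed_dim)) for fs in filter_sizes])
        self.dropout = nn.Dropout(dropout)                         # layer 4
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):                     # x: (batch, seq_len) token ids
        x = self.embedding(x).unsqueeze(1)    # (batch, 1, seq_len, embed_dim)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)    # (batch, num_filters, L)
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))   # layer 3
        h = self.dropout(torch.cat(pooled, dim=1))   # long feature vector
        return self.fc(h)        # logits; softmax is applied inside the loss

# Cross-entropy loss (step (3)) applies softmax to the logits internally.
model = SemCNN(vocab_size=18758)
logits = model(torch.randint(0, 18758, (4, 56)))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
```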

Fig. 1. Word “program” with the meaning “a scheduled radio or television show” in CoDic.

Fig. 2. Word “program” with the meaning “a set of coded instructions for insertion into a machine” in CoDic.

Fig. 3. CoDic in XML.

4.3 Experiment on Machine Learning (ML) Approaches

The procedure for training classification models using classical machine learning algorithms with the proposed strategy is as follows; the details can also be found in our open source code.

  (1) Transform words into vectors based on the input texts (note: Chinese documents need word segmentation beforehand). Collect all words used in the texts, compute a frequency distribution, and then find effective features suitable for document classification by using the proposed strategies (SSC and SCM). After that, each text is converted into a long word vector, where True (or 1) means a word (or feature) is present while False (or 0) means it is absent.

  (2) Execute multiple classical machine learning approaches (e.g., Naïve Bayes, NB) based on the word vectors from Step (1). In this experiment, three variants of the NB classifier are used: original NB, multinomial NB and Bernoulli NB. All of them take word features and the corresponding category labels as input to train classification models. Note that the classifier sometimes has to be modified for realistic cases. For example, in order to avoid probabilities close to zero and the underflow problem in NB, it is better to initialize the frequency of each word to one and to take the natural log of the product when computing the posterior probability.

  (3) Save the trained classifiers for later usage. The training process might be time-consuming, depending on numerous factors such as the dataset size and the computational complexity of model training, so it is impractical to retrain classification models every time they are needed.

  (4) Boost multiple classifiers to create a voting system that is taken as a baseline for comparison. To do this, we build a composite classifier (i.e., VoteClassifier) from multiple basic classical machine learning classifiers (i.e., taking multiple basic classifier objects as input when initialized), each of which gets one vote. In the VoteClassifier, the classify method iterates through each basic classifier object and classifies the same input features; each classification is regarded as a vote, and after iterating over all the classifier objects, the most popular vote is returned. This experiment uses the most popular metric (accuracy) to evaluate the classifiers. (A sketch of the VoteClassifier is given after this list.)
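As referenced in step (4), the following is a minimal sketch of such a VoteClassifier built on NLTK's ClassifierI interface; the confidence helper is an illustrative addition, not part of the original description.

```python
# A sketch of the voting baseline from step (4); each trained classifier
# casts one vote and the most popular vote wins.
from statistics import mode
from nltk.classify import ClassifierI

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers       # trained NLTK-style classifiers

    def classify(self, features):
        votes = [c.classify(features) for c in self._classifiers]
        return mode(votes)                    # most popular vote

    def confidence(self, features):
        # Fraction of classifiers agreeing with the winning vote.
        votes = [c.classify(features) for c in self._classifiers]
        return votes.count(mode(votes)) / len(votes)
```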

4.4 Experiment Result and Analysis

In the actual testing process, we need to maintain a common synonymous word dictionary and a common polysemous word dictionary. The reason is that the computational workload of judging polysemy and synonymy in a long text is very heavy. For example, if there are n words in a text and each word has m different meanings, then the computational complexity of determining polysemous words is \(O(n*m)\) and the computational complexity of determining synonyms is \(O(n*(n-1))\), so the total computational complexity is \(O(n*(m+n-1))\geqslant O(n^2)\). Therefore, maintaining these two dictionaries reduces the computational complexity and the pre-processing time of text classification.

Table 3 shows the experimental comparison between classical machine learning algorithms and their improved counterparts on Dataset 1. In this experiment, classical machine learning algorithms include Original Naïve Bayes (NB), Multinomial Naïve Bayes (MNB), Bernoulli Naïve Bayes (BNB), Logistic Regression (LR), support vector machine (SVM) with stochastic gradient descent (SGD), Linear SVC (SVC) and Nu-Support Vector Classification (NSVC).

Table 3. Comparison of classical machine learning algorithms and our improved ones on Dataset 1

From Table 3, it is clear that our improved algorithms outperform the classical machine learning algorithms in the accuracy of model prediction on the evaluation dataset. Note that the three NB variants and LR perform better than the three SVM variants, for both the classical and the improved versions. The VoteClassifier plays the role of a baseline for the comparison between the different algorithms. Tables 4 and 5 show that \(Sem_{CNN}\) performs better than CNN in terms of accuracy and loss over different numbers of epochs. As the number of epochs increases, both models continuously increase in evaluation accuracy and decrease in loss (before reaching overfitting).

Table 4. Comparison of \(Sem_{CNN}\) and traditional CNN on Dataset 1.
Table 5. Comparison of \(Sem_{CNN}\) and traditional CNN on Dataset 2.

5 Conclusion

This paper introduces a novel semantic document classification approach. It makes two main improvements: (1) solving the polysemy problem by using a novel semantic similarity computing method (SSC). The SSC implements semantic analysis by executing semantic similarity computation and semantic embedding with the help of a common dictionary; in this paper, we use CoDic for English texts and Hownet for Chinese texts. (2) Solving the synonym problem by proposing a novel strong correlation analysis method (SCM). The SCM consists of the CDM strategy for selecting the feature candidate set and the FCA strategy for determining the final feature set. Experiments show that our strategy can improve the performance of semantic document classification compared with traditional approaches.

We will continue this line of research. More deep learning models (e.g., DualTextCNN, DualBiLSTM, DualBiLSTMCNN or BiLSTMAttention) will be tested for semantic text similarity on document datasets in different natural languages. We will also compare this strategy with state-of-the-art embedding methods such as FastText [10], BERT [4], ULMFiT [9] and ELMo [18].