
1 Introduction

Text classification aims to categorize a document into one of a set of predefined categories, denoted by \(\mathcal {Y}\), where a document is represented as a bag of words \(\mathcal {X}\), i.e., a feature vector \(x \in \mathbb {R}^{d \times 1}\) over d unique terms [1]. Text classification algorithms have been applied successfully in many big data driven text applications and services, such as spam filtering [2], tagging online news [3], social media analysis [4], bioscience [5] and chatbots [6]. Statistical learning based methods, such as the Support Vector Machine (SVM) and its family of algorithms, are widely used for text classification [7,8,9].

The typical workflow of a statistical learning based text classification system consists of four core components: text preprocessing, feature extraction, classifier training and classification [10]. Research [11] has shown that the sophistication of the feature selection process is critical to both the effectiveness and the efficiency of the text classifier.

Traditional text feature extraction methods use the TF-IDF algorithm [12] to extract text features, transforming a corpus of N text documents with a vocabulary of d terms into a d-dimensional vector space of N TF-IDF feature vectors. By representing each text document as a feature vector in this high dimensional vector space, a statistical feature based classifier, such as an SVM, can be trained for text classification through supervised learning. The TF-IDF based feature extraction method has several known limitations. For example, the conventional TF-IDF weighting factor pays no attention to synonyms and the semantic relations between them. Moreover, for a large corpus with tens of thousands of documents and terms, the feature vector space has very high dimensionality and is highly skewed and sparse. As a result, the computational complexity increases dramatically for large corpora.
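For reference, this conventional pipeline can be sketched as follows. It is a minimal illustration using scikit-learn's TfidfVectorizer and a linear SVM; the `docs` and `labels` variables are placeholder assumptions and not the corpus used in this paper.

```python
# Minimal sketch of the conventional TF-IDF + SVM baseline (not the method
# proposed in this paper). `docs` and `labels` are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["first document text", "second document text",
        "third document text", "fourth document text"]   # N raw documents
labels = [0, 1, 0, 1]                                     # category ids in Y

vectorizer = TfidfVectorizer()        # builds the d-dimensional TF-IDF space
X = vectorizer.fit_transform(docs)    # sparse N x d matrix of feature vectors

clf = LinearSVC().fit(X, labels)      # supervised training on the feature vectors
```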

To address these limitations, we propose an optimized feature extraction method for text classification by defining a new weighting factor based on the Inter-Category Distribution (ICD) of terms and documents, called TF-IDF-ICD. Based on this new weighting factor, we introduce a Dimensionality Reduction (DR) technique that confines the dimensionality of the feature vector space by limiting the Feature Terms to only the most critical ones, namely those terms with high TF-IDF-ICD values. Our TF-IDF-ICD optimized text classification system consists of three steps: (1) During text preprocessing, synonyms are merged and their statistical weights are combined; the TF-IDF-ICD factor of every term in each document of the corpus is then calculated, and the result forms a term feature vector that represents the corresponding document. (2) By filtering out low weight terms and selecting only the highly weighted terms as Feature Terms, we reduce the dimensionality of the feature vector space for the given corpus. (3) The resulting document feature vectors are fed into an SVM classifier for training. We evaluate the performance of the proposed optimization method for text classification using a well known corpus of 10,000 text documents. The experimental results show that our approach outperforms the conventional TF-IDF solution by \(7.84\%\) while requiring only about \(43.19\%\) of the original training time, demonstrating the effectiveness of the proposed method in achieving a better trade-off between classification accuracy and computational complexity.

2 Related Work

Text is one of the dominant data types in big data driven applications and services. A wide variety of techniques have been designed and studied for text classification. Document representation and feature selection are two of its fundamental core tasks.

The TF-IDF based method is one of the most popular methods for document representation. Given a corpus of N documents with a vocabulary of d terms, each document in the corpus is transformed into a d-dimensional feature vector x based on the TF-IDF weighting factor. Many feature selection methods have been proposed and compared in [13]. An important function of feature selection is to reduce the dimensionality of the term feature space through feature transformation methods, which create a new, smaller set of features as a function of the original set. A typical example of such a feature transformation method is Latent Semantic Indexing (LSI) and its probabilistic variant PLSA [14]. Once the feature space is determined, the document vectors are fed into a chosen text classifier. As noted in [15], text data is ideally suited for SVM classification because of its sparse, high-dimensional nature.

Most conventional TF-IDF based methods lack consideration of semantic features, such as synonyms and their relations. Several independent research efforts have incorporated semantic features. For example, Huang et al. [16] proposed an improved method based on the feature term vectors extracted from TF-IDF and analyzed the semantic similarity of feature terms with external dictionaries. Zhu et al. [17] proposed a method that uses the word2vec model to calculate the similarity between words, which addresses the problem that low-frequency words with high class discrimination are ignored by statistical weights. Qu et al. [18] considered the relations between features and classes, highlighting the problem that traditional TF-IDF algorithms ignore the inter-category distribution of feature words.

Our approach is inspired by these existing efforts. Concretely, we propose three optimizations to enhance the efficiency of TF-IDF based feature extraction and text classification: (1) We merge synonyms in the text preprocessing step. (2) We compute the inter-category distribution of terms and the inter-category distribution of documents, and use the product of these two ICD factors to define the TF-IDF-ICD weighting factor as an optimization for feature extraction. (3) We introduce a tunable threshold control knob to perform dimensionality reduction on the TF-IDF-ICD vector space. Our experiments show the effectiveness of our approach compared to the existing methods.

3 Feature Extraction Optimization

3.1 Text Preprocessing with Synonym Fusion

Text preprocessing is the first step. We remove formatting and symbols, and perform word segmentation and semantic labeling of the text documents using the open source HanLP [19] toolkit. Less significant terms are then filtered out for simplicity and noise reduction. In this work, adverbs, locative words, number words, auxiliary words, etc., are regarded as less significant terms. Although some solutions [20] remove the names of persons, places and organizations during text preprocessing, we argue that such terms have a strong association with categories, for example the relation between “Taylor Swift” and the music category, or between “Yao Ming” and the sports category. Thus, we keep such terms instead of removing them. Our experiments show that the classification accuracy is higher when these special terms are retained.

Synonyms may dilute the statistical features of a document. We propose to combine synonyms into one term for feature calculation. Synonyms in the text documents are identified according to a dictionary, in this work the “HIT-SCIR Synonym Dictionary (Extended Edition)” [21], which contains 77,343 terms organized in a five-level classification structure that forms a tree, where each leaf node represents a group of terms. Three possible relationships are defined among these terms: “\(=\)” stands for synonyms, “\(\#\)” stands for relevant terms and “@” stands for independent terms. We replace the terms marked with “\(=\)” with the first term of their group and sum their term frequencies, thereby achieving synonym fusion.
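The fusion step can be sketched as follows, assuming the synonym groups have already been loaded from the dictionary into a mapping from each term to the canonical head term of its group; the mapping entries and function name are purely illustrative.

```python
from collections import Counter

# Hypothetical synonym table: every term marked "=" in a dictionary group maps
# to the first (canonical) term of that group; building this table from the
# HIT-SCIR dictionary file is assumed to happen elsewhere.
SYNONYM_HEAD = {"歌手": "歌唱家", "演唱者": "歌唱家"}   # illustrative entries only

def fuse_synonyms(tokens):
    """Replace each synonym with its group head and sum the term frequencies."""
    counts = Counter()
    for tok in tokens:
        counts[SYNONYM_HEAD.get(tok, tok)] += 1
    return counts
```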

3.2 Text Feature Extraction with TF-IDF-ICD

In order to incorporate semantic features into text classification, we advocate a new weighting factor, TF-IDF with the Inter-Category Distributions (ICD) of both terms and documents, to replace the conventional TF-IDF.

Let \(\mathbb {D}=\{d_1, d_2, d_3,\cdots ,d_i,\cdots ,d_N\}\) be the corpus of N documents, where each document \(d_i\) belongs to a category \(C_j\) (\(1 \le j \le m\)) and m is the total number of categories. A document \(d_i=\{W_{i1}, W_{i2}, \cdots , W_{ik}\}\) has k terms, where \(W_{ik}\) is the k-th feature term of document \(d_i\). The TF-IDF-ICD weighting factor is calculated using Eq. 1.

$$\begin{aligned} w(W_{ik})=TF(W_{ik})\times (IDF(W_{ik})+ \alpha \times ICD(W_{ik})) \end{aligned}$$
(1)

where \(\alpha \) (\(\alpha \ge 0\)) is a weighting factor used to tune the weight ratio of the inter-category distributions to the inverse document frequency.

Let \(n_{d_i,W_{ik}}\) denote the raw count of term \(W_{ik}\) appearing in document \(d_i\) and \(D_{W_{ik}}\) denote the number of documents that contain term \(W_{ik}\) in the corpus. We define the term frequency adjusted by the document length and the inverse document frequency in Eqs. 2 and 3 respectively.

$$\begin{aligned} TF(W_{ik})=\frac{n_{d_i,W_{ik}}}{\sum _k {n_{d_i,W_{ik}}}} \end{aligned}$$
(2)
$$\begin{aligned} IDF(W_{ik})=\log \frac{N}{D_{W_{ik}}+1} \end{aligned}$$
(3)
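A direct transcription of Eqs. 2 and 3 can be sketched as follows, assuming each document has already been tokenized (and synonym-fused) into a list of terms; it is written for clarity rather than efficiency.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Eq. 2: raw count of `term` in the document, normalized by document length."""
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())

def idf(term, corpus_tokens):
    """Eq. 3: log of N over (number of documents containing `term`, plus 1)."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (doc_freq + 1))
```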

Before defining the extended weighting factor \(ICD(W_{ik})\), we first define two concepts: \(ICDT(W_{ik})\) and \(ICDD(W_{ik})\).

\(ICDT(W_{ik})\) describes the occurrence weight of term \(W_{ik}\). It represents the inter-category distribution for term \(W_{ik}\) and is defined in Eq. 4. A higher weight indicates that term \(W_{ik}\) has a higher category preference.

$$\begin{aligned} ICDT(W_{ik})=\sqrt{\frac{n_{C_j,W_{ik}}}{n_{W_{ik}}+1}} \end{aligned}$$
(4)

where \(n_{C_j,W_{ik}}\) is the raw count of term \(W_{ik}\) appearing in category \(C_j\) and \(n_{W_{ik}}\) is the raw count of term \(W_{ik}\) appearing in the corpus.

\(ICDD(W_{ik})\) describes the occurrence weight of documents in which the term \(W_{ik}\) appears. It is the inter-category distribution for a document and is defined in Eq. 5. Similarly, a higher weight indicates that term \(W_{ik}\) has a higher category preference.

$$\begin{aligned} ICDD(W_{ik})=\log \frac{(D_{C_j,W_{ik}}+1)\times N}{D_{W_{ik}}+1} \end{aligned}$$
(5)

where \(D_{C_j,W_{ik}}\) is the number of documents that contain term \(W_{ik}\) in category \(C_j\), and \(D_{W_{ik}}\) is the number of documents that contain term \(W_{ik}\) in the corpus.

The extended weighting factor \(ICD(W_{ik})\) is defined as a product of two inter-category distributions, the ICD for the term and the ICD for the document, in Eq. 6.

$$\begin{aligned} ICD(W_{ik})=ICDT(W_{ik})\times ICDD(W_{ik}) \end{aligned}$$
(6)
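Putting Eqs. 1 and 4-6 together, the weighting of a single term can be sketched as follows, assuming the per-category and per-corpus counts have been pre-computed; the function and parameter names are our own illustration rather than part of the original formulation.

```python
import math

def icdt(n_term_in_category, n_term_in_corpus):
    """Eq. 4: inter-category distribution of the term."""
    return math.sqrt(n_term_in_category / (n_term_in_corpus + 1))

def icdd(docs_with_term_in_category, docs_with_term_in_corpus, n_docs):
    """Eq. 5: inter-category distribution of the documents containing the term."""
    return math.log((docs_with_term_in_category + 1) * n_docs
                    / (docs_with_term_in_corpus + 1))

def tf_idf_icd(tf_val, idf_val, icdt_val, icdd_val, alpha):
    """Eq. 1 combined with Eq. 6: w = TF * (IDF + alpha * ICDT * ICDD)."""
    return tf_val * (idf_val + alpha * icdt_val * icdd_val)
```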

3.3 Dimensionality Reduction

After we transform all documents in the corpus into feature vectors based on the new TF-IDF-ICD weighting factor, the next step is to perform Dimensionality Reduction by limiting the number of Feature Terms used for text classification. Based on the TF-IDF-ICD value of every term in a document, and thus in the corresponding feature vector, we select only the terms with the top \(\mu \%\) highest TF-IDF-ICD values as the critical Feature Terms. This operation reduces the dimensionality of the feature vectors of the document corpus. The union of the selected Feature Terms from all documents forms the feature vector space that represents the corpus. Consequently, the TF-IDF-ICD values of the critical Feature Terms of a document are used to represent that document. Our experiments show that an adequate reduction of the number of Feature Terms has little effect on classification accuracy, while significantly reducing the time complexity of text classification.
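A minimal sketch of this selection step is given below, assuming each document is represented as a mapping from terms to their TF-IDF-ICD values; the helper names are hypothetical.

```python
def select_feature_terms(doc_weights, mu_percent=20):
    """Keep the top mu_percent% highest TF-IDF-ICD terms of one document.

    `doc_weights` maps each term of the document to its TF-IDF-ICD value.
    """
    k = max(1, int(len(doc_weights) * mu_percent / 100))
    ranked = sorted(doc_weights, key=doc_weights.get, reverse=True)
    return set(ranked[:k])

def build_feature_space(all_doc_weights, mu_percent=20):
    """Union of the per-document Feature Terms defines the reduced vector space."""
    space = set()
    for doc_weights in all_doc_weights:
        space |= select_feature_terms(doc_weights, mu_percent)
    return sorted(space)
```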

3.4 Text Classification with TF-IDF-ICD and DR Optimization

After the optimized feature extraction with TF-IDF-ICD and the ICD-based dimensionality reduction, we feed the optimized feature vectors into the text classifier. We use the linear SVM algorithm to train the classifier and K-fold cross-validation to evaluate the classification performance.
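This final step can be sketched with scikit-learn as follows; using LinearSVC with macro-averaged F1 as the scoring function is an assumption about how the F-Measure in Sect. 4 might be computed, not a statement of the exact implementation.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate(X, y, k=10):
    """Train a linear SVM and report the k-fold cross-validated F-Measure.

    X is the N x |feature space| matrix of TF-IDF-ICD values over the selected
    Feature Terms (zero where a term is absent from a document); y holds the
    category labels.
    """
    clf = LinearSVC()
    scores = cross_val_score(clf, X, y, cv=k, scoring="f1_macro")
    return scores.mean()
```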

4 Experimental Evaluation

The corpus used in this paper consists of labeled news reports. We collected 10,000 articles from Sina News, labeled with 10 categories and 1,000 articles per category. In the experiments, we use 10-fold cross-validation to assess the classification results.

Recall from Sect. 3.2 and Eq. 1 that the parameter \(\alpha \) controls the weight of the ICD component relative to the inverse document frequency. Another important parameter in our approach is the dimensionality reduction (DR) threshold \(\mu \%\). Both \(\alpha \) and \(\mu \%\) are tunable parameters in the experiments. To determine the best value of \(\alpha \), we fix \(\mu \%\) at \(20\%\) and vary \(\alpha \) from 0 to 4 in increments of 0.1. The result is shown in Fig. 1. The overall trend of the F-Measure increases as \(\alpha \) grows from 0 to 4. When \(\alpha \) reaches 1.4, the F-Measure attains its maximum value of 0.9490. As \(\alpha \) continues to increase from 1.5 to 4.0, the F-Measure first decreases slightly and then remains relatively stable. On the Sina News corpus, with \(\alpha \) set to 1.4 the F-Measure is increased by 0.0237 compared to the conventional TF-IDF method without the ICD weighting factor (i.e., \(\alpha = 0\), F-Measure: 0.9253).
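The \(\alpha \) sweep can be reproduced as a simple grid search; `build_features` and `evaluate` below are hypothetical hooks into the pipeline sketched in Sect. 3, not functions defined in this paper.

```python
def tune_alpha(build_features, evaluate, alphas=None):
    """Grid search over alpha with mu% held fixed at 20%.

    `build_features(alpha)` is assumed to return the (X, y) pair produced by
    the TF-IDF-ICD pipeline for a given alpha; `evaluate(X, y)` returns the
    cross-validated F-Measure.
    """
    if alphas is None:
        alphas = [round(i * 0.1, 1) for i in range(41)]   # 0.0, 0.1, ..., 4.0
    return max(alphas, key=lambda a: evaluate(*build_features(a)))
```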

\(\mu \%\) represents the portion of Feature Terms retained. A higher \(\mu \%\) results in a larger feature space dimension, which in turn leads to higher computational complexity. We set \(\alpha \) to 1.4, which produced the best F-Measure in the previous experiment, and report the F-Measure results obtained by varying \(\mu \%\) from \(5\%\) to \(100\%\) in steps of \(5\%\). Figure 2 shows the results.

Fig. 1. F-Measure vs \(\alpha \) (\(\mu \%=20\%\)).

Fig. 2. F-Measure vs \(\mu \%\) (\(\alpha =1.4\)).

We observe that the F-Measure increases rapidly as \(\mu \%\) grows to \(20\%\), while the gains become much smaller as \(\mu \%\) continues to increase; at the same time, the feature space dimension keeps growing quickly with \(\mu \%\) (see Fig. 3). Hence, \(\mu \%=20\%\) is chosen as a good trade-off between high accuracy and low complexity.

We next compare the proposed TF-IDF-ICD method, with and without DR (\(\alpha = 1.4\), \(\mu \% = 20\%\)), against the traditional TF-IDF based method. For a fair comparison, we also apply Dimensionality Reduction based on TF-IDF (\(\mu \% = 20\%\)) when comparing against our approach with DR. Table 1 shows the experimental results.

Table 1. Comparison between the improved method and TF-IDF.

As shown in the table, the proposed TF-IDF-ICD method achieves better accuracy than the traditional TF-IDF based method, whether or not DR is used. Moreover, the accuracy of the proposed TF-IDF-ICD method is less affected by DR than that of the TF-IDF based method. When DR is used in both methods, the proposed method improves the F-Measure by \(7.84\%\) over the TF-IDF based method, and improves precision and recall by \(6.69\%\) and \(8.96\%\) respectively.

Using the dimension and training time of the non-DR solution (i.e., \(\mu \%=100\%\)) as the baseline, we also investigate the relative dimension of the vector space and the relative training time of the classifier. Figure 3 shows that the relative dimension increases exponentially with \(\mu \). When \(\mu \%=20\%\), the dimension is about \(\frac{1}{20}\) of the non-DR value. Figure 4 shows that the relative training time decreases significantly with smaller \(\mu \). When \(\mu \%=20\%\), text classification using TF-IDF-ICD with DR reduces the training time to \(43.19\%\) of that of the non-DR solution.

Fig. 3. Relative dimension vs \(\mu \%\) (\(\alpha =1.4\), non-DR solution as baseline).

Fig. 4. Relative training time vs \(\mu \%\) (\(\alpha =1.4\), non-DR solution as baseline).

It is worth noting that the ICD integration introduces additional memory consumption and runtime overhead during document preprocessing. However, based on our measurements, this overhead is almost negligible compared to the original TF-IDF computation cost.

5 Conclusion

We have presented an efficient text classification scheme with three optimizations: (1) synonym fusion in text preprocessing, (2) enhanced feature extraction based on the Inter-Category Distributions (ICD) of both terms and documents, and (3) dimensionality reduction based on threshold-controlled selection of critical TF-IDF-ICD features. We evaluate the performance of the proposed approach for text classification using a corpus of 10,000 text documents. The experimental results (\(\alpha = 1.4\), \(\mu \% = 20\%\)) show that our method outperforms the conventional TF-IDF based solution by \(7.84\%\) while requiring only about \(43.19\%\) of the training time of the TF-IDF based method.