SVOIS: Support Vector Oriented Instance Selection for text classification
Introduction
The number and size of online information collections are increasing rapidly, making text classification (or categorization) one of the major techniques for managing large-scale text repositories. The aim of text classification is to automatically classify documents into a fixed set of pre-defined categories: text documents are first processed by natural language processing techniques, and each document is then represented by a d-dimensional feature vector. Specific classification techniques, such as support vector machines, can then be used to learn a model and classify text documents [18], [20], [27], [31].
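As a concrete illustration, below is a minimal sketch of this pipeline in Python with scikit-learn; the library choice, the toy corpus, and the category labels are ours, not the paper's.

```python
# Minimal text classification pipeline: documents -> tf-idf vectors -> SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "the striker scored a late goal",
    "the parliament passed the budget bill",
    "the goalkeeper saved a penalty",
    "the senate debated the new tax law",
]
labels = ["sports", "politics", "sports", "politics"]

vectorizer = TfidfVectorizer()       # each document becomes a d-dimensional tf-idf vector
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

clf = LinearSVC().fit(X, labels)     # train the classifier on the feature vectors
print(clf.predict(vectorizer.transform(["the striker scored in the final match"])))
```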
Data pre-processing is one of the most critical steps in data mining and knowledge discovery in databases (KDD), performed to ensure good quality data mining. Feature selection (or dimensionality reduction) has been extensively studied in the text classification literature; for examples, see [10], [32]. This is because the dimensionality of the extracted textual features that represent text documents (e.g., tf-idf) is usually very large, say 10,000.
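One common way to perform such feature selection on tf-idf vectors is a filter method such as the chi-square test. The sketch below uses scikit-learn; the corpus, the labels, and the value of k are illustrative assumptions.

```python
# Chi-square feature selection on tf-idf vectors: keep only the k terms
# most correlated with the class labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap flights and hotel deals", "win the match in extra time",
        "book a discount holiday package", "the team lost the final game"]
labels = [0, 1, 0, 1]

X = TfidfVectorizer().fit_transform(docs)            # high-dimensional tf-idf matrix
X_reduced = SelectKBest(chi2, k=5).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)                # dimensionality drops to k
```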
However, the size of today's text collections often exceeds what current software and/or hardware can handle properly. Despite this, few studies have focused on instance selection (or data reduction) for text classification. Using too many instances (i.e., documents) can result in large memory requirements and slow execution speed, and can cause over-sensitivity to noise [21], [30]. Furthermore, one problem with using the original data points is that there may not be any located at the precise points that would make for the most accurate and concise concept description [23].
In addition to feature selection, instance selection (or data reduction) is another important data pre-processing step in the KDD process. The aim of instance selection is to reduce the data size by filtering out noisy data from a given dataset, which would otherwise increase the likelihood of degrading the mining performance. In particular, instance selection shrinks the amount of data so that data mining algorithms can be applied to the reduced dataset. Satisfactory results can be achieved if the selection strategy is appropriate [26].
Outlier detection involves finding observations that lie an abnormal distance from the other values in a random sample from a given population. Outliers have traditionally been defined as unusual observations (or bad data points) that are far removed from the mass of the data [1], [3]. Consequently, classifiers trained on a selected subset of the original instances can provide relatively good performance. Outlier detection is also a critical KDD task [19], and filtering out the detected outliers is very useful for obtaining good mining results. From the data mining perspective, the aim of instance selection is similar to that of outlier detection [22].
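For intuition, the following is a minimal sketch of one simple distance-based notion of "abnormal distance": instances lying farther from their class center than the per-class mean distance plus two standard deviations are dropped. The 2-sigma cutoff and the synthetic data are arbitrary assumptions, not a method from the paper.

```python
# Simple distance-based outlier filter over a labeled dataset.
import numpy as np

def filter_outliers(X, y, n_std=2.0):
    """Flag instances lying abnormally far from their own class center."""
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        center = X[idx].mean(axis=0)                    # class center
        dist = np.linalg.norm(X[idx] - center, axis=1)  # distance to center
        keep[idx[dist > dist.mean() + n_std * dist.std()]] = False
    return keep

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
mask = filter_outliers(X, y)
print(f"kept {mask.sum()} of {len(X)} instances")
```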
In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed for text classification. SVOIS borrows its central idea from support vector machines (SVM). The support vectors in an SVM are used for binary classification decisions [28]. Specifically, given a training dataset, each training vector is associated with one of two different classes. During the training stage, the input vectors are mapped into a new, higher dimensional feature space. Then, an optimal separating hyperplane is constructed in the new feature space. All vectors lying on one side of the hyperplane are regarded as class 1, and all vectors lying on the other side as class 2. The training instances that lie closest to the hyperplane in the transformed space are called support vectors. The number of these support vectors is usually small compared to the size of the training set, and they determine the margin of the hyperplane, and thus the decision surface. In order to produce good generalization, the SVM maximizes the margin of the hyperplane and keeps the number of support vectors small.
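This property is easy to observe in practice. The sketch below, using scikit-learn's SVC on synthetic two-class data of our own making, shows that only a small fraction of the training instances end up as support vectors.

```python
# Inspecting the support vectors of a trained SVM on synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),   # class 1
               rng.normal(+2.0, 1.0, size=(50, 2))])  # class 2
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear").fit(X, y)
# Only the instances closest to the separating hyperplane become support vectors.
print(f"{len(svm.support_vectors_)} support vectors out of {len(X)} training instances")
```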
However, the major limitations of the SVM are speed and size, both in training and testing [6], [7]. In other words, the computational cost necessary to identify the hyperplane and support vectors in a new and very high dimensional feature space is excessive.
Unlike the SVM, SVOIS attempts to find the support vectors in the original feature space through a linear regression plane, where the instances to be selected as support vectors need to satisfy two criteria. The first is that the distances between the original instances and their class centers must be smaller than a pre-defined value; the instances fulfilling this criterion are regarded as the regression data and are used to identify a regression plane. The second criterion is based on the distances between the regression data and the regression plane, which play a role similar to the margin in the SVM; these distances must be larger than a pre-defined value. The regression data fulfilling this second criterion are called support vectors and are used for classifier training and classification. The two distance thresholds must therefore be tuned with care: set too loosely, nearly all instances are selected; set too strictly, very few support vectors remain (cf. Section 3).
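To make the two criteria concrete, here is a rough sketch of the selection procedure as described above. The threshold values, the use of the class label as the regression target, and all other implementation details are our assumptions for illustration, not the authors' actual algorithm or code.

```python
# A rough sketch of the two SVOIS selection criteria described above.
import numpy as np
from sklearn.linear_model import LinearRegression

def svois_select(X, y, d1, d2):
    # Criterion 1: keep instances whose distance to their class center is
    # below d1; these form the "regression data".
    is_reg = np.zeros(len(X), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        center = X[idx].mean(axis=0)
        is_reg[idx[np.linalg.norm(X[idx] - center, axis=1) < d1]] = True
    reg_idx = np.where(is_reg)[0]

    # Fit a linear regression plane over the regression data
    # (assumes d1 is large enough that some instances survive criterion 1).
    plane = LinearRegression().fit(X[reg_idx], y[reg_idx])
    w, b = plane.coef_, plane.intercept_

    # Criterion 2: keep regression data lying farther than d2 from the plane
    # (point-to-plane distance in the joint feature/target space).
    dist = np.abs(X[reg_idx] @ w + b - y[reg_idx]) / np.sqrt(w @ w + 1.0)
    return reg_idx[dist > d2]

X = np.random.randn(200, 10)
y = np.random.randint(0, 2, size=200).astype(float)
support = svois_select(X, y, d1=4.0, d2=0.3)   # thresholds are arbitrary here
print(f"selected {len(support)} support vectors from {len(X)} instances")
```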
To the best of our knowledge, SVOIS is the first method to filter out redundant instances (i.e., documents) for text classification. Current state-of-the-art instance selection techniques (cf. Section 2) have only been assessed for classification performance over datasets of very low dimensionality, and they perform poorly in the context of text classification. In contrast, using SVOIS to filter out unimportant documents from the given training dataset allows two well-known classifiers (i.e., SVM and k-NN) to perform better than both their counterparts trained without instance selection and those trained after state-of-the-art instance selection techniques.
The rest of this paper is organized as follows. Section 2 provides an overview of the related literature, including the concept of instance selection and four well-known instance selection algorithms: ENN, IB3, DROP3, and ICF. The proposed SVOIS method for text classification is introduced in Section 3. In Section 4, the experimental results based on a public text classification dataset are presented. Finally, some conclusions are offered in Section 5.
Section snippets
Instance selection
Instance selection can be defined as follows. Given a dataset D composed of a training set T and a testing set U, let x_i be the ith instance in D, where x_i = (x_i1, x_i2, …, x_im) contains m different features. Let S ⊆ T be the subset of selected instances resulting from the execution of an instance selection algorithm. Then, U is used to test a classification technique trained by S [8], [12].
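The evaluation protocol implied by this definition can be sketched as follows: a classifier is trained on the selected subset S of T and evaluated on the untouched test set U. The random data and the random stand-in selector below are placeholders for a real corpus and a real instance selection algorithm.

```python
# Train on the selected subset S of T, test on the held-out set U.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.randn(300, 20), np.random.randint(0, 2, size=300)
X_T, X_U, y_T, y_U = train_test_split(X, y, test_size=0.3, random_state=0)

S = np.random.rand(len(X_T)) < 0.5           # stand-in for an instance selector
clf = KNeighborsClassifier().fit(X_T[S], y_T[S])
print("accuracy on U:", clf.score(X_U, y_U))
```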
There are a number of studies in the literature related to proposing instance selection methods to obtain better
Support Vector Oriented Instance Selection
In this section, we describe our proposed instance selection method, namely Support Vector Oriented Instance Selection (SVOIS). SVOIS is different from the related selection techniques described in Section 2, which are mainly designed for the k-NN classification method. In SVOIS, there are four steps for the selection of support vectors, i.e., important and representative instances. Given a training dataset T containing a set of n data samples, let x_i be the ith instance in T, where i=1, 2,…, n
The dataset
The dataset used to evaluate the performance of SVOIS is based on the TechTC-100 dataset [11], [14]. It includes 100 different two-class datasets, in which the largest and smallest datasets contain 165 and 125 documents, respectively. On average, each pair contains 149 documents. In addition, it ranges from pairs of easy categories, such as Games/Video_Games/Shooter and Recreation/Autos/Makes_and_Models/Volkswagen, to medium difficulty ones with Arts/Music/Bands_and_Artists vs. Arts/Celebrities,
Conclusion
In this paper, we introduce a novel instance selection method, namely Support Vector Oriented Instance Selection (SVOIS), for text classification. This approach is simple and computationally efficient enough to filter out redundant data from a given training dataset. SVOIS is compared with four state-of-the-art algorithms, including ENN, IB3, ICF, and DROP3, using a public text document corpus. We observe that the state-of-the-art algorithms do not select instances well from high dimensional data, which
References (31)
- X.-B. Li, V.S. Jacob, Adaptive data reduction for large-scale transaction data, European Journal of Operational Research (2008)
- C.C. Aggarwal, P.S. Yu, Outlier detection for high dimensional data, Proceedings of the ACM SIGMOD Conference (2001)
- D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine Learning (1991)
- V. Barnett, T. Lewis, Outliers in Statistical Data (1994)
- J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic, Interaction of feature selection methods and linear...
- H. Brighton, C. Mellish, Advances in instance selection for instance-based learning algorithms, Data Mining and Knowledge Discovery (2002)
- C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
- H. Byun, S.-W. Lee, A survey on pattern recognition applications of support vector machines, International Journal of Pattern Recognition and Artificial Intelligence (2003)
- J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction: an experimental study, IEEE Transactions on Evolutionary Computation (2003)
- T.F. Cox, M.A.A. Cox, Multidimensional Scaling (2001)
- J. Derrac, S. García, F. Herrera, A survey on evolutionary instance selection and generation, International Journal of Applied Metaheuristic Computing (2010)
- G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research (2003)
- E. Gabrilovich, S. Markovitch, Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5, International Conference on Machine Learning (2004)
Cited by (29)
- Instance selection in medical datasets: A divide-and-conquer framework
  2021, Computers and Electrical Engineering
  Citation excerpt: "According to García et al. [19], these algorithms are hybrid methods, which can provide relatively large data reduction rates and allow the classifier to perform better than the ones without instance selection. In addition, other related works also use IB3 and DROP3 as the representative baselines in their studies, such as Tsai and Chang [2], Tsai et al. [4], Huang et al. [7], and Tsai and Chen [9]. Therefore, DCIS based on IB3 and DROP3 are compared with the original IB3 and DROP3 algorithms without considering the DCIS approach."

- A recent overview of the state-of-the-art elements of text classification
  2018, Expert Systems with Applications

- Instance selection for regression: Adapting DROP
  2016, Neurocomputing
  Citation excerpt: "Noisy instances have a profound impact on how instances are ordered at the beginning of these algorithms. A noise instance means that its neighbours are considered part of the class boundary, and they can be kept in the selected set that has been filtered even after the noisy instance has been removed [39]. Decremental Reduction Optimization Procedure 1 (DROP1)."

- An Advanced Multi Class Instance Selection based Support Vector Machine for Text Classification
  2015, Procedia Computer Science

- Robust and stable feature selection by integrating ranking methods and wrapper technique in genetic data classification
  2014, Biochemical and Biophysical Research Communications
  Citation excerpt: "In this condition, detection of minority class will be incorrect. To solve this problem, various methods have been proposed [21,22], An approach is to add samples to data points of the minority class. The method which is proposed in this paper to solve the problem of imbalanced classes based on the removal of the data points of the majority class."

- Evolutionary instance selection for text classification
  2014, Journal of Systems and Software
  Citation excerpt: "The performances of related instance selection algorithms over high dimensional text datasets have not fully examined. On the other hand, Tsai and Chang (2013) propose a novel instance selection algorithm for text classification. It is based on the idea of support vector machines where the selected instances are determined by the 'support vectors' lying on the hyperplane."