
Information Systems

Volume 38, Issue 8, November 2013, Pages 1070-1083

SVOIS: Support Vector Oriented Instance Selection for text classification

https://doi.org/10.1016/j.is.2013.05.001

Highlights

  • A novel Support Vector Oriented Instance Selection (SVOIS) approach is introduced.

  • SVOIS is specifically designed for high-dimensional text classification.

  • SVOIS outperforms state-of-the-art instance selection algorithms.

  • In addition, state-of-the-art algorithms perform poorly at high-dimensional data reduction.

Abstract

Automatic text classification is usually based on models constructed through learning from training examples. However, as the size of text document repositories grows rapidly, the storage requirements and computational cost of model learning become ever higher. Instance selection is one solution for overcoming this limitation. Its aim is to reduce the amount of data by filtering out noisy data from a given training dataset. A number of instance selection algorithms have been proposed in the literature, such as ENN, IB3, ICF, and DROP3. However, all of these methods were developed for the k-nearest neighbor (k-NN) classifier. In addition, their performance has not been examined in the text classification domain, where the dimensionality of the dataset is usually very high. Support vector machines (SVM) are a core text classification technique. In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed. First, a regression plane in the original feature space is identified by utilizing a threshold distance between the given training instances and their class centers. Then, another threshold distance, between the identified data (forming the regression plane) and the regression plane, is used to decide on the support vectors among the selected instances. The experimental results based on the TechTC-100 dataset show the superior performance of SVOIS over other state-of-the-art algorithms. In particular, using SVOIS to select text documents allows the k-NN and SVM classifiers to perform better than without instance selection.

Introduction

The number and size of online information collections are increasing rapidly, meaning that text classification (or categorization) has become one of the major techniques for managing large-scale text repositories. The aim of text classification is to automatically classify documents into a fixed set of pre-defined categories: text documents are first processed by natural language processing techniques, each document is then represented by a d-dimensional feature vector, and finally specific classification techniques, such as support vector machines, are used to learn models and classify text documents [18], [20], [27], [31].
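
As an illustration of this pipeline, the sketch below (using the scikit-learn library; the toy corpus and labels are placeholder assumptions, not data from this study) turns raw documents into tf-idf feature vectors and trains a linear SVM:

```python
# Minimal text classification pipeline: documents -> feature vectors -> SVM.
# The toy corpus and labels are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the quick brown fox", "stock markets fell sharply",
        "the lazy dog sleeps", "shares rallied on earnings"]
labels = [0, 1, 0, 1]  # 0 = animals, 1 = finance (pre-defined categories)

# Each document becomes a d-dimensional tf-idf feature vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# A linear SVM learns a classification model from the feature vectors.
clf = LinearSVC().fit(X, labels)
print(clf.predict(vectorizer.transform(["fox and dog"])))
```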

Data pre-processing is one of the most critical steps in data mining and knowledge discovery in databases (KDD), performed to ensure good-quality data mining. Feature selection (or dimensionality reduction) has been extensively studied in the text classification literature; for examples, see [10], [32]. This is because the dimensionality of the extracted textual features that represent text documents (e.g., tf-idf) is usually very large, on the order of 10,000.
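
To make this concrete, the following sketch (scikit-learn again; the corpus, labels, and the choice k=4 are illustrative assumptions) keeps only the k features that score highest under the chi-square test, a common feature selection metric in this literature:

```python
# Chi-square feature selection on tf-idf features: keep the k terms
# whose occurrence depends most strongly on the class label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online", "meeting agenda attached",
        "win money now", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate (toy categories)

X = TfidfVectorizer().fit_transform(docs)  # ~10,000 columns in practice
X_reduced = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)
```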

However, the size of today's text collections often exceeds what current software and/or hardware can handle properly. Despite this, few studies have focused on instance selection (or data reduction) for text classification. If too many instances (i.e., documents) are used, the result can be large memory requirements, slow execution speed, and over-sensitivity to noise [21], [30]. Furthermore, one problem with using the original data points is that there may not be any located at the precise points that would make for the most accurate and concise concept description [23].

In addition to feature selection, instance selection (or data reduction) is another important data pre-processing step in the KDD process. The aim of instance selection is to reduce the data size by filtering out noisy data from a given dataset, which would otherwise increase the likelihood of degrading the mining performance. In particular, instance selection is used to shrink the amount of data, so data mining algorithms can be applied to the reduced dataset. Sufficient results can be achieved if the selection strategy is appropriate [26].

Outlier detection involves finding observations that lie an abnormal distance from the other values in a random sample from a given population. Outliers have traditionally been defined as unusual observations (or bad data points) that are far removed from the mass of the data [1], [3]. Outlier detection is also a critical KDD task [19], and filtering out the detected outliers is very useful for obtaining good mining results. From the data mining perspective, the aim of instance selection is similar to that of outlier detection [22]; consequently, classifiers trained on a selected subset of the original instances can provide relatively good performance.

In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed for text classification. SVOIS borrows the central idea of support vector machines (SVM). The support vectors in an SVM are used for binary classification decisions [28]. Specifically, given a training dataset, each training vector is associated with one of two different classes. During the training stage, the input vectors are mapped into a new, higher dimensional feature space, and an optimal separating hyperplane is constructed in this new space. All vectors lying on one side of the hyperplane are regarded as class 1, and all vectors lying on the other side as class 2. The training instances that lie closest to the hyperplane in the transformed space are called support vectors. The number of support vectors is usually small compared to the size of the training set, and they determine the margin of the hyperplane, and thus the decision surface. In order to generalize well, the SVM maximizes the margin of the hyperplane while keeping the number of support vectors small.
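
For instance, in the sketch below (scikit-learn, with a synthetic two-class dataset standing in for real training data), the fitted model exposes exactly these support vectors, typically a small fraction of the training set:

```python
# After fitting a linear-kernel SVM, only the training instances
# closest to the separating hyperplane remain as support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Usually far fewer support vectors than training instances.
print(len(clf.support_vectors_), "of", len(X))
```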

However, the major limitations of the SVM are speed and size, both in training and testing [6], [7]. In other words, the computational cost necessary to identify the hyperplane and support vectors in a new and very high dimensional feature space is excessive.

Unlike the SVM, SVOIS attempts to find the support vectors in the original feature space through a linear regression plane, where an instance is selected as a support vector only if it satisfies two criteria. The first criterion is that the distance between an original instance and its class center must be smaller than a pre-defined value; the instances fulfilling this criterion are regarded as the regression data and are used to identify a regression plane. The second criterion is based on the distances between the regression data and the regression plane, which play a role like that of the margin in the SVM: these distances must be larger than a pre-defined value. The regression data fulfilling this second criterion are called support vectors and are used for classifier training and classification. The two threshold distances must be tuned so that neither all instances are selected nor only very few support vectors remain (cf. Section 3).
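
A minimal sketch of this two-criteria selection follows (NumPy/scikit-learn; the function and threshold names are ours, the threshold directions follow the description above, and the exact distance measures and threshold settings are those of Section 3):

```python
# Sketch of the two SVOIS selection criteria described above.
# center_dist_max and plane_dist_min are the two pre-defined
# thresholds; all names here are illustrative, not the paper's.
import numpy as np
from sklearn.linear_model import LinearRegression

def svois_select(X, y, center_dist_max, plane_dist_min):
    # Criterion 1: keep instances close to their class center;
    # these form the regression data. y holds numeric labels (e.g., 0/1).
    centers = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    d_center = np.array([np.linalg.norm(x - centers[c])
                         for x, c in zip(X, y)])
    reg = d_center < center_dist_max
    X_reg, y_reg = X[reg], y[reg]

    # Identify a regression plane in the original feature space by
    # regressing the class labels on the features.
    plane = LinearRegression().fit(X_reg, y_reg)
    w, b = plane.coef_, plane.intercept_

    # Criterion 2: keep the regression data whose distance to the
    # plane w.x + b = y exceeds the second threshold; these are the
    # selected "support vectors".
    d_plane = np.abs(X_reg @ w + b - y_reg) / np.sqrt(np.linalg.norm(w)**2 + 1)
    sv = d_plane > plane_dist_min
    return X_reg[sv], y_reg[sv]
```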

To the best of our knowledge, SVOIS is the first method for filtering out redundant instances (i.e., documents) in text classification. Current state-of-the-art instance selection techniques (cf. Section 2) have only been assessed for classification performance over datasets of very low dimensionality, and they perform poorly in the context of text classification. In contrast, using SVOIS to filter out unimportant documents from a given training dataset allows two well-known classifiers (i.e., SVM and k-NN) to perform better than both their counterparts without instance selection and those preceded by state-of-the-art instance selection techniques.

The rest of this paper is organized as follows. Section 2 provides an overview of the related literature, including the concept of instance selection and four well-known instance selection algorithms: ENN, IB3, DROP3, and ICF. The proposed SVOIS method for text classification is introduced in Section 3. In Section 4, the experimental results based on a public text classification dataset are presented. Finally, some conclusions are offered in Section 5.

Section snippets

Instance selection

Instance selection can be defined as follows. Given a dataset D composed of a training set T and a testing set U, let Xi be the ith instance in D, where Xi = (x1, x2, …, xm) contains m different features. Let S ⊂ T be the subset of selected instances resulting from the execution of an instance selection algorithm. Then, U is used to test a classification technique trained by S [8], [12].
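
In code, this protocol might look like the sketch below (scikit-learn; `select_instances` is a placeholder for any concrete selection algorithm such as ENN, IB3, ICF, DROP3, or SVOIS):

```python
# Generic instance selection protocol: D = T ∪ U, select S ⊂ T,
# train on S, test on U. `select_instances` stands in for any
# concrete instance selection algorithm.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, select_instances):
    # Split D into a training set T and a testing set U.
    X_T, X_U, y_T, y_U = train_test_split(X, y, random_state=0)
    # S: the subset of T kept by the selection algorithm.
    X_S, y_S = select_instances(X_T, y_T)
    # Train on S only; evaluate on the untouched testing set U.
    clf = KNeighborsClassifier().fit(X_S, y_S)
    return clf.score(X_U, y_U)
```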

There are a number of studies in the literature related to proposing instance selection methods to obtain better …

Support Vector Oriented Instance Selection

In this section, we describe our proposed instance selection method, namely Support Vector Oriented Instance Selection (SVOIS). SVOIS is different from the related selection techniques described in Section 2, which are mainly designed for the k-NN classification method. In SVOIS, there are four steps for the selection of support vectors, i.e., important and representative instances. Given a training dataset T containing a set of n data samples, let xi be the ith instance in T, where i = 1, 2, …, n …

The dataset

The dataset used to evaluate the performance of SVOIS is based on the TechTC-100 dataset [11], [14]. It includes 100 different two-class datasets, in which the largest and smallest datasets contain 165 and 125 documents, respectively. On average, each pair contains 149 documents. In addition, the pairs range from easy category pairs, such as Games/Video_Games/Shooter vs. Recreation/Autos/Makes_and_Models/Volkswagen, to medium-difficulty ones such as Arts/Music/Bands_and_Artists vs. Arts/Celebrities, …

Conclusion

In this paper, we introduce a novel instance selection method, namely Support Vector Oriented Instance Selection (SVOIS), for text classification. This approach is simple and computationally efficient enough to filter out redundant data from a given training dataset. SVOIS is compared with four state-of-the-art algorithms, including ENN, IB3, ICF, and DROP3, using a public text document corpus. We observe that the state-of-the-art algorithms cannot select high dimensional data well, which …

References (31)

  • X.-B. Li et al., Adaptive data reduction for large-scale transaction data, European Journal of Operational Research (2008)
  • C.C. Aggarwal et al., Outlier detection for high dimensional data, Proceedings of the ACM SIGMOD Conference (2001)
  • D.W. Aha et al., Instance-based learning algorithms, Machine Learning (1991)
  • V. Barnett et al., Outliers in Statistical Data (1994)
  • J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic, Interaction of feature selection methods and linear...
  • H. Brighton et al., Advances in instance selection for instance-based learning algorithms, Data Mining and Knowledge Discovery (2002)
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • H. Byun et al., A survey on pattern recognition applications of support vector machines, International Journal of Pattern Recognition and Artificial Intelligence (2003)
  • J.R. Cano et al., Using evolutionary algorithms as instance selection for data reduction: an experimental study, IEEE Transactions on Evolutionary Computation (2003)
  • T.F. Cox et al., Multidimensional Scaling (2001)
  • A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, M.W. Mahoney, Feature selection methods for text classification, in:...
  • D. Davidov, E. Gabrilovich, S. Markovitch, Parameterized generation of labeled datasets for text categorization based...
  • J. Derrac et al., A survey on evolutionary instance selection and generation, International Journal of Applied Metaheuristic Computing (2010)
  • G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research (2003)
  • E. Gabrilovich et al., Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5, International Conference on Machine Learning (2004)