Applied Soft Computing

Volume 37, December 2015, Pages 296-302
On learning dual classifiers for better data classification

https://doi.org/10.1016/j.asoc.2015.08.038

Highlights

  • A dual classification (DuC) approach is presented to deal with the potential drawback of instance selection.

  • During training, two classifiers are trained on the ‘good’ and ‘noisy’ sets, respectively, identified by instance selection.

  • Experiments are conducted using 50 small scale and 4 large scale datasets.

  • The results show that the DuC approach outperforms three state-of-the-art instance selection algorithms.

Abstract

Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space, but can also ensure that the classifier trained on the reduced set provides similar or better performance than the baseline classifier trained on the original set. However, although there are numerous instance selection algorithms, no single algorithm is the best across datasets from different problem domains. In other words, instance selection performance is algorithm and dataset dependent. One main reason for this is that it is very hard to define what constitutes an outlier across different datasets. It should be noted that a specific instance selection algorithm may over-select, filtering out too many ‘good’ data samples and leaving the classifier with worse performance than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with this potential drawback of over-selection. Specifically, after performing instance selection over a given training set, two classifiers are trained on the ‘good’ and ‘noisy’ sets, respectively, identified by the instance selection algorithm. Then, each test sample is compared for similarity with the data in the good and noisy sets, and this comparison guides the test sample to one of the two classifiers. The experiments are conducted using 50 small scale and 4 large scale datasets, and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.

Introduction

Developing effective classification (or prediction) models is usually the key to success in most data mining or pattern recognition problems. Effective classification models are those which provide high classification accuracy or perform well on other related measures, such as the ROC curve [20] and F-score [1]. Generally, the performance of classification models is examined on a given testing set with ground truth answers. Models that produce more correct answers on the testing data are regarded as performing better than those that produce fewer. Reducing the classification error rate is one of the key issues in most data mining and pattern recognition research.

However, even when well-known supervised learning (or classification) techniques are used, such as support vector machines, decision trees, or k-nearest neighbors [2], the developed classifiers inevitably produce a certain proportion of incorrect answers in many problem domains. In other words, a classifier with 100% classification accuracy over various testing sets is unlikely to be developed.

In order to improve classification accuracy (or reduce classification error), one of the most popular solutions is to design a more sophisticated classification model based on novel hybrid and ensemble methods [3], [4], [5], [6]. Such models have been shown to outperform single-learning-based models.

Since the distribution of data samples in the feature space is usually nonlinear and very complex, the training effectiveness of learning models over a given training set is limited. In other words, a given training set generally contains a certain amount of noisy data, which is likely to degrade the performance of the trained models. Instance selection, a data pre-processing step in the knowledge discovery in databases (KDD) process, aims at reducing the dataset size by filtering out noisy data from a given dataset [7], [8].
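To make this filtering step concrete, below is a minimal sketch of one classic instance selection algorithm, Wilson's edited nearest neighbour (ENN), which flags a sample as noisy when its neighbours disagree with its label. The paper does not prescribe this particular algorithm; the function name and the choice of k here are illustrative only.

    import numpy as np
    from collections import Counter

    def enn_instance_selection(X, y, k=3):
        # Wilson's ENN: flag a training sample as 'noisy' when the
        # majority label among its k nearest neighbours disagrees with
        # its own label. Returns a boolean mask over X
        # (True = kept as 'good', False = filtered out as 'noisy').
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        keep = np.ones(len(X), dtype=bool)
        # pairwise Euclidean distances between all training samples
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        for i in range(len(X)):
            order = np.argsort(dists[i])
            neighbours = [j for j in order if j != i][:k]
            majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
            if majority != y[i]:
                keep[i] = False
        return keep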

In this paper, we propose a dual classification (DuC) approach, which aims at improving the training effectiveness of learning models through instance selection. In particular, performing instance selection on a given dataset yields a selected subset of ‘good’ training data and a filtered-out subset of ‘noisy’ training data. Two different classifiers are then trained, one on each subset. For the testing step, the k-nearest neighbor similarity measure is used to compare the testing data sample with the data in the two subsets, identifying the training sample nearest to the testing sample. Depending on which subset this nearest training sample belongs to, the testing sample is then input into the corresponding classifier. Finally, the classification results are obtained from the outputs of the two classifiers.
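The testing step just described can be sketched as follows, assuming two already trained classifiers that expose a scikit-learn-style predict method; for brevity the similarity comparison uses the single nearest neighbour (1-NN) under Euclidean distance, i.e. the k = 1 case of the paper's k-nearest neighbor measure. All names here are illustrative.

    import numpy as np

    def duc_predict(x, X_good, X_noisy, clf_good, clf_noisy):
        # Distance from the test sample x to its nearest training
        # sample in each subset (1-NN, Euclidean distance).
        d_good = np.min(np.linalg.norm(np.asarray(X_good) - x, axis=1))
        d_noisy = np.min(np.linalg.norm(np.asarray(X_noisy) - x, axis=1))
        # Dispatch x to the classifier trained on the subset that
        # contains the nearest training sample.
        clf = clf_good if d_good <= d_noisy else clf_noisy
        return clf.predict(np.asarray(x).reshape(1, -1))[0]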

The proposed DuC approach differs from conventional instance selection, which omits the filtered-out data samples during the model training step. Because it is hard to define the noise level of data samples (or outliers), over- or under-selection may occur, which is likely to degrade model performance (cf. Section 2.2).

The rest of this paper is organized as follows. Section 2 briefly describes the basic concepts of pattern classification and instance selection. Section 3 introduces the proposed approach. Section 4 presents the experimental results and conclusions are provided in Section 5.


Pattern classification

The goal of pattern classification is to allocate an object, represented by a number of measurements (i.e. a feature vector), to one of a finite set of classes. Supervised learning can be thought of as learning by example, or learning with a teacher: the teacher has knowledge of the environment, represented by a set of input–output examples. In order to classify unknown patterns, a certain number of training samples are available for each class, and they are used to train the classifier

The proposed DuC approach

The proposed DuC (dual classification) approach is based on the divide-and-conquer principle, whereby a complex training problem is divided into two sub-problems (i.e. simpler tasks), thereby making the classifier training step more effective. In particular, each classifier is only responsible for its own specific task.

Fig. 2 shows the classifier training and testing processes of the proposed approach. For the training process, given a training set D, which is divided into a training
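A minimal sketch of this training process is given below, using the ENN routine sketched in the introduction as a stand-in for the instance selection step and 1-NN classifiers as the two component models; the paper fixes neither choice, and X and y are assumed to be NumPy arrays.

    from sklearn.neighbors import KNeighborsClassifier

    def duc_train(X, y, k=3):
        # Split the training set D = (X, y) into 'good' and 'noisy'
        # subsets via instance selection (ENN here, as a stand-in).
        keep = enn_instance_selection(X, y, k=k)
        X_good, y_good = X[keep], y[keep]
        X_noisy, y_noisy = X[~keep], y[~keep]
        # Fit one classifier per subset; this sketch assumes both
        # subsets are non-empty (a real implementation would fall
        # back to a single classifier otherwise).
        clf_good = KNeighborsClassifier(n_neighbors=1).fit(X_good, y_good)
        clf_noisy = KNeighborsClassifier(n_neighbors=1).fit(X_noisy, y_noisy)
        return X_good, X_noisy, clf_good, clf_noisy

The returned subsets and classifiers are exactly the inputs expected by the duc_predict routine sketched in the introduction, so the two together form the complete train-then-route pipeline.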

Experimental setup

Two experimental studies were conducted, based on small and large scale datasets respectively. For the small scale study, 50 UCI datasets are used [17], which vary in the number of data samples, feature dimensionality, etc.

On the other hand, the second study is based on three large scale datasets, which are the KDD Cup 2004 (Protein homology prediction), KDD Cup 2008

Conclusion

This paper presents a novel approach, namely dual classification (DuC), which aims at improving classification performance through instance selection. In particular, DuC trains two classifiers on the good and noisy subsets, respectively, identified by a specific instance selection algorithm, and the testing set is likewise divided into good and noisy subsets to be input into the corresponding classifiers. The experimental results over small and large scale datasets show that DuC
