On learning dual classifiers for better data classification
Introduction
Developing effective classification (or prediction) models is usually the key to success in most data mining and pattern recognition problems. Effective classification models are those which provide high classification accuracy or perform well on related measures, such as the ROC curve [20] and the F-score [1]. Generally, the performance of classification models is examined on a given testing set with ground truth answers. Models that produce more correct answers on the testing data are regarded as performing better than those that produce fewer. Reducing the classification error rate is thus one of the key issues in most data mining and pattern recognition research.
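For illustration, a minimal sketch of how such testing-set measures (classification accuracy and the F-score) are computed from predictions and ground truth answers; the sample labels below are hypothetical:

```python
# Toy evaluation of a classifier's predictions against ground truth labels.
def evaluate(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Accuracy: fraction of testing samples answered correctly.
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # F-score: harmonic mean of precision and recall for the positive class.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, f_score

acc, f1 = evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```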
However, even when well-known supervised learning (or classification) techniques are used, including support vector machines, decision trees, and k-nearest neighbors [2], the developed classifiers will inevitably produce a certain proportion of incorrect answers in many problem domains. In other words, developing a classifier with 100% classification accuracy over various testing sets is unlikely.
In order to improve classification accuracy (or reduce the classification error), one of the most popular solutions is to design a more sophisticated classification model based on novel hybrid and ensemble methods [3], [4], [5], [6]. Such models have been shown to outperform single-learning-based models.
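As a minimal illustration of the ensemble idea, several base classifiers can be combined by majority voting; the toy base learners below are hypothetical stand-ins for trained models, not the methods cited above:

```python
from collections import Counter

# Majority-vote ensemble: each base "classifier" here is simply a function
# mapping a feature vector to a class label.
def majority_vote(classifiers, sample):
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical base learners over 2-dimensional samples.
clf_a = lambda x: 1 if x[0] > 0 else 0
clf_b = lambda x: 1 if x[1] > 0 else 0
clf_c = lambda x: 1 if x[0] + x[1] > 0 else 0

label = majority_vote([clf_a, clf_b, clf_c], (0.5, -0.2))
```

Even when one base learner errs (here `clf_b` votes 0), the ensemble decision follows the majority.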
Since the data samples in the feature space are usually nonlinearly distributed and very complex, the training effectiveness of learning models over a given training set is limited. In other words, a given training set generally contains a certain amount of noisy data, which is likely to degrade the performance of the trained models. Instance selection, a data pre-processing step in the knowledge discovery in databases (KDD) process, aims at reducing the dataset size by filtering out noisy data from a given dataset [7], [8].
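As an illustrative sketch of instance selection, a Wilson-style edited nearest neighbour (ENN) filter, one common choice, splits a training set into a 'good' subset (samples whose neighbours agree with their label) and a 'noisy' subset; the data below are hypothetical:

```python
import math

def knn_predict(train, x, k=3, skip=None):
    """Label of x by majority vote among its k nearest training samples."""
    dists = sorted(
        (math.dist(fx, x), y)
        for i, (fx, y) in enumerate(train) if i != skip
    )
    labels = [y for _, y in dists[:k]]
    return max(set(labels), key=labels.count)

def enn_filter(train, k=3):
    """Split a training set into 'good' and 'noisy' subsets: a sample is
    noisy if its k nearest neighbours (excluding itself) disagree with it."""
    good, noisy = [], []
    for i, (x, y) in enumerate(train):
        (good if knn_predict(train, x, k, skip=i) == y else noisy).append((x, y))
    return good, noisy

# Two well-separated clusters plus one mislabelled (noisy) sample.
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1),
         ((0.5, 0.5), 1)]
good, noisy = enn_filter(train, k=3)
```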
In this paper, we propose a dual classification (DuC) approach, which aims at improving the training effectiveness of learning models through instance selection. In particular, performing instance selection on a given dataset yields a selected subset of 'good' training data and a filtered-out subset of 'noisy' training data. Two different classifiers are then trained, one on each subset. In the testing step, the k-nearest neighbor similarity measure is used to compare each testing sample with the data in the two subsets, identifying the training sample nearest to it. Depending on which subset that nearest training sample belongs to, the testing sample is input into the corresponding classifier. The final classification results are obtained from the outputs of the two classifiers.
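The testing step described above can be sketched as follows; this is a simplified illustration in which 1-nearest-neighbour rules stand in for the two trained classifiers, not the exact configuration used in the experiments:

```python
import math

# DuC testing step: route each testing sample to the classifier whose
# subset (good or noisy) contains its nearest training sample.
# `clf_good` and `clf_noisy` stand in for the two trained classifiers.
def duc_predict(x, good, noisy, clf_good, clf_noisy):
    # Distance from x to its nearest training sample in each subset.
    d_good = min(math.dist(fx, x) for fx, _ in good)
    d_noisy = min(math.dist(fx, x) for fx, _ in noisy)
    # Input x into the classifier of the nearer subset.
    return clf_good(x) if d_good <= d_noisy else clf_noisy(x)

# Hypothetical stand-in classifiers: a 1-NN rule over each subset.
def one_nn(subset):
    return lambda x: min(subset, key=lambda s: math.dist(s[0], x))[1]

good = [((0.0, 0.0), 0), ((5.0, 5.0), 1)]
noisy = [((2.0, 2.0), 1)]
label = duc_predict((0.2, 0.0), good, noisy, one_nn(good), one_nn(noisy))
```

Here the testing sample (0.2, 0.0) lies nearest to a member of the good subset, so it is routed to (and labelled by) the classifier trained on the good data.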
The proposed DuC approach differs from conventional instance selection, which omits the filtered-out data samples during the model training step. This is because it is hard to define the noise level of data samples (or outliers), meaning that an over- or under-selection result may be produced, which is likely to degrade model performance (c.f. Section 2.2).
The rest of this paper is organized as follows. Section 2 briefly describes the basic concepts of pattern classification and instance selection. Section 3 introduces the proposed approach. Section 4 presents the experimental results and conclusions are provided in Section 5.
Pattern classification
The goal of pattern classification is to assign an object, represented by a number of measurements (i.e., a feature vector), to one of a finite set of classes. Supervised learning can be thought of as learning by example, or learning with a teacher. The teacher has knowledge of the environment, which is represented by a set of input–output examples. In order to classify unknown patterns, a certain number of training samples are available for each class, and they are used to train the classifier
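The supervised setting described above can be illustrated with a minimal 1-nearest-neighbour rule; the labelled examples below are a hypothetical toy case, not the classifiers used in the experiments:

```python
import math

# Labelled input-output examples act as the "teacher": each pair maps a
# feature vector to its known class.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]

def classify(x):
    # Assign an unknown pattern the class of its nearest training sample.
    return min(train, key=lambda s: math.dist(s[0], x))[1]

label = classify((1.1, 0.9))
```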
The proposed DuC approach
The proposed DuC (Dual Classification) approach is based on the divide-and-conquer principle, where a complex training problem is divided into two sub-problems (i.e., simpler tasks), thereby making the classifier training step more effective. In particular, each classifier is only responsible for its own specific task.
Fig. 2 shows the classifier training and testing processes of the proposed approach. For the training process, given a training set D, which is divided into a training
Experimental setup
Two experimental studies were conducted, based on small and large scale datasets, respectively. For the small scale study, 50 UCI datasets are used [17], which contain various numbers of data samples, feature dimensionalities, etc.
On the other hand, the second study is based on three large scale datasets, which are the KDD Cup 2004 (protein homology prediction), KDD Cup 2008
Conclusion
This paper presents a novel approach, namely dual classification (DuC), which aims at improving classification performance through instance selection. In particular, DuC trains two classifiers over the good and noisy subsets, respectively, identified by a specific instance selection algorithm; the testing set is likewise divided into good and noisy subsets that are input into the corresponding classifiers. The experimental results over small and large scale datasets show that DuC
References (20)
- et al., EUBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit. (2013)
- et al., A novel hybrid CNN-SVM classifier for recognizing handwritten digits, Pattern Recognit. (2012)
- et al., Adaptive data reduction for large-scale transaction data, Eur. J. Op. Res. (2008)
- et al., Framework for efficient feature selection in genetic algorithm based data mining, Eur. J. Op. Res. (2007)
- et al., A recent advance in data analysis: clustering objects into classes characterized by conjunctive concepts, Information Retrieval (1979)
- et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. (2008)
- et al., Hybrid and ensemble methods in machine learning, J. Univers. Comput. Sci. (2013)
- et al., Cluster-oriented ensemble classifier: impact of multicluster characterization on ensemble classifier learning, IEEE Trans. Knowl. Data Eng. (2012)
- et al., Reduction techniques for instance-based learning algorithms, Mach. Learn. (2000)