1 Introduction

Machine learning classification algorithms are used in many business applications, such as credit risk prediction and wait-time prediction for patients in emergency department waiting rooms [1, 2]. Consequently, data quality has an enormous effect on the accuracy and efficiency of these algorithms [3]. Two main concerns relating to data quality are imbalanced data and noisy data. In imbalanced data, the instances of one class, considered the majority or negative class, overwhelm the other class, the minority or positive class. The dominance of the negative class can lead a classifier to treat positive class examples as noise and wrongly discard them [3, 4]. On the other hand, there are two types of noise that can reduce system performance: class noise and attribute noise. Attribute noise is due to errors in the data attributes, such as missing values and redundant data, while class noise is due to errors in the instances, for example when similar instances are labelled with different classes or instances are assigned to the wrong class [5].

A study by Garcia, Lorena and Carvalho [6] used consensus and majority voting strategies to identify class mislabelling with ensemble classifiers. The results showed that consensus voting was unable to identify most of the noise, whereas majority voting was more successful. Although the ensemble technique performed well and could handle noisy data, it only provided good results for some datasets. Sluban, Gamberger and Lavrac [7] proposed a high agreement random forest filter for noise detection. It performed better than other classification filters, such as Naïve Bayes and SVM, but not on all datasets. Verbaeten and Van Assche [8] used ensemble methods such as committees, bagging and boosting to identify noisy instances and remove outliers from the training set. The results showed that the cost of finding new training instances was high on small datasets. Jeatrakul, Wong and Fung [9] presented the Complementary Neural Network (CMTNN) technique to identify noisy data and enhance the performance of a neural network classifier. The technique was applied to both binary and multiclass classification problems; it can handle the misclassification problem and improve prediction accuracy. However, a trained neural network is a black box whose rules of operation are unknown. Lee, Taur and Tao [10] proposed outlier detection with the Fuzzy Support Vector Machine (FSVM) and found that FSVM was more robust against outliers.

In order to handle the noise and outlier problem, this study proposes an extension of the Complementary (CMT) technique that combines CMT with FSVM. The structure of this paper is as follows. Section 2 provides background on the complementary technique and the fuzzy support vector machine. Section 3 presents the proposed methodology and the evaluation method used in this study. The results and conclusion are presented in Sects. 4 and 5 respectively.

2 Background

In this section, the complementary technique, the data cleaning techniques and the evaluation method are described.

2.1 Complementary Neural Network

Complementary Neural Network (CMTNN) [11] is a misclassification analysis technique used to enhance the quality of training data by comparing the predictions obtained from truth- and falsity-trained classifiers. A Truth Neural Network (Truth NN) and a Falsity Neural Network (Falsity NN) form a pair of complementary feed-forward back-propagation neural networks, as shown in Fig. 1. The training data are used to train the Truth NN and the Falsity NN to predict the degrees of truth membership and false membership respectively. The architectures of the Truth NN and the Falsity NN are identical, except that the target outputs of the Falsity NN are the complement of the Truth NN target outputs. The difference between the truth membership and false membership values represents the uncertainty in the classification process.

Fig. 1. Complementary Neural Network (Source: Advanced Computational Intelligence and Intelligent Informatics 14 (2010), p. 298)
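
As a concrete illustration of the pairing in Fig. 1, the following minimal sketch trains a truth and a falsity network and measures their disagreement. It assumes scikit-learn's MLPRegressor as a stand-in for the feed-forward back-propagation network and binary targets in {0, 1}; the paper does not prescribe an implementation, so all names and parameters here are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def cmtnn_memberships(X_train, y_train):
    """Train a truth NN on the targets and a falsity NN on their
    complement, then return both predicted membership degrees."""
    truth_nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
    falsity_nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
    truth_nn.fit(X_train, y_train)        # truth targets in {0, 1}
    falsity_nn.fit(X_train, 1 - y_train)  # complemented targets
    t = truth_nn.predict(X_train)         # truth membership degree
    f = falsity_nn.predict(X_train)       # false membership degree
    # the gap between truth and (1 - false) membership signals uncertainty
    uncertainty = np.abs(t - (1.0 - f))
    return t, f, uncertainty
```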

2.2 Fuzzy Support Vector Machine

Support Vector Machine (SVM) [12] is widely used in machine learning and works effectively with balanced datasets. It aims to find an optimal separating hyperplane by solving the optimisation problem shown below:

$$ Min\left( {\frac{1}{2}\omega \cdot \omega + C\sum\nolimits_{i = 1}^{l} {\xi_{i} } } \right) $$
(1)

subject to \( y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)

where \( y_{i} \) is the class label, ω is the normal (weight) vector of the hyperplane, C is a penalty parameter, \( \xi_{i} \) is the slack variable for misclassified examples, Φ is a mapping function that transforms the data into a higher-dimensional feature space, and b is the bias.

However, SVM is sensitive to outliers and noise [13]. Therefore, the Fuzzy Support Vector Machine (FSVM) has been proposed to handle these problems. In FSVM, a fuzzy membership value \( m_{i} \) is assigned to each input point to represent its importance to its own class [14, 15], as shown in Eq. (2). Lower membership values are assigned to less important examples such as outliers and noise. The computation of \( m_{i} \) is described in Sect. 2.3.

$$ Min\left( {\frac{1}{2}\omega \cdot \omega + C\sum\nolimits_{i = 1}^{l} {m_{i} \xi_{i} } } \right) $$
(2)

subject to \( y_{i} \left( {\omega \cdot {\Upphi }\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)
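
Since the objective in Eq. (2) simply scales each example's slack penalty by \( m_{i} \), one way to realise FSVM in practice is through per-sample weights. The sketch below assumes scikit-learn's SVC, whose fit() accepts sample_weight and scales C per example; the membership vector m is taken as precomputed (see Sect. 2.3).

```python
from sklearn.svm import SVC

def fit_fsvm(X_train, y_train, m, C=1.0):
    """Fit an SVM whose slack penalty for example i is scaled by m[i],
    mirroring the C * m_i * xi_i term of Eq. (2)."""
    clf = SVC(C=C, kernel="rbf")
    clf.fit(X_train, y_train, sample_weight=m)
    return clf
```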

2.3 Complementary Fuzzy Support Vector Machine

The Complementary Fuzzy Support Vector Machine (CMTFSVM) applies the complementary (CMT) concept of truth and complemented target outputs from CMTNN, using the Fuzzy Support Vector Machine (FSVM) as the classifier to identify uncertain data. An exponentially decaying function based on the distance from the separating hyperplane [13] is used as the fuzzy membership value, as follows:

$$ f\left( {x_{i} } \right) = \frac{2}{{1 + exp\left( {\beta d_{i}^{hyp} } \right)}} $$
(3)

where β is the steepness of the decay, with β ∊ [0, 1], and \( d_{i}^{hyp} \) is the functional margin of each example \( x_{i} \), which is equivalent to the absolute value of the SVM decision value and is defined in Eq. (4).

$$ d_{i}^{hyp} = y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) $$
(4)
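
A minimal sketch of Eqs. (3) and (4): an initial SVM supplies the decision values from which the functional margin and the exponentially decaying membership are computed. It assumes labels in {-1, +1}; the base classifier and the default β are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def fuzzy_memberships(X_train, y_train, beta=0.5):
    """Compute m_i = 2 / (1 + exp(beta * d_i_hyp)) as in Eq. (3)."""
    base = SVC(kernel="rbf").fit(X_train, y_train)
    # functional margin of Eq. (4); assumes y_train in {-1, +1}
    d_hyp = y_train * base.decision_function(X_train)
    return 2.0 / (1.0 + np.exp(beta * d_hyp))
```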

2.4 Cleaning Techniques

First, truth and falsity models are trained using FSVM on the truth and falsity (complemented) target outputs. Secondly, the prediction outputs of the truth and falsity models are compared with the actual outputs, and the misclassified patterns (\( M_{Truth} \), \( M_{Falsity} \)) are detected. Finally, the misclassified instances are eliminated from the training data (T) and a new training data set is created. Two kinds of new training data, CMT1 and CMT2, can be created depending on the elimination technique used, as follows:

$$ CMT1 = T - (M_{Truth} \cup M_{Falsity} ) $$
(5)
$$ CMT2 = T - \left( {M_{Truth} \cap M_{Falsity}} \right) $$
(6)

The CMT1 training data set is constructed by eliminating all instances misclassified by either the truth or the falsity model (the union). In contrast, misclassified instances are eliminated from CMT2 only if they are detected by both the truth and the falsity models (the intersection).
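
The elimination step of Eqs. (5) and (6) reduces to set operations on the misclassified indices. The sketch below assumes binary labels in {0, 1} and classifiers truth_clf and falsity_clf already trained on the original and complemented targets; all names are illustrative.

```python
import numpy as np

def cmt_clean(X, y, truth_clf, falsity_clf):
    """Return the CMT1 and CMT2 training sets of Eqs. (5) and (6)."""
    m_truth = truth_clf.predict(X) != y            # M_Truth: wrong on truth targets
    m_falsity = falsity_clf.predict(X) != (1 - y)  # M_Falsity: wrong on falsity targets
    keep1 = ~(m_truth | m_falsity)  # CMT1: remove the union, Eq. (5)
    keep2 = ~(m_truth & m_falsity)  # CMT2: remove the intersection, Eq. (6)
    return (X[keep1], y[keep1]), (X[keep2], y[keep2])
```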

2.5 Evaluation Method

The confusion matrix [16], shown in Table 1, is a typical measurement for assessing classification performance; it records the counts of correctly and incorrectly recognised instances of each class. The accuracy is defined in Eq. (7).

Table 1. Confusion matrix terminology

                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
(7)
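
For completeness, Eq. (7) translated directly into code:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as defined in Eq. (7)."""
    return (tp + tn) / (tp + tn + fp + fn)
```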

3 Proposed Methodology

3.1 Datasets

In this paper, the binary classification problem is considered. Four benchmark imbalanced datasets (German, Ionosphere, Pima and Yeast3) from the UCI machine learning repository [17] and KEEL (Knowledge Extraction based on Evolutionary Learning) [18] are used. They are shown in Table 2.

Table 2. Detail of datasets

3.2 Experimental Processes

The experimental processes in this work are based on the complementary technique (CMT) for handling the noise and outlier problems. There are three processes: (1) data cleaning with the CMT technique, (2) classification with FSVM, and (3) evaluation and comparison of results. The detailed steps are presented in Fig. 2.

Fig. 2. Comparison of outlier and noise handling with CMTNN and CMTFSVM techniques

In the first step of this study, the original datasets are normalised using the range transformation method and divided randomly into training and testing sets with a ratio of 80 % to 20 % respectively; 10-fold cross validation is applied to each dataset. Once the samples are divided, the training samples are replicated into falsity and truth data, and the complementary technique is applied to the falsity data by complementing the target outputs of the truth data. After that, both the falsity data and the truth data are classified by a back-propagation neural network (NN) using the CMT technique, known as the CMTNN method, and by a CMT fuzzy support vector machine, known as the CMTFSVM method. CMTFSVM in this experiment uses the exponentially decaying membership function and a radial basis function kernel. The optimal value of β is chosen from the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} for each of the falsity and truth data, and the results from the falsity and truth datasets are compared. Following this process, the new training sets CMTNN1 and CMTFSVM1 are created by removing the misclassified instances detected by either the falsity or the truth method, while CMTNN2 and CMTFSVM2 are generated by eliminating only the misclassified instances detected by both the falsity and truth methods. The new training datasets are then classified by FSVM. Finally, the testing data are evaluated and the results compared.

To further assess the proposed data cleaning techniques, 40 % of the instances are randomly selected from each original training dataset, their class values are changed to the complement of the original class, and the flipped instances are injected into the original training datasets. The data cleaning techniques CMTNN and CMTFSVM are then applied to the noisy datasets, which are classified by a neural network and FSVM respectively.
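
As one plausible reading of the noise-injection protocol described above, the sketch below complements the class of a randomly selected 40 % of the training instances and injects the flipped copies into the original training set; binary labels in {0, 1} and the random seed are assumptions.

```python
import numpy as np

def inject_class_noise(X, y, rate=0.4, seed=0):
    """Complement the class of a random `rate` fraction of instances
    and inject the flipped copies into the original training data."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    X_aug = np.vstack([X, X[idx]])
    y_aug = np.concatenate([y, 1 - y[idx]])  # complemented class values
    return X_aug, y_aug
```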

4 Experiment Results

Binary classification datasets from the UCI and KEEL repositories are used in the experiment. A comparison of the results obtained by the CMTNN and CMTFSVM techniques is shown in Table 3.

Table 3. Comparing the accuracy of results between CMTNN and CMTFSVM techniques.

It is observed that the CMTFSVM1 technique gave better results than CMTNN1 by approximately 2 %. Moreover, comparing the CMTNN2 and CMTFSVM2 techniques, CMTFSVM2 achieved higher accuracy than CMTNN2 on all datasets. Finally, an analysis of CMTFSVM1 and CMTFSVM2 indicated that CMTFSVM2 performed better by approximately 2 %, 1 %, 0.3 % and 2 % on German, Ionosphere, Pima and Yeast3 respectively. Overall, these results indicate that CMTNN2 and CMTFSVM2 gave better results than CMTNN1 and CMTFSVM1, and that CMTFSVM2 performed better than CMTNN2 on all datasets. The next experiment investigates the performance of the data cleaning techniques CMTNN2 and CMTFSVM2 after adding 40 % noise into the datasets. The results are compared with those of the plain Neural Network (NN) and Fuzzy Support Vector Machine (FSVM), and are shown in Table 4.

Table 4. Comparison of accuracy after injection of 40 % of noise into the datasets

It can be seen that FSVM gave better results than NN; on the original datasets, FSVM outperformed NN by approximately 3 %. When the cleaning techniques were applied to the original datasets, CMTFSVM2 achieved approximately 1.5 % higher accuracy than CMTNN2, although CMTFSVM2 showed lower accuracy than FSVM. However, after adding 40 % noise to the original datasets, it was clear that CMTFSVM2 handled the noisy datasets better than FSVM. Moreover, the accuracy of CMTFSVM2 was higher than that of CMTNN2 by approximately 8 %, 19 %, 5 % and 6 % on German, Ionosphere, Pima and Yeast3 respectively.

5 Conclusion

In this study, the CMTFSVM data cleaning technique is proposed to eliminate outliers and class noise. Class noise is injected into 40 % of the training data, and classification based on the FSVM classifier is then compared with an NN classifier, with accuracy used to evaluate system performance. Four well-known datasets from the UCI and KEEL repositories are analysed. The classification accuracy shows that CMTFSVM is robust and performs better with noisy data in the comparison study. Moreover, it can enhance the classification performance in terms of accuracy by approximately 10 % for all datasets.