Abstract
In this paper, a Complementary Fuzzy Support Vector Machine (CMTFSVM) technique is proposed to handle outliers and noise in classification problems. A fuzzy membership value is assigned to each input point to reflect the degree of importance of the instance. Datasets from the UCI and KEEL repositories are used for the comparison. To validate the proposed methodology, 40 % random noise is added to the datasets. The experimental results of CMTFSVM are analysed and compared with those of the Complementary Neural Network (CMTNN). The outcome indicates that CMTFSVM outperforms the CMTNN approach.
1 Introduction
Machine learning classification algorithms have been used in many business applications, such as credit risk prediction and wait-time prediction for patients in emergency department waiting rooms [1, 2]. Data quality therefore has an enormous effect on the accuracy and efficiency of these algorithms [3]. Two main concerns relate to data quality: imbalanced data and noisy data. In imbalanced data, the instances of one class, considered the majority or negative class, overwhelm those of the other class, the minority or positive class. The classifier may then treat positive class examples as noise and wrongly discard them [3, 4]. Noise, in turn, comes in two types that can reduce system performance: class noise and attribute noise. Attribute noise arises from errors in the data attributes, such as missing values and redundant data, while class noise arises from errors in the instances themselves, for example when similar instances are assigned to different classes or an instance is labelled with the wrong class [5].
A study by Garcia, Lorena and Carvalho [6] used consensus and majority voting strategies to identify class mislabelling from ensemble classifiers. The results showed that consensus voting was unable to identify most of the noisy instances, whereas majority voting was more successful. Although the ensemble technique performed well and could handle noisy data, it only produced good results on some datasets. Sluban, Gamberger and Lavrač [7] proposed a high-agreement random forest filter for noise detection. It performed better than other classification filters such as Naïve Bayes and SVM, but again not on all datasets. Verbaeten and Van Assche [8] used ensemble methods such as committees, bagging and boosting to identify noise and remove outliers from the training set. The results showed that the cost of finding new training instances was high on small datasets. Jeatrakul, Wong and Fung [9] presented the Complementary Neural Network (CMTNN) technique to identify noisy data and enhance the performance of a neural network classifier. This technique was implemented to address both binary and multiclass classification problems; it can also handle misclassification and improve prediction accuracy. However, a trained neural network is a black box whose rules of operation are unknown. Lee, Taur and Tao [10] proposed outlier detection with the Fuzzy Support Vector Machine (FSVM) and found that FSVM was more robust against outliers.
To handle the noise and outlier problems, this study proposes an extension of the Complementary (CMT) technique by combining CMT with FSVM. The structure of this paper is as follows. Section 2 provides background on the complementary technique and the fuzzy support vector machine. Section 3 presents the proposed methodology and the evaluation method used in this study. The results and conclusion are presented in Sects. 4 and 5 respectively.
2 Background
In this section, the complementary technique, the data cleaning techniques and the evaluation method are described.
2.1 Complementary Neural Network
Complementary Neural Network (CMTNN) [11] is a misclassification analysis technique used to enhance the quality of training data by comparing the predictions of networks trained on truth and falsity targets. A Truth Neural Network (Truth NN) and a Falsity Neural Network (Falsity NN) form a pair of complementary feed-forward back-propagation neural networks, as shown in Fig. 1. The training data are used to train the Truth NN and the Falsity NN to predict the degrees of truth membership and falsity membership respectively. The architectures of the Truth NN and Falsity NN are identical, except that the target outputs of the Falsity NN are the complement of the Truth NN target outputs. The difference between the truth membership and falsity membership values represents the uncertainty in the classification process.
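As an illustration, the complementary comparison can be sketched with hypothetical membership outputs; the numeric values and the 0.2 uncertainty threshold below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical truth-membership predictions from the Truth NN and
# falsity-membership predictions from the Falsity NN for five instances.
truth_pred = np.array([0.95, 0.80, 0.40, 0.10, 0.55])
falsity_pred = np.array([0.05, 0.30, 0.35, 0.85, 0.50])

# The Falsity NN is trained on complemented targets, so its output is a
# falsity membership; uncertainty is the gap between the truth membership
# and the complement of the falsity membership.
uncertainty = np.abs(truth_pred - (1.0 - falsity_pred))

# Instances whose uncertainty exceeds an illustrative threshold are flagged.
flagged = np.where(uncertainty > 0.2)[0]
```

A perfectly consistent pair of networks would give `uncertainty` near zero for every instance; large gaps mark instances whose labels are questionable.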
2.2 Fuzzy Support Vector Machine
Support Vector Machine (SVM) [12] is widely used in machine learning and works effectively with balanced datasets. It aims to find an optimal separating hyperplane by solving:

\( \mathop {\min }\limits_{\omega , b, \xi } \; \frac{1}{2}\left\| \omega \right\|^{2} + C\sum\nolimits_{i = 1}^{l} {\xi_{i} } \)  (1)

subject to \( y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)

where \( y_{i} \) is the class label, ω is the weighted normal vector, C is a penalty parameter, \( \xi_{i} \) is the slack variable for misclassified examples, Φ is a mapping function that transforms the data into a higher-dimensional feature space, and b is the bias.
However, SVM is sensitive to outliers and noise [13]. Therefore, the Fuzzy Support Vector Machine (FSVM) has been proposed to handle these problems. A fuzzy membership value \( m_{i} \) is applied in FSVM to each input point to represent its importance to its own class [14, 15], as shown in Eq. (2). Lower membership values are assigned to less important examples such as outliers and noise; the computation of \( m_{i} \) is described in Sect. 2.3.

\( \mathop {\min }\limits_{\omega , b, \xi } \; \frac{1}{2}\left\| \omega \right\|^{2} + C\sum\nolimits_{i = 1}^{l} {m_{i} \xi_{i} } \)  (2)

subject to \( y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)
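As a rough sketch of the FSVM objective in Eq. (2): scikit-learn's `SVC` accepts a per-example `sample_weight` in `fit`, which scales the penalty C per instance and therefore corresponds to the \( C\sum m_{i} \xi_{i} \) term. The toy data, membership values and parameters below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian clusters plus one point labelled with the wrong class
# (an outlier sitting at the centre of the opposite cluster).
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
               rng.normal(2.0, 0.5, (20, 2)),
               [[-2.0, -2.0]]])
y = np.array([0] * 20 + [1] * 20 + [1])  # last instance is the outlier

# Fuzzy memberships m_i: the outlier gets a low value, the rest get 1.
m = np.ones(len(y))
m[-1] = 0.1

clf = SVC(kernel="rbf", C=10.0)
# sample_weight rescales the per-example penalty C -> C * m_i, mirroring
# the C * sum(m_i * xi_i) term of the FSVM objective.
clf.fit(X, y, sample_weight=m)
```

With full membership the mislabelled point would pull the boundary toward it; with a low membership its slack is cheap, so the boundary is dominated by the trustworthy examples.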
2.3 Complementary Fuzzy Support Vector Machine
The Complementary Fuzzy Support Vector Machine (CMTFSVM) applies the complementary (CMT) concept of truth and falsity target outputs from CMTNN, using the Fuzzy Support Vector Machine (FSVM) as the classifier to identify uncertain data. The fuzzy membership value is computed with an exponentially decaying function based on the distance from the actual hyperplane [13]:

\( m_{i} = \frac{2}{{1 + \exp \left( {\beta d_{i}^{hyp} } \right)}} \)  (3)

where β controls the steepness of the decay, β ∊ [0, 1], and \( d_{i}^{hyp} \) is the functional margin of each example \( x_{i} \), which is equivalent to the absolute value of the SVM decision value, as defined in Eq. (4):

\( d_{i}^{hyp} = \left| {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right| \)  (4)
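Equations (3) and (4) can be sketched directly; the decision values and the choice of β below are illustrative:

```python
import numpy as np

def exp_decay_membership(decision_values, beta=0.5):
    """Eq. (3): m_i = 2 / (1 + exp(beta * d_i_hyp)), where d_i_hyp is the
    absolute SVM decision value (Eq. (4)). On the hyperplane (d = 0) the
    membership is exactly 1; it decays toward 0 as the distance grows."""
    d_hyp = np.abs(decision_values)          # Eq. (4): functional margin
    return 2.0 / (1.0 + np.exp(beta * d_hyp))

# Illustrative decision values: on the hyperplane, near it, and far away.
d = np.array([0.0, 1.0, 5.0])
m = exp_decay_membership(d, beta=0.5)
```

In practice the decision values would come from a trained SVM (e.g. a classifier's `decision_function`), and β would be tuned per dataset as described in Sect. 3.2.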
2.4 Cleaning Techniques
First, truth and falsity models are trained using FSVM on the truth and falsity membership targets. Secondly, the prediction outputs of both models are compared with the actual outputs, and the misclassified patterns (MTruth, MFalsity) are detected. Finally, the misclassified instances are eliminated from the training data (T) and a new training data set is created. Two kinds of new training data, CMT1 and CMT2, can be created, depending on the elimination technique used, as follows:
The CMT1 training data set is constructed by eliminating every instance misclassified by either the truth or the falsity model. In CMT2, on the other hand, an instance is eliminated only if it is misclassified by both the truth and the falsity model.
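Assuming the misclassified instances are tracked as index sets, the two elimination schemes reduce to set operations; the index values below are illustrative:

```python
# Hypothetical index sets of instances misclassified by the truth model
# (M_truth) and by the falsity model (M_falsity) in a 10-instance set T.
M_truth = {1, 4, 7}
M_falsity = {4, 5, 7, 9}
T = set(range(10))

# CMT1 removes every instance misclassified by either model (union);
# CMT2 removes only instances misclassified by both models (intersection).
CMT1 = T - (M_truth | M_falsity)
CMT2 = T - (M_truth & M_falsity)
```

CMT1 is the more aggressive filter (it always removes at least as many instances as CMT2), while CMT2 only discards instances that both complementary models agree are suspect.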
2.5 Evaluation Method
The confusion matrix [16], shown in Table 1, is a typical measurement for assessing classification performance; it records the numbers of correctly and incorrectly recognised instances of each class. Accuracy is defined in Eq. (7):

\( Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \)  (7)
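Eq. (7) can be computed directly from the confusion matrix counts; the counts below are illustrative:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (7): fraction of correctly recognised instances over all
    instances, from the four confusion matrix cells."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 85 of 100 test instances classified correctly.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)  # -> 0.85
```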
3 Proposed Methodology
3.1 Datasets
In this paper, the binary classification problem is considered. Four benchmark imbalanced datasets from the UCI machine learning repository [17] and KEEL (Knowledge Extraction based on Evolutionary Learning) [18] are used. They are shown in Table 2.
3.2 Experimental Processes
The experimental processes in this work are based on the complementary technique (CMT) to handle the noise and outlier problems. There are three processes: (1) data cleaning with the CMT technique, (2) classification with FSVM, and (3) evaluation and comparison of results. The detailed steps are presented in Fig. 2.
In the first step of this study, the original datasets are normalised using the range transformation method. They are divided randomly into training and testing sets with a ratio of 80 % to 20 % respectively, and 10-fold cross validation is applied to each dataset. Once the samples are divided, the training samples are replicated into falsity and truth data. The complementary technique is then applied to the falsity data by complementing the target output of the truth data. After that, both falsity and truth data are classified by a back-propagation neural network (NN) using the CMT technique, known as the CMTNN method, and by a CMT fuzzy support vector machine, known as the CMTFSVM method. In this experiment, CMTFSVM uses the exponentially decaying membership function and a radial basis function kernel. The optimal value of β is chosen from the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} for each of the falsity and truth data.

The results from the falsity and truth datasets are then compared. The new training sets CMTNN1 and CMTFSVM1 are created by removing the misclassified instances detected by either the falsity or the truth method, while CMTNN2 and CMTFSVM2 are generated by eliminating only those misclassified instances detected by both methods. The new training datasets are then classified by FSVM, and the testing data are evaluated by comparing the results.

To further assess the proposed data cleaning techniques, 40 % of the instances in the original training datasets are randomly selected and their class values are changed to the complement of the original class before being injected back into the training datasets. The data cleaning techniques CMTNN and CMTFSVM are then applied to the noisy datasets, which are classified by the neural network and FSVM respectively.
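The normalisation and noise-injection steps above can be sketched as follows; the function names, and the assumption that labels are flipped in place for the selected instances, are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def range_normalise(X):
    """Range (min-max) transformation of each attribute into [0, 1]."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx - mn == 0, 1.0, mx - mn)  # guard constant columns
    return (X - mn) / span

def inject_class_noise(y, rate=0.4):
    """Flip the binary class label (its complement) of a randomly
    selected `rate` fraction of the training instances."""
    y = np.asarray(y).copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

# Illustrative use: 40 % class noise on 100 all-negative labels.
y_noisy = inject_class_noise(np.zeros(100, dtype=int), rate=0.4)
```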
4 Experiment Results
Datasets with binary classification problem from the UCI and KEEL are used in the experiment. A comparison of the results obtained by CMTNN and CMTFSVM techniques is shown in Table 3.
It is observed that the CMTFSVM1 technique gave better results than CMTNN1 by approximately 2 %. Moreover, comparing CMTNN2 and CMTFSVM2, CMTFSVM2 achieved higher accuracy than CMTNN2 on all datasets. Finally, an analysis of CMTFSVM1 and CMTFSVM2 indicated that CMTFSVM2 performed better by approximately 2 %, 1 %, 0.3 % and 2 % on German, Ionosphere, Pima and Yeast3 respectively. Overall, these results indicate that CMTNN2 and CMTFSVM2 gave better results than CMTNN1 and CMTFSVM1, and that CMTFSVM2 performed better than CMTNN2 on all datasets. The next experiment investigates the performance of the data cleaning techniques CMTNN2 and CMTFSVM2 when 40 % noise is added to the datasets. The results are compared with those of the Neural Network (NN) and the Fuzzy Support Vector Machine (FSVM), and are shown in Table 4.
It can be seen that FSVM gave better results than NN, with accuracy approximately 3 % higher on the original datasets. When the cleaning techniques were applied to the original datasets, CMTFSVM2 achieved accuracy approximately 1.5 % higher than CMTNN2, although CMTFSVM2 showed lower accuracy than FSVM. However, after adding 40 % noise to the original datasets, it was obvious that CMTFSVM2 handled the noisy datasets better than FSVM. Moreover, the accuracy of CMTFSVM2 was higher than that of CMTNN2 by approximately 8 %, 19 %, 5 % and 6 % on German, Ionosphere, Pima and Yeast3 respectively.
5 Conclusion
In this study, the CMTFSVM data cleaning technique is proposed to eliminate outliers and class noise. Noise is added to 40 % of the training data, and classification based on the FSVM classifier is then compared with the NN classifier, using accuracy to evaluate system performance. Four well-known datasets from the UCI and KEEL repositories are analysed. The classification accuracy shows that CMTFSVM is robust and performs better on noisy data in the comparison study. Moreover, it can enhance classification performance in terms of accuracy by approximately 10 % for all datasets.
References
1. Twala, B.: Impact of noise on credit risk prediction: does data quality really matter? J. Intell. Data Anal. 17, 1115–1134 (2013)
2. Ang, E., Kwasnick, S., Bayati, M., Plambeck, E.L., Aratow, M.: Accurate emergency department wait time prediction. J. Manufact. Serv. Oper. Manage. 18, 141–156 (2015)
3. Sessions, V., Valtorta, M.: The effects of data quality on machine learning algorithms. In: The 11th International Conference on Information Quality (ICIQ-06) (2006)
4. López, V., et al.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
5. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study of their impacts. J. Artif. Intell. Rev. 22, 177–210 (2004)
6. Garcia, L.P.F., Lorena, A.C., Carvalho, A.C.P.L.F.: A study on class noise detection and elimination. In: The 2012 Brazilian Symposium on Neural Networks (SBRN), pp. 13–18. IEEE, Curitiba (2012)
7. Sluban, B., Gamberger, D., Lavrač, N.: Advances in class noise detection. In: 19th European Conference on Artificial Intelligence, pp. 1105–1106. IOS Press, The Netherlands (2010)
8. Verbaeten, S., Van Assche, A.: Ensemble methods for noise elimination in classification problems. In: Windeatt, T., Roli, F. (eds.) Multiple Classifier Systems, vol. 2709, pp. 317–325. Springer, Heidelberg (2003)
9. Jeatrakul, P., Wong, K.W., Fung, C.C.: Using misclassification analysis for data cleaning. In: International Workshop on Advanced Computational Intelligence and Intelligent Informatics (2009)
10. Lee, G.H., Taur, J.S., Tao, C.W.: A robust fuzzy support vector machine for two-class pattern classification. Int. J. Fuzzy Syst. 8, 76–86 (2006)
11. Jeatrakul, P., Wong, K.W., Fung, C.C.: Data cleaning for classification using misclassification analysis. J. Adv. Comput. Intelligence Intell. Inform. 14, 297–302 (2010)
12. Jing, R., Zhang, Y.: A view of support vector machines algorithm on classification problems. In: International Conference on Multimedia Communications (MEDIACOM), pp. 13–16, Hong Kong (2010)
13. Batuwita, R., Palade, V.: FSVM-CIL: fuzzy support vector machines for class imbalance learning. IEEE Trans. Fuzzy Syst. 18, 558–571 (2010)
14. Lin, C.-F., Wang, S.-D.: Fuzzy support vector machines. IEEE Trans. Neural Netw. 13, 464–471 (2002)
15. Samma, H., Lim, C.P., Ngah, U.K.: A hybrid PSO-FSVM model and its application to imbalanced classification of mammograms. In: Selamat, A., Nguyen, N.T., Haron, H. (eds.) ACIIDS 2013, Part I. LNCS, vol. 7802, pp. 275–284. Springer, Heidelberg (2013)
16. Ramyachitra, D., Manikandan, P.: Imbalanced dataset classification and solutions: a review. Int. J. Comput. Bus. Res. (IJCBR) 5(4), 1–29 (2014)
17. Lichman, M.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2013)
18. Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011)
Pruengkarn, R., Wong, K.W., Fung, C.C. (2016). Data Cleaning Using Complementary Fuzzy Support Vector Machine Technique. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) Neural Information Processing. ICONIP 2016. Lecture Notes in Computer Science, vol. 9948. Springer, Cham. https://doi.org/10.1007/978-3-319-46672-9_19