1 Introduction

Machine learning classification algorithms are used in many business applications, such as credit risk prediction and wait-time prediction for patients in emergency department waiting rooms [1, 2]. Consequently, data quality has an enormous effect on the accuracy and efficiency of these algorithms [3]. Two main concerns relating to data quality are imbalanced data and noisy data. In imbalanced data, the instances of one class, considered the majority or negative class, overwhelm the other class, the minority or positive class. The dominance of the negative class can lead a classifier to treat positive class examples as noise and wrongly discard them [3, 4]. On the other hand, there are two types of noise that can reduce system performance: class noise and attribute noise. Attribute noise is due to errors in the data attributes, such as missing values and redundant data, while class noise is due to errors in the instances, for example when similar instances are labelled with different classes or instances are assigned to the wrong class [5].

A study by Garcia, Lorena and Carvalho [6] used consensus and majority voting strategies to identify class mislabelling with ensemble classifiers. The results showed that consensus voting was unable to identify most of the noise, whereas majority voting was more successful. Although the ensemble technique performed well and could handle noisy data, it only provided good results for some datasets. Sluban, Gamberger and Lavrac [7] proposed a high agreement random forest filter for noise detection. It performed better than other classification filters, such as Naïve Bayes and SVM, but not on all datasets. Verbaeten and Van Assche [8] used ensemble methods such as committees, bagging and boosting to identify noisy instances and remove outliers from the training set. The results showed that the cost of finding new training instances was high on small datasets. Jeatrakul, Wong and Fung [9] presented the Complementary Neural Network (CMTNN) technique to identify noisy data and enhance the performance of a neural network classifier. The technique was applied to both binary and multiclass classification problems; it can handle the misclassification problem and improve prediction accuracy. However, a trained neural network is a black box whose rules of operation are unknown. Lee, Taur and Tao [10] proposed outlier detection with the Fuzzy Support Vector Machine (FSVM) and found that FSVM was more robust against outliers.

In order to handle the noise and outlier problem, this study proposes an extension of the Complementary (CMT) technique that combines CMT with FSVM. The structure of this paper is as follows. Section 2 provides background on the complementary technique and the fuzzy support vector machine. Section 3 presents the proposed methodology and the evaluation method used in this study. The results and conclusion are presented in Sects. 4 and 5 respectively.

2 Background

In this section, the complementary technique, the data cleaning techniques and the evaluation method are described.

2.1 Complementary Neural Network

Complementary Neural Network (CMTNN) [11] is a misclassification analysis technique used to enhance the quality of training data by comparing the predictions obtained from truth- and falsity-trained classifiers. A Truth Neural Network (Truth NN) and a Falsity Neural Network (Falsity NN) form a pair of complementary feed-forward back-propagation neural networks, as shown in Fig. 1. The training data are used to train the Truth NN and the Falsity NN to predict the degrees of truth membership and false membership respectively. The architectures of the Truth NN and the Falsity NN are identical, except that the target outputs of the Falsity NN are the complement of the Truth NN target outputs. The difference between the truth membership and false membership values represents the uncertainty in the classification process.

Fig. 1. Complementary Neural Network (Source: Advanced Computational Intelligence and Intelligent Informatics 14 (2010), p. 298)
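
As a concrete illustration of the pairing in Fig. 1, the following minimal sketch trains a truth and a falsity network and measures their disagreement. It assumes scikit-learn's MLPRegressor as a stand-in for the feed-forward back-propagation network and binary targets in {0, 1}; the paper does not prescribe an implementation, so all names and parameters here are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def cmtnn_memberships(X_train, y_train):
    """Train a truth NN on the targets and a falsity NN on their
    complement, then return both predicted membership degrees."""
    truth_nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
    falsity_nn = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
    truth_nn.fit(X_train, y_train)        # truth targets in {0, 1}
    falsity_nn.fit(X_train, 1 - y_train)  # complemented targets
    t = truth_nn.predict(X_train)         # truth membership degree
    f = falsity_nn.predict(X_train)       # false membership degree
    # the gap between truth and (1 - false) membership signals uncertainty
    uncertainty = np.abs(t - (1.0 - f))
    return t, f, uncertainty
```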

2.2 Fuzzy Support Vector Machine

Support Vector Machine (SVM) [12] is widely used in machine learning and works effectively with balanced datasets. It aims to find an optimal separating hyperplane by solving the optimisation problem shown below:

$$ Min\left( {\frac{1}{2}\omega \cdot \omega + C\sum\nolimits_{i = 1}^{l} {\xi_{i} } } \right) $$
(1)

subject to \( y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)

where \( y_{i} \) is the class label, ω is the normal (weight) vector of the hyperplane, C is a penalty parameter, \( \xi_{i} \) is the slack variable for misclassified examples, Φ is a mapping function that transforms the data into a higher-dimensional feature space, and b is the bias.

However, SVM is sensitive to outliers and noise [13]. Therefore, the Fuzzy Support Vector Machine (FSVM) has been proposed to handle these problems. In FSVM, a fuzzy membership value \( m_{i} \) is assigned to each input point to represent its importance to its own class [14, 15], as shown in Eq. (2). Lower membership values are assigned to less important examples such as outliers and noise. The computation of \( m_{i} \) is described in Sect. 2.3.

$$ Min\left( {\frac{1}{2}\omega \cdot \omega + C\sum\nolimits_{i = 1}^{l} {m_{i} \xi_{i} } } \right) $$
(2)

subject to \( y_{i} \left( {\omega \cdot {\Upphi }\left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,\; \xi_{i} \ge 0,\; i = 1, \ldots , l \)
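
Since the objective in Eq. (2) simply scales each example's slack penalty by \( m_{i} \), one way to realise FSVM in practice is through per-sample weights. The sketch below assumes scikit-learn's SVC, whose fit() accepts sample_weight and scales C per example; the membership vector m is taken as precomputed (see Sect. 2.3).

```python
from sklearn.svm import SVC

def fit_fsvm(X_train, y_train, m, C=1.0):
    """Fit an SVM whose slack penalty for example i is scaled by m[i],
    mirroring the C * m_i * xi_i term of Eq. (2)."""
    clf = SVC(C=C, kernel="rbf")
    clf.fit(X_train, y_train, sample_weight=m)
    return clf
```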

2.3 Complementary Fuzzy Support Vector Machine

The Complementary Fuzzy Support Vector Machine (CMTFSVM) applies the complementary (CMT) concept of truth and complemented target outputs from CMTNN, using the Fuzzy Support Vector Machine (FSVM) as the classifier to identify uncertain data. An exponentially decaying function based on the distance from the separating hyperplane [13] is used as the fuzzy membership value, as follows:

$$ f\left( {x_{i} } \right) = \frac{2}{{1 + exp\left( {\beta d_{i}^{hyp} } \right)}} $$
(3)

where β is the steepness of the decay, with β ∊ [0, 1], and \( d_{i}^{hyp} \) is the functional margin of each example \( x_{i} \), which is equivalent to the absolute value of the SVM decision value and is defined in Eq. (4).

$$ d_{i}^{hyp} = y_{i} \left( {\omega \cdot {\Upphi}\left( {x_{i} } \right) + b} \right) $$
(4)
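
A minimal sketch of Eqs. (3) and (4): an initial SVM supplies the decision values from which the functional margin and the exponentially decaying membership are computed. It assumes labels in {-1, +1}; the base classifier and the default β are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def fuzzy_memberships(X_train, y_train, beta=0.5):
    """Compute m_i = 2 / (1 + exp(beta * d_i_hyp)) as in Eq. (3)."""
    base = SVC(kernel="rbf").fit(X_train, y_train)
    # functional margin of Eq. (4); assumes y_train in {-1, +1}
    d_hyp = y_train * base.decision_function(X_train)
    return 2.0 / (1.0 + np.exp(beta * d_hyp))
```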

2.4 Cleaning Techniques

First, truth and falsity models are trained using FSVM on the truth and falsity (complemented) target outputs. Secondly, the prediction outputs of the truth and falsity models are compared with the actual outputs, and the misclassified patterns (\( M_{Truth} \), \( M_{Falsity} \)) are detected. Finally, the misclassified instances are eliminated from the training data (T) and a new training data set is created. Two kinds of new training data, CMT1 and CMT2, can be created depending on the elimination technique used, as follows:

$$ CMT1 = T - (M_{Truth} \cup M_{Falsity} ) $$
(5)
$$ CMT2 = T - \left( {M_{Truth} \cap M_{Falsity}} \right) $$
(6)

The CMT1 training data set is constructed by eliminating all instances misclassified by either the truth or the falsity model (the union). In contrast, misclassified instances are eliminated from CMT2 only if they are detected by both the truth and the falsity models (the intersection).
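
The elimination step of Eqs. (5) and (6) reduces to set operations on the misclassified indices. The sketch below assumes binary labels in {0, 1} and classifiers truth_clf and falsity_clf already trained on the original and complemented targets; all names are illustrative.

```python
import numpy as np

def cmt_clean(X, y, truth_clf, falsity_clf):
    """Return the CMT1 and CMT2 training sets of Eqs. (5) and (6)."""
    m_truth = truth_clf.predict(X) != y            # M_Truth: wrong on truth targets
    m_falsity = falsity_clf.predict(X) != (1 - y)  # M_Falsity: wrong on falsity targets
    keep1 = ~(m_truth | m_falsity)  # CMT1: remove the union, Eq. (5)
    keep2 = ~(m_truth & m_falsity)  # CMT2: remove the intersection, Eq. (6)
    return (X[keep1], y[keep1]), (X[keep2], y[keep2])
```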

2.5 Evaluation Method

The confusion matrix [16], shown in Table 1, is a typical measurement for assessing classification performance; it records the counts of correctly and incorrectly recognised instances of each class. The accuracy is defined in Eq. (7).

Table 1. Confusion matrix terminology

                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
(7)
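
For completeness, Eq. (7) translated directly into code:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as defined in Eq. (7)."""
    return (tp + tn) / (tp + tn + fp + fn)
```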

3 Proposed Methodology

3.1 Datasets

In this paper, the binary classification problem is considered. Four benchmark imbalanced datasets (German, Ionosphere, Pima and Yeast3) from the UCI machine learning repository [17] and KEEL (Knowledge Extraction based on Evolutionary Learning) [18] are used. They are shown in Table 2.

Table 2. Detail of datasets

3.2 Experimental Processes

The experimental processes in this work are based on the complementary technique (CMT) for handling the noise and outlier problems. There are three processes: (1) data cleaning with the CMT technique, (2) classification with FSVM, and (3) evaluation and comparison of results. The detailed steps are presented in Fig. 2.

Fig. 2. Comparison of outlier and noise handling with CMTNN and CMTFSVM techniques

In the first step of this study, the original datasets are normalised using the range transformation method and divided randomly into training and testing sets with a ratio of 80 % to 20 % respectively; 10-fold cross validation is applied to each dataset. Once the samples are divided, the training samples are replicated into falsity and truth data, and the complementary technique is applied to the falsity data by complementing the target outputs of the truth data. After that, both the falsity data and the truth data are classified by a back-propagation neural network (NN) using the CMT technique, known as the CMTNN method, and by a CMT fuzzy support vector machine, known as the CMTFSVM method. CMTFSVM in this experiment uses the exponentially decaying membership function and a radial basis function kernel. The optimal value of β is chosen from the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} for each of the falsity and truth data, and the results from the falsity and truth datasets are compared. Following this process, the new training sets CMTNN1 and CMTFSVM1 are created by removing the misclassified instances detected by either the falsity or the truth method, while CMTNN2 and CMTFSVM2 are generated by eliminating only the misclassified instances detected by both the falsity and truth methods. The new training datasets are then classified by FSVM. Finally, the testing data are evaluated and the results compared.

To further assess the proposed data cleaning techniques, 40 % of the instances are randomly selected from each original training dataset, their class values are changed to the complement of the original class, and the flipped instances are injected into the original training datasets. The data cleaning techniques CMTNN and CMTFSVM are then applied to the noisy datasets, which are classified by a neural network and FSVM respectively.
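
As one plausible reading of the noise-injection protocol described above, the sketch below complements the class of a randomly selected 40 % of the training instances and injects the flipped copies into the original training set; binary labels in {0, 1} and the random seed are assumptions.

```python
import numpy as np

def inject_class_noise(X, y, rate=0.4, seed=0):
    """Complement the class of a random `rate` fraction of instances
    and inject the flipped copies into the original training data."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    X_aug = np.vstack([X, X[idx]])
    y_aug = np.concatenate([y, 1 - y[idx]])  # complemented class values
    return X_aug, y_aug
```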

4 Experiment Results

Binary classification datasets from the UCI and KEEL repositories are used in the experiment. A comparison of the results obtained by the CMTNN and CMTFSVM techniques is shown in Table 3.

Table 3. Comparing the accuracy of results between CMTNN and CMTFSVM techniques.

It is observed that the CMTFSVM1 technique gave better results than CMTNN1 by approximately 2 %. Moreover, comparing the CMTNN2 and CMTFSVM2 techniques, CMTFSVM2 achieved higher accuracy than CMTNN2 on all datasets. Finally, an analysis of CMTFSVM1 and CMTFSVM2 indicated that CMTFSVM2 performed better by approximately 2 %, 1 %, 0.3 % and 2 % on German, Ionosphere, Pima and Yeast3 respectively. Overall, these results indicate that CMTNN2 and CMTFSVM2 gave better results than CMTNN1 and CMTFSVM1, and that CMTFSVM2 performed better than CMTNN2 on all datasets. The next experiment investigates the performance of the data cleaning techniques CMTNN2 and CMTFSVM2 after adding 40 % noise into the datasets. The results are compared with those of the plain Neural Network (NN) and Fuzzy Support Vector Machine (FSVM), and are shown in Table 4.

Table 4. Comparison of accuracy after injection of 40 % of noise into the datasets

It can be seen that FSVM gave better results than NN; on the original datasets, FSVM outperformed NN by approximately 3 %. When the cleaning techniques were applied to the original datasets, CMTFSVM2 achieved approximately 1.5 % higher accuracy than CMTNN2, although CMTFSVM2 showed lower accuracy than FSVM. However, after adding 40 % noise to the original datasets, it was clear that CMTFSVM2 handled the noisy datasets better than FSVM. Moreover, the accuracy of CMTFSVM2 was higher than that of CMTNN2 by approximately 8 %, 19 %, 5 % and 6 % on German, Ionosphere, Pima and Yeast3 respectively.

5 Conclusion

In this study, the CMTFSVM data cleaning technique is proposed to eliminate outliers and class noise. Class noise is injected into 40 % of the training data, and classification based on the FSVM classifier is then compared with an NN classifier, with accuracy used to evaluate system performance. Four well-known datasets from the UCI and KEEL repositories are analysed. The classification accuracy shows that CMTFSVM is robust and performs better with noisy data in the comparison study. Moreover, it can enhance the classification performance in terms of accuracy by approximately 10 % for all datasets.