Abstract
Finding and removing misclassified instances is an important step in data mining and machine learning that affects the performance of the data mining algorithm in general. In this paper, we propose a C-Support Vector Classification Filter (C-SVCF) to identify and remove the misclassified instances (outliers) in breast cancer survivability samples collected from Srinagarind Hospital in Thailand, in order to improve the accuracy of the prediction models. Only instances that are correctly classified by the filter are passed to the learning algorithm. The performance of the proposed technique is measured by accuracy and area under the receiver operating characteristic curve (AUC), and is compared with several popular ensemble filter approaches, including AdaBoost, Bagging, and ensembles of SVM with AdaBoost and Bagging filters. Our empirical results indicate that C-SVCF is an effective method for identifying misclassified outliers. This approach significantly benefits ongoing research on developing accurate and robust prediction models for breast cancer survivability.
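The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the filter's predictions are obtained, so k-fold cross-validation is assumed here, and a small nearest-centroid classifier (a hypothetical stand-in named NearestCentroid) is used in place of C-SVC so that the sketch runs with the standard library only. The function name classification_filter and the toy data are likewise assumptions. Any object with fit and predict methods, such as a C-SVC wrapper, could be passed instead.

```python
from collections import defaultdict


class NearestCentroid:
    """Hypothetical stand-in classifier; the paper uses C-SVC instead."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] += 1
        # One mean vector per class label.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def nearest(row):
            return min(
                self.centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, self.centroids[c])),
            )

        return [nearest(row) for row in X]


def classification_filter(X, y, make_classifier, n_folds=5):
    """Split instances into (kept, removed) index lists.

    Assumption: each instance is predicted by a model trained on the other
    folds; instances whose prediction disagrees with their label are removed.
    """
    n = len(X)
    kept, removed = [], []
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        if not test_idx or not train_idx:
            continue
        model = make_classifier().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        predictions = model.predict([X[i] for i in test_idx])
        for i, predicted in zip(test_idx, predictions):
            (kept if predicted == y[i] else removed).append(i)
    return sorted(kept), sorted(removed)


# Toy data (made up for illustration): the last instance sits in the
# class-0 cluster but carries label 1.
toy_X = [[0.0], [0.2], [0.4], [0.1], [0.3], [5.0], [5.2], [5.4], [5.1], [0.25]]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
kept, removed = classification_filter(toy_X, toy_y, NearestCentroid, n_folds=5)
print(removed)  # → [9]
```

Only the instances indexed by kept would then be passed on to the downstream learning algorithm; the removed ones are the instances the filter treats as misclassified outliers.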
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thongkam, J., Xu, G., Zhang, Y., Huang, F. (2008). Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction. In: Ishikawa, Y., et al. Advanced Web and Network Technologies, and Applications. APWeb 2008. Lecture Notes in Computer Science, vol 4977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89376-9_10
DOI: https://doi.org/10.1007/978-3-540-89376-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89375-2
Online ISBN: 978-3-540-89376-9
eBook Packages: Computer Science, Computer Science (R0)