Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction

  • Conference paper
Advanced Web and Network Technologies, and Applications (APWeb 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4977)

Abstract

Finding and removing misclassified instances is an important step in data mining and machine learning, because such instances affect the performance of the learning algorithm. In this paper, we propose a C-Support Vector Classification Filter (C-SVCF) that identifies and removes misclassified instances (outliers) in breast cancer survivability samples collected from Srinagarind Hospital in Thailand, in order to improve the accuracy of the prediction models. Only instances that are correctly classified by the filter are passed to the learning algorithm. The performance of the proposed technique is measured by accuracy and area under the receiver operating characteristic curve (AUC), and is compared with several popular ensemble filter approaches, including AdaBoost, Bagging, and ensembles of SVM with AdaBoost and Bagging filters. Our empirical results indicate that C-SVCF is an effective method for identifying misclassified outliers. This approach significantly benefits ongoing research on developing accurate and robust prediction models for breast cancer survivability.
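
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the kernel, the value of C, or how the filter is validated, so everything below is an assumption made for illustration. A linear soft-margin SVM trained by stochastic subgradient descent stands in for C-SVC, the filter judges each instance with a model trained on the other folds (k-fold), and the function names `train_linear_svm`, `predict`, and `svm_filter` as well as the toy data are hypothetical.

```python
import random


def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01, seed=0):
    """Fit (w, b) for a linear soft-margin SVM by stochastic subgradient descent.

    Minimises 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)).
    Labels must be -1 or +1. Assumption: a linear kernel stands in for
    whatever kernel the paper's C-SVC used.
    """
    rng = random.Random(seed)
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Regulariser gradient, spread evenly over the n samples.
            w = [wj - lr * wj / n for wj in w]
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + lr * C * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * C * y[i]
    return w, b


def predict(model, x):
    """Return +1 or -1 according to the sign of the decision function."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def svm_filter(X, y, k=5, C=1.0, seed=0):
    """Return sorted indices of instances that a held-out SVM classifies correctly.

    Assumption: k-fold filtering, so no instance is judged by a model
    that was trained on it.
    """
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    kept = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in order if i not in held_out]
        model = train_linear_svm([X[i] for i in train_idx],
                                 [y[i] for i in train_idx], C=C, seed=seed)
        kept.extend(i for i in fold if predict(model, X[i]) == y[i])
    return sorted(kept)


# Hypothetical toy data: two well-separated clusters of ten points each.
X = [(2.0 + 0.1 * i, 2.0 - 0.1 * i) for i in range(10)] + \
    [(-2.0 - 0.1 * i, -2.0 + 0.1 * i) for i in range(10)]
y = [1] * 10 + [-1] * 10
y[3], y[14] = -1, 1  # inject two label errors for the filter to find

kept = svm_filter(X, y)
removed = [i for i in range(len(X)) if i not in kept]
print("removed indices:", removed)
```

Only the instances in `kept` would then be passed to the downstream learning algorithm. Judging each instance with a model that never saw it is a design choice of this sketch; a filter trained and applied on the same data could simply memorise noisy labels.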

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Thongkam, J., Xu, G., Zhang, Y., Huang, F. (2008). Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction. In: Ishikawa, Y., et al. Advanced Web and Network Technologies, and Applications. APWeb 2008. Lecture Notes in Computer Science, vol 4977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89376-9_10

  • DOI: https://doi.org/10.1007/978-3-540-89376-9_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89375-2

  • Online ISBN: 978-3-540-89376-9

  • eBook Packages: Computer Science, Computer Science (R0)
