Summary
The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito of the Applied Physics Laboratory at Johns Hopkins University, is one of the best-known datasets for testing classification algorithms. It consists of records describing 768 female patients of Pima Indian heritage who are at least 21 years old and live near Phoenix, Arizona, USA. The problem is to predict whether a new patient would test positive for diabetes. However, the classification accuracy that current algorithms achieve on this dataset is often coincidental. The root of this problem is the overfitting and overgeneralization behavior of a classification algorithm when it processes a dataset. Although this issue is of fundamental importance in data mining, it has not been studied comprehensively. This paper therefore describes a new approach, called the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou in [2-3], to optimally control the overfitting and overgeneralization behaviors of classification on this dataset. The HBA is used in conjunction with traditional classification approaches (such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), or Decision Trees (DTs)) to enhance their classification accuracy. Computational results suggest that the proposed approach significantly outperforms current approaches.
References
[1] Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine, California, USA (2007)
[2] Pham, H.N.A., Triantaphyllou, E.: The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining. In: Maimon, O., Rokach, L. (eds.) Soft Computing for Knowledge Discovery and Data Mining, Part 4, ch. 5, pp. 391–431. Springer, Heidelberg (2007)
[3] Pham, H.N.A., Triantaphyllou, E.: An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining (January 2008) (submitted for publication)
[4] American Diabetes Association (2007), http://www.diabetes.org/home.jsp
[5] World Health Organization: Diabetes Mellitus: Report of a WHO Study Group. Technical Report Series 727. WHO, Geneva (1985)
[6] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, pp. 261–265 (1988)
[7] Jankowski, N., Kadirkamanathan, V.: Statistical control of RBF-like networks for classification. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 385–390. Springer, Heidelberg (1997)
[8] Au, W.H., Chan, K.C.C.: Classification with degree of membership: A fuzzy approach. In: Proceedings of the 1st IEEE Int’l Conference on Data Mining, San Jose, California, USA, pp. 35–42 (2001)
[9] Rutkowski, L., Cpalka, K.: Flexible neuro-fuzzy systems. IEEE Transactions on Neural Networks 14, 554–574 (2003)
[10] Davis IV, W.L.: Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BKProducts. PhD Dissertation in Computer Science, pp. 71–74 (2006)
[11] Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification, ch. 9. Series Artificial Intelligence, pp. 157–160. Prentice Hall, Englewood Cliffs (1994)
[12] Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, pp. 56–64. Wiley Publisher, Chichester (1973)
[13] Artificial Neural Network Toolbox 6.0 and Statistics Toolbox 6.0, Matlab Version 7.0, http://www.mathworks.com/products/
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Pham, H.N.A., Triantaphyllou, E. (2008). Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization. In: Lee, R., Kim, HK. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79187-4_2
DOI: https://doi.org/10.1007/978-3-540-79187-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79186-7
Online ISBN: 978-3-540-79187-4
eBook Packages: Engineering, Engineering (R0)