Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization

Pham, Huy Nguyen Anh; Triantaphyllou, Evangelos

doi:10.1007/978-3-540-79187-4_2

Part of the book series: Studies in Computational Intelligence (SCI, volume 131)

Summary

The Pima Indian diabetes (PID) dataset [1], originally donated by Vincent Sigillito of the Applied Physics Laboratory at the Johns Hopkins University, is one of the best-known datasets for testing classification algorithms. It consists of records describing 768 female patients of Pima Indian heritage who are at least 21 years old and live near Phoenix, Arizona, USA. The problem is to predict whether a new patient would test positive for diabetes. However, the classification accuracy that current algorithms achieve on this dataset is oftentimes coincidental. The root of this problem is the overfitting and overgeneralization behavior of a classification algorithm as it processes a dataset. Although this issue is of fundamental importance in data mining, it has not been studied from a comprehensive point of view. This paper therefore describes a new approach, the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou in [2-3], to optimally control the overfitting and overgeneralization behaviors of classification on this dataset. The HBA is used in conjunction with traditional classification approaches, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), or Decision Trees (DTs), to enhance their classification accuracy. Computational results seem to indicate that the proposed approach significantly outperforms current approaches.
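The two failure modes named in the summary, and the claim that a reported accuracy can be coincidental, can be illustrated with a small sketch. It is not the HBA and does not use the PID data: it assumes synthetic two-feature data with 20% label noise, a 1-nearest-neighbor classifier as a stand-in for a model that overfits, and a majority-class predictor as a stand-in for a model that overgeneralizes. All function names here are hypothetical and chosen only for illustration.

```python
import random


def make_data(n, seed):
    """Synthetic two-feature, two-class data with 20% label noise (not the PID data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(0, 1), rng.uniform(0, 1))
        label = 1 if x[0] + x[1] > 1.0 else 0
        if rng.random() < 0.2:  # flip the label to simulate noise
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Label of the closest training point: memorizes the training set."""
    nearest = min(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )
    return nearest[1]


def predict_majority(train, x):
    """Always the most frequent training label: ignores the features entirely."""
    ones = sum(label for _, label in train)
    return 1 if 2 * ones >= len(train) else 0


def accuracy(predict, train, evaluation):
    """Fraction of evaluation points whose predicted label is correct."""
    hits = sum(predict(train, x) == label for x, label in evaluation)
    return hits / len(evaluation)


def split_accuracies(data, predict, n_splits=10, train_fraction=0.7, seed=0):
    """Test accuracy of one classifier over several random train/test splits."""
    rng = random.Random(seed)
    cut = int(train_fraction * len(data))
    results = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:cut], shuffled[cut:]
        results.append(accuracy(predict, train, test))
    return results


data = make_data(300, seed=42)
train, test = data[:210], data[210:]

# Overfitting: perfect on the data it memorized, weaker on unseen data.
train_acc_1nn = accuracy(predict_1nn, train, train)
test_acc_1nn = accuracy(predict_1nn, train, test)
print("1-NN train/test accuracy:", train_acc_1nn, test_acc_1nn)

# Overgeneralization: one label for every input, regardless of features.
print("majority test accuracy:", accuracy(predict_majority, train, test))

# The same classifier scores differently depending on the split drawn.
spread = split_accuracies(data, predict_1nn)
print("1-NN test accuracy range over splits:", min(spread), max(spread))
```

Comparing the training and test accuracy of the 1-NN model shows the gap typical of overfitting, while the range over splits shows why a single reported percentage says little on its own.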

References

[1] Asuncion, A., Newman, D.J.: UCI-Machine Learning Repository. School of Information and Computer Sciences. University of California, Irvine, California, USA (2007)
[2] Pham, H.N.A., Triantaphyllou, E.: The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining. In: Maimon, O., Rokach, L. (eds.) Soft Computing for Knowledge Discovery and Data Mining, Part 4, ch. 5, pp. 391–431. Springer, Heidelberg (2007)
[3] Pham, H.N.A., Triantaphyllou, E.: An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining (January 2008) (submitted for publication)
[4] American Diabetes Association (2007), http://www.diabetes.org/home.jsp
[5] World Health Organization, Diabetes Mellitus: Report of a WHO Study Group. Geneva: WHO, Technical Report Series 727 (1985)
[6] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, pp. 261–265 (1988)
[7] Jankowski, N., Kadirkamanathan, V.: Statistical control of RBF-like networks for classification. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 385–390. Springer, Heidelberg (1997)
[8] Au, W.H., Chan, K.C.C.: Classification with degree of membership: A fuzzy approach. In: Proceedings of the 1st IEEE Int’l Conference on Data Mining, San Jose, California, USA, pp. 35–42 (2001)
[9] Rutkowski, L., Cpalka, K.: Flexible neuro-fuzzy systems. IEEE Transactions on Neural Networks 14, 554–574 (2003)
[10] Davis IV, W.L.: Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BKProducts. PhD Dissertation in Computer Science, pp. 71–74 (2006)
[11] Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification, ch. 9. Series Artificial Intelligence, pp. 157–160. Prentice Hall, Englewood Cliffs (1994)
[12] Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, pp. 56–64. Wiley Publisher, Chichester (1973)
[13] Artificial Neural Network Toolbox 6.0 and Statistics Toolbox 6.0, Matlab Version 7.0, http://www.mathworks.com/products/

Author information

Authors and Affiliations

Department of Computer Science, Louisiana State University, 298 Coates Hall, Baton Rouge, LA 70803
Huy Nguyen Anh Pham & Evangelos Triantaphyllou

Editor information

Roger Lee, Haeng-Kon Kim

About this chapter

Cite this chapter

Pham, H.N.A., Triantaphyllou, E. (2008). Prediction of Diabetes by Employing a New Data Mining Approach Which Balances Fitting and Generalization. In: Lee, R., Kim, HK. (eds) Computer and Information Science. Studies in Computational Intelligence, vol 131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79187-4_2

DOI: https://doi.org/10.1007/978-3-540-79187-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79186-7
Online ISBN: 978-3-540-79187-4
eBook Packages: Engineering, Engineering (R0)
