Abstract
The volume of patient survey data generated by hospitals and the wider healthcare industry is growing rapidly. The curse of dimensionality is a barrier to extracting useful information from such data, information that could help in the treatment and care of patients. Methods are therefore needed to determine, from these large volumes of stored information, which features matter most for a desired output. Health-related quality of life (HRQOL), measured as patient satisfaction, is a powerful paradigm for reaching such an output. In such scenarios, it is difficult to identify, within high-dimensional data, the features that best represent the desired output and to explain them so that they can be used in the future at the point of care. In this paper we propose a Cluster-based Random Forest (CB-RF) method to identify the most important features for the desired output, the Expanded Prostate Cancer Index Composite-26 (EPIC-26) domain scores. EPIC-26 is used to assess a range of HRQOL issues related to the diagnosis and treatment of prostate cancer. Several feature extraction methods are applied, and the best is the proposed CB-RF model, which finds the most important features (10 or fewer) out of over 1500 and uses them to estimate patients' EPIC-26 values accurately, with on average an 85% coefficient of correlation between predicted and observed values on a real dataset of 5093 patients.
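The general idea of a cluster-based random forest can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how patients are clustered or how importance is scored, so the sketch assumes k-means clustering and scikit-learn's impurity-based feature importances. The data are synthetic, and every name in the code (`top_features_per_cluster`, `n_clusters`, `top_k`) is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def top_features_per_cluster(X, y, n_clusters=2, top_k=10, seed=0):
    """Cluster the rows, then rank features with one random forest per cluster.

    Assumed pipeline (not taken from the paper): k-means for clustering,
    impurity-based importances for ranking, Pearson r for evaluation.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    results = {}
    for c in range(n_clusters):
        Xc, yc = X[labels == c], y[labels == c]
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xc, yc, test_size=0.3, random_state=seed)

        # Forest on all features: used only to rank them by importance.
        ranker = RandomForestRegressor(n_estimators=50, random_state=seed)
        ranker.fit(X_tr, y_tr)
        top = np.argsort(ranker.feature_importances_)[::-1][:top_k]

        # Forest on the top-k features only: used to predict the score.
        reduced = RandomForestRegressor(n_estimators=50, random_state=seed)
        reduced.fit(X_tr[:, top], y_tr)

        # Correlation between predicted and observed held-out values.
        r = np.corrcoef(y_te, reduced.predict(X_te[:, top]))[0, 1]
        results[c] = {"features": top.tolist(), "corr": float(r),
                      "n_patients": int(len(yc))}
    return results


# Synthetic stand-in for survey data: 600 rows, 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 50))
X[:300, 0] += 6.0  # shift half the rows so two groups are clearly separated
# Only features 1 and 2 drive the synthetic target score.
y = 3.0 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=600)

results = top_features_per_cluster(X, y)
for cluster_id, info in results.items():
    print(cluster_id, info["n_patients"], info["features"][:3],
          round(info["corr"], 3))
```

Fitting a second forest on the selected features alone mirrors the claim in the abstract that a small subset (10 or fewer) suffices for estimation. The correlation is computed on held-out rows, so it is not inflated by the selection step.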
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sharifi, F., Mohammed, E., Crump, T., Far, B.H. (2019). A Cluster-Based Machine Learning Model for Large Healthcare Data Analysis. In: Younas, M., Awan, I., Benbernou, S. (eds) Big Data Innovations and Applications. Innovate-Data 2019. Communications in Computer and Information Science, vol 1054. Springer, Cham. https://doi.org/10.1007/978-3-030-27355-2_7
DOI: https://doi.org/10.1007/978-3-030-27355-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27354-5
Online ISBN: 978-3-030-27355-2
eBook Packages: Computer Science, Computer Science (R0)