Daehr: A Discriminant Analysis Framework for Electronic Health Record Data and an Application to Early Detection of Mental Health Disorders

Published: 08 February 2017 Publication History

Abstract

Electronic health records (EHR) provide a rich source of temporal data that present a unique opportunity to characterize disease patterns and risk of imminent disease. While many data-mining tools have been adopted for EHR-based disease early detection, linear discriminant analysis (LDA) is one of the most commonly used statistical methods. However, it is difficult to train an accurate LDA model for early disease diagnosis when too few patients are known to have the target disease. Furthermore, EHR data are heterogeneous with significant noise. In such cases, the covariance matrices used in LDA are usually singular and estimated with a large variance.
This article presents Daehr, an extension of the LDA framework for electronic health record data that addresses these issues. Beyond existing LDA analyzers, Daehr is designed to (1) eliminate the data noise caused by the manual encoding of EHR data and (2) lower the variance of parameter (covariance matrix) estimation for LDA models when only a few patients’ EHRs are available for training. To achieve these two goals, we designed an iterative algorithm that improves covariance matrix estimation with embedded data-noise and parameter-variance reduction for LDA. We evaluated Daehr extensively using data from the College Health Surveillance Network, a large, real-world EHR dataset. Specifically, our experiments compared the performance of Daehr to three baselines (i.e., LDA and its derivatives) in identifying college students at high risk for mental health disorders from 23 U.S. universities. Experimental results demonstrate that Daehr significantly outperforms the three baselines, achieving 1.4%–19.4% higher accuracy and a 7.5%–43.5% higher F1-score.
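
To make the small-sample problem described in the abstract concrete, the following minimal Python sketch shows why the pooled covariance matrix used by two-class LDA becomes singular when there are fewer patients than features, and how a generic regularizer restores invertibility. This is not the Daehr algorithm: the shrinkage toward a scaled identity, the synthetic Gaussian data, and all names (`pooled_covariance`, `shrink_covariance`, `fit_lda`, `alpha`) are assumptions chosen for illustration only.

```python
import numpy as np


def pooled_covariance(X_pos, X_neg):
    """Pooled within-class sample covariance for two classes."""
    def scatter(X):
        Xc = X - X.mean(axis=0)
        return Xc.T @ Xc
    n = len(X_pos) + len(X_neg)
    return (scatter(X_pos) + scatter(X_neg)) / (n - 2)


def shrink_covariance(S, alpha):
    """Generic shrinkage toward a scaled identity (illustrative, not Daehr)."""
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


def fit_lda(X_pos, X_neg, alpha=0.1):
    """Two-class LDA direction w and offset b using the shrunk covariance."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S = shrink_covariance(pooled_covariance(X_pos, X_neg), alpha)
    w = np.linalg.solve(S, mu_pos - mu_neg)
    # Offset includes the log ratio of class sizes as an empirical prior.
    b = -0.5 * w @ (mu_pos + mu_neg) + np.log(len(X_pos) / len(X_neg))
    return w, b


# Synthetic stand-in for EHR features: 50 features, only 8 positive cases.
rng = np.random.default_rng(0)
p = 50
X_neg = rng.normal(0.0, 1.0, size=(30, p))
X_pos = rng.normal(0.5, 1.0, size=(8, p))

S_raw = pooled_covariance(X_pos, X_neg)
rank_raw = np.linalg.matrix_rank(S_raw)      # at most 30 + 8 - 2 = 36 < 50
S_reg = shrink_covariance(S_raw, alpha=0.1)
min_eig_reg = np.linalg.eigvalsh(S_reg).min()

w, b = fit_lda(X_pos, X_neg, alpha=0.1)
scores = X_pos @ w + b                       # score > 0 means "predict positive"
print("rank of raw pooled covariance:", rank_raw, "of", p)
print("smallest eigenvalue after shrinkage:", min_eig_reg)
```

With 38 patients and 50 features the raw pooled estimate cannot have full rank, so plain LDA has no inverse to work with; the shrunk estimate is positive definite and the discriminant direction can be solved for. The shrinkage weight `alpha` trades bias for variance and would normally be tuned on held-out data.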

Supplementary Material

a47-xiong-apndx.pdf (xiong.zip)
Supplemental movie, appendix, image, and software files for "Daehr: A Discriminant Analysis Framework for Electronic Health Record Data and an Application to Early Detection of Mental Health Disorders"

    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 8, Issue 3
    Special Issue: Mobile Social Multimedia Analytics in the Big Data Era and Regular Papers
    May 2017
    320 pages
    ISSN:2157-6904
    EISSN:2157-6912
    DOI:10.1145/3040485
    Editor: Yu Zheng
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 February 2017
    Accepted: 01 October 2016
    Revised: 01 July 2016
    Received: 01 December 2015
    Published in TIST Volume 8, Issue 3

    Author Tags

    1. Predictive models
    2. anxiety/depression
    3. early detection
    4. electronic health data
    5. temporal order

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • University of Virginia Hobby Postdoctoral and Predoctoral Fellowships in Computational Science
