ABSTRACT
Background Although accurate identification of gender identity in the electronic health record (EHR) is crucial for providing equitable health care, particularly for transgender and gender diverse (TGD) populations, it remains a challenging task due to incomplete gender information in structured EHR fields.
Objective To develop a deep learning classifier to accurately identify patient gender identity using patient-level EHR data, including free-text notes.
Methods This study included adult patients in a large healthcare system in Boston, MA, between 4/1/2017 and 4/1/2022. To identify relevant information in the large volume of clinical notes and to reduce noise, we compiled a list of gender-related keywords through expert curation, literature review, and expansion via a fine-tuned BioWordVec model. This keyword list was used to pre-screen potential TGD individuals and create two datasets for model training, testing, and validation. Dataset I was a balanced dataset that contained clinician-confirmed TGD patients and cases without keywords. Dataset II contained cases with keywords. The performance of the deep learning model was compared with that of traditional machine learning and rule-based algorithms.
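The keyword pre-screening step described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: the three keywords are hypothetical placeholders (the 109-term list is not reproduced here), and the names `build_pattern` and `prescreen` as well as the patient-to-notes dictionary layout are assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

# Hypothetical placeholder keywords; the study's actual keyword list is not shown here.
KEYWORDS = ["transgender", "gender dysphoria", "nonbinary"]


def build_pattern(keywords: List[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive pattern that matches any keyword as a whole term."""
    # Longest terms first so multi-word keywords are preferred over shorter ones.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def prescreen(notes_by_patient: Dict[str, List[str]],
              pattern: "re.Pattern[str]") -> Dict[str, Set[str]]:
    """Return, for each patient, the set of keywords found in any of their notes."""
    hits: Dict[str, Set[str]] = {}
    for patient_id, notes in notes_by_patient.items():
        found: Set[str] = set()
        for note in notes:
            found.update(m.group(0).lower() for m in pattern.finditer(note))
        hits[patient_id] = found
    return hits


# Toy notes, invented for illustration only.
notes = {
    "p1": ["Patient identifies as Transgender.", "Follow-up for gender dysphoria."],
    "p2": ["No acute distress. Routine visit."],
}
hits = prescreen(notes, build_pattern(KEYWORDS))
with_keywords = [p for p, found in hits.items() if found]
without_keywords = [p for p, found in hits.items() if not found]
```

In this sketch, patients in `with_keywords` correspond to the pool of cases with keywords, and `without_keywords` to the pool of cases without keywords; how the study sampled and labeled patients from those pools is not specified in the abstract.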
Results The final keyword list consisted of 109 keywords, of which 58 (53.2%) were added through expansion by the BioWordVec model. Dataset I contained 3,150 patients (50% TGD), while Dataset II contained 200 patients (90% TGD). On Dataset I, the deep learning model achieved an F1 score of 0.917, a sensitivity of 0.854, and a precision of 0.980; on Dataset II, it achieved an F1 score of 0.969, a sensitivity of 0.967, and a precision of 0.972. The deep learning model significantly outperformed rule-based algorithms.
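For readers less familiar with the metrics, the F1 score is the harmonic mean of precision and sensitivity (recall). The sketch below uses the standard definition; the function name `f1_score` is illustrative, and whether the study computed F1 in exactly this way (for example, per class or averaged over runs) is not stated in the abstract.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0  # convention when both metrics are zero
    return 2 * precision * sensitivity / (precision + sensitivity)


# Dataset II values as reported in the Results.
f1_dataset_ii = f1_score(precision=0.972, sensitivity=0.967)
print(round(f1_dataset_ii, 3))  # 0.969
```

The Dataset II figures reproduce the reported F1 score under this formula. The Dataset I figures give roughly 0.913 rather than the reported 0.917; the abstract does not state how the metrics were aggregated, so that difference is left unresolved here.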
Conclusion This is the first study to show that deep learning algorithms can accurately identify gender identity using EHR data. Future work should leverage and evaluate additional diverse data sources to generate more generalizable algorithms.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by a research grant from CRICO, the medical malpractice insurance organization.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Mass General Brigham IRB #2021P001964
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
Footnotes
yining_hua{at}hms.harvard.edu; lwang{at}bwh.harvard.edu; vnguyen31{at}bwh.harvard.edu; dbates{at}bwh.harvard.edu; dfoer{at}bwh.harvard.edu; lzhou{at}bwh.harvard.edu; amcdowell4{at}mgh.harvard.edu; mrieuwerden{at}mgh.harvard.edu
Data Availability
The data sets used for training and evaluation in this study are available upon reasonable request from the corresponding author, pending the necessary institutional reviews and approvals.
Abbreviations
- BERT: Bidirectional Encoder Representations from Transformers
- EHR: Electronic Health Record
- MGB: Mass General Brigham
- NLP: Natural Language Processing
- TGD: Transgender and Gender Diverse
- SVM: Support Vector Machine
- TF-IDF: Term Frequency-Inverse Document Frequency