Purpose
To (1) use All of Us (AoU) data to validate a previously published single-center model predicting the need for surgery among individuals with glaucoma, (2) train new models using AoU data, and (3) share insights regarding this novel data source for ophthalmic research.
Design
Development and evaluation of machine learning models.
Methods
Electronic health record data were extracted from AoU for 1,231 adults diagnosed with primary open-angle glaucoma. The single-center model was applied to AoU data for external validation. AoU data were then used to train new models for predicting the need for glaucoma surgery using multivariable logistic regression, artificial neural networks, and random forests. Five-fold cross-validation was performed. Model performance was evaluated based on area under the receiver operating characteristic curve (AUC), accuracy, precision, and recall.
Results
The mean (standard deviation) age of the AoU cohort was 69.1 (10.5) years; 57.3% of participants were women and 33.5% identified as black, both significantly higher proportions than in the single-center cohort (P = .04 and P < .001, respectively). Of 1,231 participants, 286 (23.2%) needed glaucoma surgery. When applying the single-center model to AoU data, accuracy was 0.69 and AUC was only 0.49. Using AoU data to train new models resulted in superior performance: AUCs ranged from 0.80 (logistic regression) to 0.99 (random forests).
Conclusions
Models trained with national AoU data outperformed the model trained with single-center data. Although AoU does not currently include ophthalmic imaging, it offers several strengths over similar big-data sources such as claims data. AoU is a promising new data source for ophthalmic research.
The digitization of health care data offers an outstanding opportunity to better understand the complex relationships between systemic disease and glaucoma, the leading cause of irreversible blindness globally. Understanding these relationships is critical to enabling precision treatment of patients with glaucoma, who are often elderly and have comorbid conditions such as hypertension and diabetes. Several studies have suggested that systemic diseases and medications may influence the development or progression of glaucoma. Electronic health records (EHRs) contain a vast amount of clinical data that may help further our understanding of these associations. EHR data have been employed to develop predictive models in a wide range of clinical applications. Within ophthalmology, several models have been developed to predict glaucoma onset and progression based on structural and functional data related to the eye, but few have used systemic data from the EHR. To address this gap, we previously developed a single-center model predicting glaucoma progression using systemic data from the EHR at our institution. Its predictive performance suggested that systemic EHR data could help identify patients at risk of progressing to glaucoma surgery, even in the absence of ophthalmic data. The results of that study also provided the rationale for further investigation and model refinement, particularly because the initial model was derived from a small sample.
To test generalizability, we used nationwide data from the All of Us Research Program (All of Us [AoU]) to further examine the utility of systemic EHR data for predicting glaucoma progression. Motivated by the success of other large-scale national-level cohort studies such as the UK Biobank, the Million Veteran Program, and the China Kadoorie Biobank, AoU aims to collect data for 1 million persons living in the United States to advance precision medicine. Data collected include health questionnaires, EHRs, physical measurements (PMs), and data derived from digital health technology. The program also collects and analyzes biospecimens, and it emphasizes enrollment of diverse participants traditionally underrepresented in biomedical research. Enrollment opened in May 2018; as of June 2020, AoU had enrolled >345,000 participants from both clinic-based and community-based recruitment sites. In October 2019, the program began an alpha demonstration phase for the Researcher Workbench, giving selected research teams access to some participant data while the Workbench was being refined. Ours was the only alpha demonstration project related to ophthalmology.
The aims of our study were to: (1) externally validate our single-center model’s performance with AoU data, (2) develop models trained by the AoU data and compare their performance to our single-center model, and (3) share insights from our experience using AoU data and the Researcher Workbench with other ophthalmology researchers who may be interested in using this novel data source.
METHODS
Study Population, Data Source, and Demonstration Project Information
The methods underlying our initial single-center model were described in detail previously. In short, we examined data from a cohort of adult participants from a single academic center with primary open-angle glaucoma over a 5-year period; we extracted their systemic EHR data from our institution’s clinical data warehouse and developed predictive models using multivariable logistic regression, random forests, and artificial neural networks (ANNs).
The goals, recruitment methods and sites, and scientific rationale for AoU have been described previously. Demonstration projects were designed to describe the cohort, replicate previous findings for validation, and avoid novel discovery, in line with the program value of ensuring equal access by researchers to the data. The work described here was proposed by Consortium members, reviewed and overseen by the program’s Science Committee, and confirmed as meeting criteria for non-human subjects research by the AoU Institutional Review Board. The initial release of data and tools used in this work was published recently. Results reported are in compliance with the AoU Data and Statistics Dissemination Policy disallowing disclosure of group counts under 20.
This work was performed on data collected by the previously described AoU Research Program using the AoU Researcher Workbench, a cloud-based platform where approved researchers can access and analyze AoU data. The AoU data currently include surveys, EHRs, and PMs. The details of the surveys are available in the Survey Explorer found in the Research Hub, a website designed to support researchers. Each survey includes branching logic, and all questions are optional and may be skipped by the participant. PMs recorded at enrollment include systolic and diastolic blood pressure, height, weight, heart rate, waist and hip measurement, wheelchair use, and current pregnancy status. EHR data were linked for participants who consented. All 3 data types (survey, PM, and EHR) are mapped to the Observational Medical Outcomes Partnership (OMOP) common data model v 5.2 maintained by the Observational Health Data Sciences and Informatics collaborative. To protect participant privacy, a series of data transformations were applied. These included suppression of codes with a high risk of identification, such as military status; generalization of categories, including age, sex at birth, gender identity, sexual orientation, and race; and date shifting by a random number of days (less than 1 year), implemented consistently across each participant record. Documentation on privacy implementation and creation of the Curated Data Repository is available in the AoU Registered Tier Curated Data Repository Data Dictionary. The Researcher Workbench currently offers tools with a user interface built for selecting groups of participants (Cohort Builder), creating datasets for analysis (Dataset Builder), and Workspaces with Jupyter Notebooks (Notebooks) to analyze data. The Notebooks enable use of saved datasets and direct query using the R and Python 3 programming languages.
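The date-shifting transformation described above can be illustrated with a minimal sketch. This is not the program's implementation: the function names, the backward direction of the shift, and the 1 to 364 day range are assumptions made for illustration. The point it shows is that one offset per participant, applied to every record, hides true calendar dates while preserving the intervals between a participant's own events.

```python
import random
from datetime import date, timedelta

def assign_offsets(person_ids, seed=0, max_days=364):
    """Draw one random offset (1..max_days days) per participant (hypothetical helper)."""
    rng = random.Random(seed)
    return {pid: rng.randint(1, max_days) for pid in sorted(set(person_ids))}

def shift_records(records, offsets):
    """Shift every (person_id, event_date) record back by that person's offset."""
    return [(pid, d - timedelta(days=offsets[pid])) for pid, d in records]

# Toy records: two events for participant 1, one for participant 2.
records = [
    (1, date(2019, 3, 1)),
    (1, date(2019, 3, 15)),
    (2, date(2018, 7, 4)),
]
offsets = assign_offsets(pid for pid, _ in records)
shifted = shift_records(records, offsets)

# The 14-day gap between participant 1's events survives the shift.
interval = shifted[1][1] - shifted[0][1]
```

Because intervals within a participant are preserved, time-to-event definitions such as "within 6 months of diagnosis" remain usable on shifted data.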
Figure 1 depicts our study workflow. At the time of the alpha demonstration phase, there were 242,070 adult participants in AoU. Queries using Systematized Nomenclature of Medicine (SNOMED) codes for “Glaucoma” and subsequently “Primary open-angle glaucoma” narrowed the cohort. Finally, to maintain consistency in cohort definition with the initial single-center model, our final study cohort consisted of adult (aged 18 years and above) participants with International Classification of Diseases (ICD)-9 code 365.11 or any variants of ICD-10 code H40.11, derived from 26 distinct enrollment sites.
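As a rough sketch of the final cohort rule, the code-matching step might look like the following. The function name and the toy diagnosis records are hypothetical, and treating "any variants of H40.11" as "any code beginning with H40.11" is an assumption about how the variants were matched.

```python
def is_poag_code(code: str) -> bool:
    """True for ICD-9 365.11 or any ICD-10 code beginning with H40.11."""
    code = code.strip().upper()
    return code == "365.11" or code.startswith("H40.11")

# Toy diagnosis rows (hypothetical structure, not the Workbench schema).
diagnoses = [
    {"person_id": 1, "code": "H40.1131"},  # ICD-10 POAG variant: included
    {"person_id": 2, "code": "365.11"},    # ICD-9 POAG: included
    {"person_id": 3, "code": "H40.10X0"},  # unspecified open-angle glaucoma: excluded
    {"person_id": 4, "code": "365.12"},    # low-tension glaucoma: excluded
]
cohort_ids = sorted({d["person_id"] for d in diagnoses if is_poag_code(d["code"])})
```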
Data Processing
We used the Researcher Workbench to extract relevant data for the analysis. Within the Workbench, we first defined the cohort according to the criteria mentioned above. Next, we built concept sets for the outcome and each predictor in the Workbench by selecting relevant codes (eg, ICD and/or SNOMED codes for conditions, Logical Observation Identifiers Names and Codes [LOINC] for measurements and observations, RxNorm codes for medications, and Current Procedural Terminology [CPT] codes for procedures). Because a key aim was to validate the prior single-center model, we extracted the same data types for this cohort. For example, we defined the outcome of interest equivalently to the initial single-center model (ie, need for glaucoma procedural intervention, either laser or surgery, within 6 months of diagnosis) based on qualifying CPT codes. The rationale was that the need for surgery serves as a surrogate for advancing/progressive disease. This approach was also used by Zheng and associates in examining associations between systemic medications and glaucoma. We then built concept sets for all predictors, such as vital signs (pulse, blood pressure), PMs (height, weight, body mass index), and comorbid conditions (ICD and SNOMED codes in categories included in the prior single-center model). These concept sets were linked to the study cohort to create “datasets,” or analysis-ready tables linking participants with the values of the selected concept sets. For participants who underwent a glaucoma surgical intervention, data regarding predictors were restricted to the time period before the occurrence of the qualifying procedure code. In other words, all data with a timestamp occurring after the glaucoma procedure were censored and excluded from modeling in order to establish an appropriate temporal relationship between predictors and the outcome.
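The censoring rule in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: the record layout, the helper name, and the choice to drop observations dated on the procedure day itself (a strict "before" comparison) are all hypothetical.

```python
from datetime import date

def censor_after_procedure(observations, procedure_dates):
    """Keep observations dated strictly before the person's qualifying procedure."""
    kept = []
    for obs in observations:
        cutoff = procedure_dates.get(obs["person_id"])  # None if no procedure
        if cutoff is None or obs["date"] < cutoff:
            kept.append(obs)
    return kept

# Toy predictor observations (hypothetical values).
observations = [
    {"person_id": 1, "name": "systolic_bp", "value": 142, "date": date(2019, 1, 10)},
    {"person_id": 1, "name": "systolic_bp", "value": 128, "date": date(2019, 6, 2)},
    {"person_id": 2, "name": "systolic_bp", "value": 118, "date": date(2019, 4, 20)},
]
# Participant 1 had a qualifying procedure; participant 2 did not.
procedure_dates = {1: date(2019, 3, 5)}

usable = censor_after_procedure(observations, procedure_dates)
```

Participants without a procedure keep all of their observations, so only the post-procedure reading for participant 1 is removed.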
We then exported these datasets to a Python 3.0 notebook within the AoU Workbench environment to conduct the analyses. During the export process, the Workbench generated structured query language (SQL) code to extract the data of interest for the selected cohort and populated it directly into the notebook. All data extraction and cleaning procedures can be found in the referenced Python 3.0 notebook in our publicly available workspace.
Of note, the OMOP Visits table was not included in the Workbench interface during the alpha demonstration phase. Therefore, to extract data on predictors related to visits (eg, total days of contact with the health care system, which was a predictor in our original model), we manually constructed a custom SQL query within the notebook. Similarly, we performed a custom SQL query to extract all medications in order to group them within the therapeutic classes that we used in the original model. The Workbench interface allowed search and selection of individual medications, but not of medication classes.
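A custom visit query of the kind described above might resemble the sketch below. It runs against an in-memory SQLite table as a stand-in, so the SQL dialect used in the Workbench may differ. The table and column names follow the OMOP `visit_occurrence` convention, but defining "days of contact" as the count of distinct visit start dates per person is an assumption, not the study's documented definition.

```python
import sqlite3

# In-memory stand-in for the OMOP visit_occurrence table (toy rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visit_occurrence (person_id INTEGER, visit_start_date TEXT)"
)
conn.executemany(
    "INSERT INTO visit_occurrence VALUES (?, ?)",
    [
        (1, "2019-01-05"),
        (1, "2019-01-05"),  # second visit on the same day counts once
        (1, "2019-02-10"),
        (2, "2018-11-30"),
    ],
)

# Assumed definition: distinct calendar days with at least one visit.
query = """
SELECT person_id, COUNT(DISTINCT visit_start_date) AS contact_days
FROM visit_occurrence
GROUP BY person_id
ORDER BY person_id
"""
contact_days = dict(conn.execute(query).fetchall())
conn.close()
```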
The original single-center model did not include ophthalmic data, as the focus was on examining the predictive value of systemic data alone. Here, we also did not include ophthalmic data. One reason was to maintain the same data structure as the original model. Another was that ophthalmic data (beyond diagnosis codes) were sparse in the current AoU data repository. As an illustration, despite 6,665 AoU participants having a SNOMED code including “glaucoma,” structured intraocular pressure (IOP) data (LOINC code 56844-4) were available for fewer than 20 participants. Ophthalmic data coverage in AoU is detailed in [CR] .
Data Analysis and Modeling
General cohort characteristics
We generated descriptive statistics of the AoU study cohort for age, gender, race, and need for glaucoma surgery ( Table 1 ). These characteristics were compared with the initial single-center cohort using t tests for continuous variables and χ 2 tests for categorical variables. We considered P values <.05 as statistically significant.
Characteristic | Single-Center Cohort (N = 385) | All of Us Cohort (N = 1,231) | P Value |
---|---|---|---|
Age (y), mean (SD) | 73.1 (12.2) | 69.1 (10.5) | <.001 |
Female, n (%) | 198 (51.4) | 705 (57.3) | .04 |
Self-reported race, n (%) | | | <.001 |
White | 214 (55.6) | 508 (41.3) | |
Black or African American | 23 (6.0) | 412 (33.5) | |
Asian | 49 (12.7) | 27 (2.2) | |
Other race or mixed race | 70 (18.2) | 21 (1.7) | |
None or skipped | 29 (7.5) | 263 (21.4) | |
Participants who underwent glaucoma surgery, n (%) | 174 (45.2) | 286 (23.2) | <.001 |
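As a worked check of one comparison in Table 1, the sketch below recomputes the χ2 test for the proportion of female participants from the reported counts. It uses the Pearson statistic without continuity correction, which is an assumption because the text does not state whether a correction was applied. For 1 degree of freedom, the P value equals erfc(sqrt(χ2/2)), so only the standard library is needed.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p

# Female vs not female, single-center (198 of 385) vs AoU (705 of 1,231).
chi2, p_value = chi2_2x2(198, 385 - 198, 705, 1231 - 705)
```

Under these assumptions the statistic is about 4.06 with a P value between .04 and .05, consistent with the reported P = .04.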
Using AoU data to externally validate the published single-center model
Out of the models developed in the initial study, the best-performing and most interpretable model was the multivariable logistic regression model ( [CR] ). We therefore developed a dataset from AoU with the top 15 predictors from the single-center model and used it as an external validation set for this regression model. To evaluate performance, we measured area under the receiver operating characteristic curve (AUC) as well as accuracy, sensitivity, and specificity. These performance measures were compared with the performance measures achieved in the original analysis of the single-center datasets, which were generated with leave-one-out cross-validation. This validation was performed in R in the referenced notebook using tidyverse and dplyr .
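External validation of a fixed logistic regression model amounts to applying the published coefficients to new data without refitting and then scoring the predictions. The sketch below illustrates this with invented coefficients, predictor names, and participants; none of these values come from the single-center model. The 0.5 classification threshold is likewise an assumption.

```python
import math

def predict_proba(row, intercept, coefs):
    """Apply frozen logistic regression coefficients to one participant."""
    z = intercept + sum(coefs[name] * row[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec_acc(labels, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(labels)

# Hypothetical frozen model and toy external cohort.
intercept = -4.0
coefs = {"systolic_bp": 0.02, "contact_days": 0.1}
external = [
    {"systolic_bp": 120, "contact_days": 5, "surgery": 0},
    {"systolic_bp": 150, "contact_days": 20, "surgery": 1},
    {"systolic_bp": 130, "contact_days": 30, "surgery": 0},
    {"systolic_bp": 110, "contact_days": 2, "surgery": 1},
    {"systolic_bp": 140, "contact_days": 10, "surgery": 0},
]
labels = [row["surgery"] for row in external]
scores = [predict_proba(row, intercept, coefs) for row in external]
external_auc = auc(labels, scores)
sensitivity, specificity, accuracy = sens_spec_acc(labels, scores)
```

The key design point is that no fitting happens on the external cohort: the coefficients stay fixed, so any drop in AUC reflects how well the original model transfers.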
Development of new models trained by AoU data
Next, we developed a broader dataset from AoU using 56 predictors, rather than limiting to only the 15 predictors included in the single-center model. This dataset was used to train several models, including multivariable logistic regression, random forests, and ANNs, using the scikit-learn and keras libraries in Python. Data were split into 80% for training and 20% for testing. The test dataset was separated before any training procedures, and 5-fold cross-validation was used for all modeling approaches to reduce the risk of overfitting. We evaluated various feature selection techniques using Pearson correlation, backward elimination, recursive feature elimination, and Lasso regularization. Because the feature selection methods did not improve performance, the full complement of features was used for subsequent modeling. Using the test dataset, we evaluated performance metrics such as AUC, accuracy, precision (also known as positive predictive value), and recall (also known as sensitivity). For ANNs, we compared a variety of neural network architectures (eg, varying the number of epochs, the number of hidden layers, and the number of nodes within each layer) using a grid search method. Because of class imbalance of the outcome label in the dataset (286 participants with surgery out of 1,231 total in the cohort), we also evaluated models incorporating minority upsampling or majority downsampling for balancing the classes. Variables of importance were evaluated for the best-performing random forest model.
We also evaluated models trained with data from AoU that were limited to only the 15 predictors available in the original single-center model. Again, we developed models using multivariable logistic regression, random forests, and ANNs using an 80%/20% training/testing split and 5-fold cross-validation. The same ANN architecture was used as the best-performing ANN developed using the broader dataset. Details of all modeling procedures and execution are described in the referenced Python 3.0 notebook.
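The ordering of the split and the class-balancing step matters: the test set is held out first, and upsampling is applied only to the training portion. The sketch below illustrates that ordering with the cohort's class counts. It is a simplified standard-library stand-in for the scikit-learn workflow in the notebook; the helper names, the stratified split, and the random seeds are assumptions.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hold out test_fraction of each class; return (train_idx, test_idx)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def upsample_minority(train_idx, labels, seed=0):
    """Resample minority-class training rows with replacement until classes match."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return train_idx + extra

# Outcome labels with the cohort's class counts: 286 surgery, 945 no surgery.
labels = [1] * 286 + [0] * 945
train_idx, test_idx = stratified_split(labels)   # test set separated first
balanced_idx = upsample_minority(train_idx, labels)  # training rows only
```

Upsampling before the split would let copies of the same participant appear in both training and test data, which can inflate test performance.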
RESULTS
General Cohort Characteristics
There were significant differences between the single-center cohort of participants with glaucoma from the initial study and the AoU cohort (Table 1). The AoU cohort was over triple the size of the single-center cohort and consisted of EHR data derived from 26 distinct enrollment sites. The mean (standard deviation) age was 69.1 (10.5) years, which was significantly younger than the single-center cohort, where the mean (standard deviation) age was 73.1 (12.2) years (P < .001). Participants identifying as female comprised more than half of both cohorts and were significantly better represented in the AoU cohort (57.3% vs 51.4% in the single-center cohort, P = .04). The proportion of participants identifying as black or African American was 33.5% in the AoU cohort, over 5 times that in the single-center cohort. However, the single-center cohort had a much higher proportion of Asian participants (12.7% compared with 2.2% in AoU). In both cohorts, approximately a quarter of participants did not indicate a single racial category, for example, answering “other” or “mixed” or not providing any answer. In total, 286 (23.2%) participants in the AoU cohort underwent some kind of glaucoma procedure. This was a significantly lower percentage than the 45.2% of participants in the single-center cohort (P < .001).
External Validation of the Single-Center Model
To externally validate a previously published single-center model, we used data from AoU as an independent test set using the same coefficients included in the initial model ( [CR] ). The overall accuracy of the model when validated on AoU data was 0.69, exceeding the accuracy of the single-center model when using leave-one-out cross-validation (0.62). The overall discriminative ability (demonstrated by the AUC) of the single-center model when applied to the AoU cohort was only 0.49, indicating no discrimination between those who progressed and did not progress to surgery. This was lower than the AUC of the single-center model when previously validated with data from the same center (0.67).
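The combination of moderate accuracy with an AUC near 0.5 is possible because accuracy depends on class prevalence, whereas AUC depends only on how cases are ranked. As a hedged illustration (not a description of what the single-center model actually predicted), the sketch below scores a hypothetical classifier that assigns every participant the same low risk in a cohort with the AoU class counts.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

n_total, n_surgery = 1231, 286
labels = [1] * n_surgery + [0] * (n_total - n_surgery)

# Hypothetical uninformative model: the same score for everyone,
# which a 0.5 threshold turns into "no surgery" for all participants.
scores = [0.0] * n_total

accuracy = sum((s >= 0.5) == bool(y) for y, s in zip(labels, scores)) / n_total
constant_auc = auc(labels, scores)
```

Such a model is correct for all 945 participants without surgery, giving an accuracy of 945/1,231 (about 0.77), yet its AUC is exactly 0.5. This is why AUC, rather than accuracy alone, is the more informative measure of discrimination when classes are imbalanced.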
Development of New Models Using AoU Data
We then used AoU data with an expanded set of predictors to train new models predicting the need for glaucoma surgery using multivariable logistic regression, random forests, and ANNs. The predictive performance of multivariable logistic regression models trained with AoU data was better than the performance of the model trained with single-center data, with the best-performing logistic regression model trained with AoU achieving an AUC of 0.80 and accuracy of 0.87, although recall was only 0.51 ( Table 2 ). Variables in the logistic regression model are detailed in [CR] . The best-performing ANN was a deep learning network consisting of 4 dense layers trained over 25 epochs on a minority upsampled dataset, resulting in an AUC of 0.93.
Model | AUC | Precision | Recall | Accuracy |
---|---|---|---|---|
Models trained with a dataset containing a broad set of predictor variables (56 predictors) | | | | |
Multivariable logistic regression | 0.80 | 0.90 | 0.51 | 0.87 |
Artificial neural networks | 0.93 | 0.83 | 0.84 | 0.92 |
Random forests | 0.99 | 1.00 | 0.88 | 0.97 |
Models trained with a dataset containing predictor variables restricted to those from the original single-center model (15 predictors) | | | | |
Multivariable logistic regression | 0.81 | 0.40 | 0.75 | 0.68 |
Artificial neural networks | 0.94 | 0.84 | 0.74 | 0.91 |
Random forests | 0.97 | 0.91 | 0.75 | 0.93 |
Predictive modeling using random forests yielded the best performance. With minority upsampling as a class-balancing procedure on the training dataset, random forests achieved an AUC of 0.99 in identifying participants who needed glaucoma surgery ( Figure 2 ). In addition, the random forests model had excellent accuracy, precision, and recall ( Table 2 ).
Based on an analysis of feature importance in the random forest model, predictors related to days of contact with the health care system, systolic blood pressure, diastolic blood pressure, pulse, and body measurements (eg, body mass index) were of highest relative importance ( [CR] ). These carried more importance in predictions than comorbid conditions or medications.
To evaluate whether predictive performance could be maintained with a narrower set of predictor variables, we also trained models using data from AoU using only the 15 statistically significant predictor variables from the original single-center model. Even with a narrower training dataset, these models achieved comparable AUCs (0.81 for logistic regression, 0.94 for ANNs, and 0.97 for random forests). However, their precision, recall, and accuracy were generally not as strong as the models trained with the broader dataset ( Table 2 ).
DISCUSSION
Here, we report findings from a demonstration project leveraging early access to the AoU Research Program, a large-scale prospective nationwide cohort study, centered on predictive modeling of need for glaucoma surgery among adults with primary open-angle glaucoma. This is the first analysis of data related to ophthalmology from AoU.
Our first key finding was that AoU offered a larger number and generally greater diversity of participants compared with our cohort for the original model, which was derived from a single academic center. Notably, the representation of women and of participants identifying as black or African American was significantly higher. This is highly relevant for glaucoma, because women and minorities—and particularly individuals of African descent—bear a disproportionate burden of disease and blindness. However, approximately a quarter of each cohort (for both the single-center and AoU ) had incomplete race information, that is, “other race” or not indicated/completed. Missing data in EHRs are a well-known limitation. For studies aimed at understanding health care disparities or differential risk based on race, excluding individuals with incomplete race information may negatively affect the generalizability of results. The inclusion of genomic data in AoU , anticipated for release at a future date, will enable determination of racial admixtures via sequencing analyses rather than by self-report and may help address some of these gaps.
The initial model trained with data from a single academic center demonstrated weaker performance when using AoU data for external validation, compared with the internal validation. This demonstrated that the model developed with data sampled from one population could not necessarily be applied to patients from another population. This was not surprising for several reasons. First, the initial model was based on a relatively small number of participants originating from a single center. Second, comparing the 2 cohorts revealed significant differences in demographics and incidence of glaucoma surgery. The single-center cohort was older and had higher rates of surgery, which was expected for a clinic-based population at a tertiary referral center. In contrast, AoU includes community-based recruitment sites in addition to clinic-based sites, reflecting an overall younger and healthier population. Using data from this population may therefore offer additional insights for primary prevention efforts, by including individuals beyond clinic- or hospital-based populations alone. Finally, decreased performance of predictive models on external validation is not uncommon, especially when the sample size for developing the model is small. However, a systematic review of 120 clinical risk prediction models found that there is a relative dearth of well-conducted and clearly reported external validation studies on independent data, which limits the translation of predictive models to clinical guidelines and practice. AoU can serve as a data source for external validation studies to help address this issue.
Models trained with AoU data achieved better predictive performance than the single-center models, even when trained with data limited to only the 15 predictor variables from the original single-center model. The best-performing model overall was the random forest model trained with the broader AoU dataset, achieving an AUC of 0.99. Variables in this model with the highest relative importance for driving predictions included measurements related to blood pressure. This supports findings from the initial single-center model, where predictions were also heavily influenced by blood pressure measurements. Several prior studies have demonstrated relationships between blood pressure, IOP, and glaucoma. However, these relationships are complex and not completely understood. With rapidly evolving blood pressure management guidelines recommending more aggressive blood pressure reduction, and the potential for increased glaucoma progression events secondary to optic nerve hypoperfusion, improving understanding of these relationships is critically important. Our study results provide additional support for further investigation in this area.
Another advantage of models such as these is that they facilitate integration of systemic data into clinical decision-making. Ophthalmology is known to be a high-volume specialty that demands high efficiency during patient encounters. As such, time spent reviewing the medical record for each patient is minimal, which has been demonstrated in several prior studies. Given these time pressures and the lack of detailed medical record review, clinical decision support tools embedded into EHR systems could automatically extract relevant data elements (such as medications, comorbidities, and blood pressure measurements) and calculate risk scores for ophthalmologists to view directly, enabling these important factors to be considered in the assessment and management of patients. Ophthalmologists would therefore not need to extract these data elements themselves or perform any manual calculations. For a given patient, they could theoretically view the risk calculated by the model based on these systemic factors in the EHR and incorporate that information into their overall assessment of the patient. In general, clinical decision support tools that calculate risk scores in real time within EHRs to enable predictive analytics are increasingly common, although several implementation challenges are present. Moreover, a recent systematic review did not find any publications describing predictive models related to ophthalmology that had been embedded into EHRs for real-time use. Therefore, developing best practices for implementing these models for routine use by ophthalmologists is a nascent area ripe for ongoing investigation.
Although we have highlighted some of the strengths of AoU data in the context of our study findings above, there are certainly limitations as well. Data on ophthalmic observations (eg, IOP) are sparse ( [CR] ), and this could be secondary to many ophthalmology observations being captured in clinical notes rather than in discrete data fields mapped to LOINC codes. Although free-text clinical notes are not currently available in AoU , future inclusion of these notes and the use of natural language processing techniques may allow more ophthalmic data to be extracted and analyzed. Although AoU includes procedure codes for imaging and visual fields, the data repository does not currently include images or visual fields themselves. Figure 3 delineates current strengths and limitations of AoU data for ophthalmic research in greater detail.