Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data

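Basic Information

Grant number:
10646324
Principal investigator:
Padhraic Smyth
Amount:
$330,700
Host institution:
NEW YORK UNIVERSITY SCHOOL OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-08-06 to 2025-05-31
Project status:
Not yet concluded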

Project Abstract

Since 2010, clinical medicine has benefited from a rapid surge of clinical research on chronic diseases using data from electronic health records (EHRs). EHRs are appealing because they can offer large sample sizes, timely information, and a wealth of clinical information beyond that obtained from either health surveys or administrative data. However, although large EHR databases contain millions of patient records, they are not population-representative random samples, a constraint that potentially biases inferences based on such data and has therefore limited their utility for population health research. EHR data typically contain multiple types of bias, particularly: 1) sampling inclusion bias: EHR data include information only on patients who visit participating medical systems, and they primarily capture data when patients are ill. Even among populations with a particular disease, EHRs tend to over-represent individuals who are sicker and have higher health care utilization; 2) sampling frequency bias: patients' encounters and features are recorded in EHRs at varying frequencies, and these frequencies correlate with both patients' characteristics and outcomes; and 3) institution bias: the EHR sample of any hospital reflects the characteristics of the patient population served by that specific hospital. Consequently, EHR-based risk prediction models will have 1) biases in risk factor selection and estimation for population inferences; 2) disparate mistreatment (unfairness), in the form of variation in a model's prediction accuracy across patient subgroups (such as gender, race, and age) with different sampling inclusion probabilities or frequencies; and 3) predictions biased toward the characteristics of patients served by the local hospitals.
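
To make the notion of disparate mistreatment concrete, the following is a minimal sketch that measures how prediction accuracy varies across subgroups. It is an illustration only, not the project's method: the function name `accuracy_by_group`, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for paired labels, predictions, and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy data: group "A" is sampled more often than group "B".
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A"] * 6 + ["B"] * 4

per_group = accuracy_by_group(y_true, y_pred, groups)

# The accuracy gap between the best- and worst-served subgroup is one
# simple summary of unequal model performance.
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

In this toy case the overall accuracy is 0.8, which hides the fact that the less frequently sampled group is predicted no better than chance.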
We propose to develop: 1) an effective sample-weighting method to correct biases in risk factor selection and estimation for population inferences (Aim 1); 2) a flexible deep learning method for EHR personalized risk prediction with fairness criteria (Aim 2); and 3) an innovative calibration method to improve reproducibility of EHR-based risk models between institutions (Aim 3). We will predict the risk of subsequent incident cardiovascular disease (CVD) in patients with type 2 diabetes (T2DM) as a demonstration of methodology development. These methods will be generally applicable to other disease outcomes and populations of interest. To develop and validate these methods, we propose to analyze three unique datasets: 1) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 2) the New York City Clinical Data Research Network (NYC-CDRN), an EHR network comprising 20 NYC healthcare institutions, including the NYU-CDRN, with longitudinally linked data on >12 million patient encounters under a Common Data Model; and 3) the Health and Retirement Survey (HRS, begun in 1992 and ongoing), a benchmark population-based cohort that has nationally representative health interview data for over 20 years, as well as biomarkers, physical assessment information, prescription drug data, and claims linkages.
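
The abstract does not specify the sample-weighting method of Aim 1. As a rough illustration of the general idea, the sketch below uses standard inverse-probability-of-inclusion weighting, which is swapped in here as an assumption. The strata names, the inclusion probabilities, and the function `weighted_event_rate` are hypothetical, and the probabilities are treated as known, whereas in practice they would have to be estimated.

```python
def weighted_event_rate(records, inclusion_prob):
    """Estimate a population event rate from a biased sample.

    Each record is (stratum, event) with event in {0, 1}. Every record is
    weighted by 1 / P(inclusion | stratum), and the weighted events are
    divided by the total weight.
    """
    weighted_events = 0.0
    total_weight = 0.0
    for stratum, event in records:
        weight = 1.0 / inclusion_prob[stratum]
        weighted_events += weight * event
        total_weight += weight
    return weighted_events / total_weight


# Hypothetical setup: sicker patients are far more likely to appear in the EHR.
inclusion_prob = {"sicker": 0.8, "healthier": 0.2}

# Toy EHR sample drawn from an imagined population of 200 sicker patients
# (event rate 0.5) and 800 healthier patients (event rate 0.1).
records = (
    [("sicker", 1)] * 80
    + [("sicker", 0)] * 80
    + [("healthier", 1)] * 16
    + [("healthier", 0)] * 144
)

naive_rate = sum(event for _, event in records) / len(records)
weighted_rate = weighted_event_rate(records, inclusion_prob)
print(naive_rate, weighted_rate)
```

The unweighted rate (0.30) overstates the event rate of the imagined population (0.18) because sicker patients are over-represented; the weighted estimate recovers the population value under the stated assumptions.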
