Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data
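Basic information
- Grant number: 10646324
- Principal investigator:
- Amount: $330,700
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2021
- Funding country: United States
- Project period: 2021-08-06 to 2025-05-31
- Project status: Ongoing
- Source: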
- Keywords: Age; Algorithms; Benchmarking; Biological Markers; Calibration; Cardiovascular Diseases; Characteristics; Chronic Disease; Clinical; Clinical Data; Clinical Medicine; Clinical Research; Computers; Computing Methodologies; Data; Data Set; Development; Diagnosis; Disease; Disease Outcome; Disparate; Drug Prescriptions; Electronic Health Record; Frequencies; Gender; Health; Health Surveys; Healthcare; Hospitals; Individual; Institution; Interview; Link; Medical; Methodology; Methods; Modeling; Neural Network Simulation; New York; New York City; Non-Insulin-Dependent Diabetes Mellitus; Outcome; Patients; Physical assessment; Population; Probability; Procedures; Race; Records; Reproducibility; Research; Research Personnel; Retirement; Risk; Risk Factors; Sample Size; Sampling; Scientist; Site; Software Tools; Statistical Methods; Surveys; System; Universities; Validation; Variant; Visit; cohort; data modeling; data standards; deep learning; demographics; electronic health data; experience; flexibility; health care service utilization; improved; innovation; interest; learning strategy; maltreatment; patient population; patient subsets; personalized risk prediction; population based; population health; predictive modeling; recurrent neural network; risk prediction; risk prediction model; web app
Project summary
Abstract
Since 2010, clinical medicine has benefited from a rapid surge of clinical research on chronic diseases using
data from electronic health records (EHRs). EHRs are appealing because they can offer large sample sizes,
timely information, and a wealth of clinical information beyond that obtained from either health surveys or
administrative data. However, although large EHR databases include millions of patient records, they are not
population-representative random samples, a constraint that can bias inferences based on such data
and has therefore limited their utility for population health research. EHR data typically contain multiple types
of bias, in particular: 1) sampling inclusion bias: EHR data include information only on patients visiting
participating medical systems, and they primarily capture data when patients are ill. Even among populations
with a particular disease, EHRs tend to over-represent individuals who are sicker and
have higher health care utilization; 2) sampling frequency bias: patients' encounters and
features are recorded in EHRs at varying frequencies, and these frequencies correlate with both patient characteristics
and outcomes; and 3) institution bias: the EHR sample from any hospital reflects the characteristics of the patient
population served by that specific hospital. Consequently, EHR-based risk prediction models will have 1)
biases in risk factor selection and estimation for population inferences; 2) disparate mistreatment (unfairness),
that is, variation in a model's prediction accuracy across patient subgroups (such as gender, race, and age)
with different sampling inclusion probabilities or frequencies; and 3) predictions biased toward the characteristics
of patients served by the local hospitals. We propose to develop: 1) an effective sample-weighting method to
correct biases in risk factor selection and estimation for population inferences (Aim 1); 2) a flexible deep learning
method for EHR personalized risk prediction with fairness criteria (Aim 2); and 3) an innovative calibration method
to improve reproducibility of EHR-based risk models between institutions (Aim 3). We will predict risk of
subsequent incident cardiovascular disease (CVD) in patients with type 2 diabetes mellitus (T2DM) as a demonstration
of the methodology. These methods will be generally applicable to other disease
outcomes and populations of interest. To develop and validate these methods, we propose to analyze three
unique datasets: 1) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including
demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 2) the New York City Clinical Data
Research Network (NYC-CDRN), an EHR network comprising 20 NYC healthcare institutions, including the
NYU-CDRN, with longitudinally linked data on >12 million patient encounters under a Common Data Model;
and 3) the Health and Retirement Study (HRS, begun in 1992 and ongoing), a benchmark population-based
cohort that has nationally representative health interview data spanning more than 20 years, as well as biomarkers,
physical assessment information, prescription drug data, and claims linkages.
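As a rough illustration of two ideas in the abstract, the sketch below (Python, standard library only) shows how sample weights can correct sampling inclusion bias and how prediction error can be compared across patient subgroups. It is not the project's method: it substitutes plain post-stratification for the proposed sample-weighting approach and a simple per-group misclassification rate for the fairness criteria. The stratum names (`high_use`, `low_use`), the group labels, the population shares, and all data values are made-up assumptions.

```python
from collections import Counter, defaultdict


def poststratification_weights(sample_strata, population_shares):
    """Weight each record by (population share) / (sample share) of its stratum."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return [population_shares[s] / (counts[s] / n) for s in sample_strata]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def subgroup_error_rates(y_true, y_pred, groups):
    """Misclassification rate per subgroup; a large gap between groups
    is one simple symptom of disparate mistreatment."""
    errors, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] += 1
        errors[g] += int(t != p)
    return {g: errors[g] / totals[g] for g in totals}


# Hypothetical EHR sample: heavy health-care users are over-represented
# (80% of the sample versus an assumed 50% of the benchmark population).
strata = ["high_use"] * 8 + ["low_use"] * 2
outcome = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = event observed (made-up values)
population_shares = {"high_use": 0.5, "low_use": 0.5}  # assumed benchmark shares

weights = poststratification_weights(strata, population_shares)
naive_rate = sum(outcome) / len(outcome)         # unweighted event rate in the sample
adjusted_rate = weighted_mean(outcome, weights)  # reweighted toward the benchmark

# Hypothetical predictions for two subgroups, A and B.
groups = ["A"] * 4 + ["B"] * 4
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
rates = subgroup_error_rates(y_true, y_pred, groups)

print(f"naive event rate:    {naive_rate:.2f}")
print(f"adjusted event rate: {adjusted_rate:.2f}")
print(f"error rate by group: {rates}")
```

In this toy sample the unweighted event rate overstates the population rate because sicker, high-utilization patients dominate the records; reweighting to the benchmark shares pulls the estimate down. In practice the benchmark shares would come from a population-representative source such as the HRS, and the strata would be defined on many more characteristics.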
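Project outcomes
Number of journal papers (0)
Number of monographs (0)
Number of research awards (0)
Number of conference papers (0)
Number of patents (0)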
Other grants by Padhraic Smyth
Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data
- Grant number: 10463550
- Fiscal year: 2021
- Funding amount: $330,700
- Project category:
Similar overseas grants
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
- Grant number: 2337776
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
- Grant number: 2338816
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
- Grant number: 2338846
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
- Grant number: 2348261
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
- Grant number: 2348346
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
- Grant number: 2348457
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
- Grant number: 2404989
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
- Grant number: 2339310
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
- Grant number: 2339669
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Continuing Grant
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
- Grant number: EP/Y029089/1
- Fiscal year: 2024
- Funding amount: $330,700
- Project category: Research Grant