
EAGER: Collaborative Research: Advanced Machine Learning for Prediction of Preterm Birth

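Basic Information

Award number:
1454814
Principal investigator:
Anita Raja
Amount:
approx. $30,600
Institution:
Cooper Union
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2014
Funding country:
United States
Project period:
2014-09-01 to 2017-08-31
Project status:
Completed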

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1454814&HistoricalAwards=false
Keywords:
EAGER Collaborative Research Advanced Machine

Project Abstract

The United States spends over 26 billion dollars per annum on the delivery and care of the 12-13% of infants who are born preterm. As preterm birth (PTB) is a major public health problem with profound implications for society, there would be great value in being able to identify women at risk of preterm birth during the course of their pregnancy. Previous predictive approaches have been largely unsuccessful because they have focused on a limited number of well-described risk factors known to be correlated with preterm birth (e.g., prior preterm birth, race, and infection) and less on combining multiple factors. The latter approach is necessary to understand the complex etiologies of preterm birth. While identifying individual PTB risk factors has brought insight into the problem and has led in some cases to successful treatments, such as progesterone for women with a previous preterm birth, this has had only a limited impact on the overall frequency, since many at-risk patients, such as first-time mothers (nulliparous women), go untreated. Today, there is still no widely tested prediction system that combines well-known PTB factors and is clinically useful. There is, however, a global awareness of the need to discover and integrate the complex etiologies of prematurity in order to identify women at risk. Significant efforts have been made in the last couple of decades to collect large curated datasets of pregnant women. Previous studies on these datasets used relatively straightforward biostatistical methodologies, such as relative risk assessments, to measure associations between factors and PTB. However, these studies examine risk factors independently of each other, which does not account for the multifactorial complexity of PTB. This exploratory project aims to investigate the value of more advanced machine learning methods that consider all the factors simultaneously, in order to develop better predictive methods.
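The relative risk assessments mentioned above look at one factor at a time. As a minimal sketch of that calculation, the Python snippet below computes a relative risk from a two-by-two table. The function name and the counts are hypothetical and purely illustrative; they are not taken from the project's datasets.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of the outcome among the exposed divided by the risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts: 50 preterm births among 200 women with a given factor,
# 100 preterm births among 1000 women without it.
rr = relative_risk(50, 200, 100, 1000)
print(round(rr, 2))
```

A value above 1 indicates that the outcome is more frequent in the exposed group. Because each factor gets its own table, this kind of analysis says nothing about how factors interact, which is the limitation the abstract points to.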
The PTB data acquired in the context of this project brings together Electronic Health Records (EHRs) for mothers and their babies along with well-curated NIH data. The data is rich with structured clinical data and unstructured free text that requires manual feature extraction. This project, largely motivated by the PTB problem, has two main goals.
(1) Improving the quality and aggregation of the annotations for heterogeneous data. The researchers aim to capture socioeconomic, psychological, and behavioral risk factors documented in the text of clinical notes by studying the process of manual feature extraction by human annotators. State-of-the-art methods rely on the expertise of the annotator, the difficulty of the instance, or both, but ignore the variability in the quality of labeling over time due to fatigue, boredom, or knowledge. To improve the annotations, the project will develop a novel Bayesian framework for human labeling of unstructured data. The Bayesian model will embed a complete set of parameters, including the prevalence of each class, the difficulty of the instance, and the variability in the quality of annotation during the process. If the model construction is successful, the developed framework will replace ad-hoc heuristics with a well-designed process for producing high-quality annotations. This framework would allow reliable features to be extracted from the clinical text for subsequent analyses in devising PTB prediction models.
(2) Developing predictive models for multiple data spaces. To leverage all of the existing data, the project will investigate the value of using Vapnik's paradigm of Learning Using Privileged Information (LUPI) in the context of preterm birth. Privileged information is data that is available for training models but is not available for test examples. Data in this project come with two potential privileged information spaces, namely the clinical notes and the space of future events.
NICU data is an example of future-event privileged information, which is only available for a subset of the examples (only premature babies requiring intensive care stay in the NICU). It has been shown that LUPI not only induces a better decision rule but also increases the rate of convergence of the algorithm, hence requiring fewer training examples. This is a compelling property in the case of PTB prediction, given the relatively low rate of PTB (12-13% of births). The project will extend LUPI into a powerful and applicable framework to handle the two spaces of privileged information, while developing spline-generating kernels to manage LUPI's high computational cost. If successful, this proof of concept is expected to yield efficient and widely applicable LUPI algorithms in domains where privileged information is available, such as the financial domain and many other medical applications. The software, publications, and datasets resulting from this project will be made publicly available to the research community through the project website (http://www1.ccls.columbia.edu/~ansaf/CING/PTB/).
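To make the idea of privileged information concrete, here is a minimal Python sketch of a teacher-student scheme in the spirit of generalized distillation. It is a simpler stand-in for illustration only: it is not Vapnik's SVM+ formulation and not the spline-kernel approach the project proposes. All function names, the `imitation` parameter, and the synthetic data are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(X, targets, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent on cross-entropy.

    `targets` may be hard labels (0/1) or soft probabilities in [0, 1].
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, t in zip(X, targets):
            err = predict_proba(w, x) - t
            for j in range(d):
                grad[j] += err * x[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def fit_with_privileged(X, X_priv, y, imitation=0.5):
    """Train a teacher on privileged features, then a student on regular ones.

    The student never sees X_priv; it only receives the teacher's soft labels,
    blended with the true labels according to `imitation`.
    """
    teacher = fit_logistic(X_priv, y)
    soft = [predict_proba(teacher, xp) for xp in X_priv]
    blended = [(1 - imitation) * yi + imitation * si for yi, si in zip(y, soft)]
    return fit_logistic(X, blended)


# Synthetic data (hypothetical): the privileged feature is a clean signal seen
# only during training; the regular feature is a noisier view of the same label.
random.seed(0)
y = [random.randint(0, 1) for _ in range(200)]
X_priv = [[2.0 * yi - 1.0 + random.gauss(0, 0.3)] for yi in y]
X = [[2.0 * yi - 1.0 + random.gauss(0, 1.0)] for yi in y]

student = fit_with_privileged(X, X_priv, y)
train_acc = sum((predict_proba(student, x) >= 0.5) == bool(yi)
                for x, yi in zip(X, y)) / len(y)
print("student training accuracy:", train_acc)
```

At prediction time only `student` and the regular features are needed, which mirrors the setting described above: clinical notes or NICU records can shape training but are not required for a new patient.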
