CAREER: New data integration approaches for efficient and robust meta-estimation, model fusion and transfer learning

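Basic Information

Award number:
2337943
Principal investigator:
Emily Hector
Amount:
$450,000
Host institution:
North Carolina State University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2024
Funding country:
United States
Start and end dates:
2024-06-01 to 2029-05-31
Project status:
Ongoing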

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2337943&HistoricalAwards=false
Keywords:
CAREER New data integration approaches

Project Abstract

Statistical science aims to learn about natural phenomena by drawing generalizable conclusions from an aggregate of similar experimental observations. With the recent “Big Data” and “Open Science” revolutions, scientists have shifted their focus from aggregating individual observations to aggregating massive publicly available datasets. This endeavor is premised on the hope of improving the robustness and generalizability of findings by combining information from multiple datasets. For example, combining data on rare disease outcomes across the United States can paint a more reliable picture than basing conclusions only on a small number of cases in one hospital. Similarly, combining data on disease risk factors across the United States can distinguish local from national health trends. To date, statistical approaches to these data aggregation objectives have been limited to simple settings with limited practical utility. In response to this gap, this project develops new methods for aggregating information from multiple datasets in three distinct data integration problems grounded in scientific practice. The developed approaches are intuitive, principled and robust to substantial differences between datasets, and are broadly applicable in medical, economic and social sciences, among others. Among other applications, the project will deliver new tools to extract health insights from large electronic health records databases. The project will support undergraduate and graduate student training, course development, and the recruitment and professional mentoring of under-represented minorities in statistics. Further, the project will impact STEM education through a data science teacher training program in underserved communities.

This project develops intuitive, principled, robust and efficient methods in three essential data integration problems: meta-analysis, model fusion and transfer learning. First, the project delivers a set of meta-analysis methods for privacy-preserving one-shot estimation and inference using a new notion of dataset similarity. The primary novelty in the approach is the joint estimation of both dataset-specific parameters and a combined parameter that bears some similarity to the classic meta-estimator. Second, the project establishes model fusion methods that learn the clustering of similar datasets. The methods’ unique feature is a model fusion that dials data integration along a spectrum of more to less fusion and thereby does not force model parameters from clustered datasets to be exactly equal. Third, the project develops flexible and robust transfer learning approaches that leverage historical information for improved statistical efficiency in a target dataset of interest. An important element of these approaches is a flexible specification of the type of models fit to the source datasets. All three sets of methods place a premium on interpretability, statistical efficiency and robustness of the inferential output. The project unifies the three sets of proposed methods under a formal data integration framework formulated around two axioms of data integration. Data integration ideas pervade every field of scientific study in which data are collected, and so the research contributes to scientific endeavors in the medical, economic and social sciences, among others.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
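
As background for the "classic meta-estimator" that the abstract refers to, the sketch below shows one common textbook form: the fixed-effect, inverse-variance weighted average of per-dataset estimates. This is an assumption about which classic estimator is meant, and it is not the project's proposed method. The function name `inverse_variance_meta` and the example numbers are hypothetical. Only summary statistics (an estimate and a standard error per dataset) are combined, so no individual records need to be shared.

```python
import math


def inverse_variance_meta(estimates, std_errors):
    """Fixed-effect meta-estimate from per-dataset summaries.

    Illustrative only: a standard textbook estimator, not the method
    developed in the award described above.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Each dataset is weighted by the inverse of its sampling variance,
    # so more precise datasets contribute more to the combined estimate.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)

    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # Standard error of the combined estimate under the fixed-effect model.
    combined_se = math.sqrt(1.0 / total_weight)
    return combined, combined_se


if __name__ == "__main__":
    # Hypothetical summaries from three sites (made-up numbers).
    est, se = inverse_variance_meta([0.8, 1.1, 0.95], [0.2, 0.4, 0.1])
    print(f"combined estimate = {est:.3f}, standard error = {se:.3f}")
```

This estimator assumes all datasets share a single common parameter. The abstract contrasts that setting with methods that also estimate dataset-specific parameters and tolerate substantial differences between datasets.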