CAREER: New data integration approaches for efficient and robust meta-estimation, model fusion and transfer learning


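Basic Information

  • Award number:
    2337943
  • Principal investigator:
  • Amount:
    $450,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Period:
    2024-06-01 to 2029-05-31
  • Project status:
    Ongoing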

Project Abstract

Statistical science aims to learn about natural phenomena by drawing generalizable conclusions from an aggregate of similar experimental observations. With the recent “Big Data” and “Open Science” revolutions, scientists have shifted their focus from aggregating individual observations to aggregating massive publicly available datasets. This endeavor is premised on the hope of improving the robustness and generalizability of findings by combining information from multiple datasets. For example, combining data on rare disease outcomes across the United States can paint a more reliable picture than basing conclusions only on a small number of cases in one hospital. Similarly, combining data on disease risk factors across the United States can distinguish local from national health trends. To date, statistical approaches to these data aggregation objectives have been limited to simple settings with limited practical utility. In response to this gap, this project develops new methods for aggregating information from multiple datasets in three distinct data integration problems grounded in scientific practice. The developed approaches are intuitive, principled and robust to substantial differences between datasets, and are broadly applicable in medical, economic and social sciences, among others. Among other applications, the project will deliver new tools to extract health insights from large electronic health records databases. The project will support undergraduate and graduate student training, course development, and the recruitment and professional mentoring of under-represented minorities in statistics. Further, the project will impact STEM education through a data science teacher training program in underserved communities.
This project develops intuitive, principled, robust and efficient methods in three essential data integration problems: meta-analysis, model fusion and transfer learning. First, the project delivers a set of meta-analysis methods for privacy-preserving one-shot estimation and inference using a new notion of dataset similarity. The primary novelty in the approach is the joint estimation of both dataset-specific parameters and a combined parameter that bears some similarity to the classic meta-estimator. Second, the project establishes model fusion methods that learn the clustering of similar datasets. The methods’ unique feature is a model fusion that dials data integration along a spectrum of more to less fusion and thereby does not force model parameters from clustered datasets to be exactly equal. Third, the project develops flexible and robust transfer learning approaches that leverage historical information for improved statistical efficiency in a target dataset of interest. An important element of these approaches is a flexible specification of the type of models fit to the source datasets. All three sets of methods place a premium on interpretability, statistical efficiency and robustness of the inferential output. The project unifies the three sets of proposed methods under a formal data integration framework formulated around two axioms of data integration. Data integration ideas pervade every field of scientific study in which data are collected, and so the research contributes to scientific endeavors in the medical, economic and social sciences, among others.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Emily Hector

Similar Overseas Grants

CAREER: New Frontiers of Private Learning and Synthetic Data
  • Award number:
    2339775
  • Fiscal year:
    2024
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: Bootstrapping Recognition from Little Data in New Domains
  • Award number:
    2144117
  • Fiscal year:
    2022
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: Teaching Old Data New Tricks: Leveraging Legacy Field Data to Investigate Ice-stream Shut down and Inspire a New Generation of Cryospheric Scientists
  • Award number:
    2145407
  • Fiscal year:
    2022
  • Funding amount:
    $450,000
  • Project type:
    Standard Grant
CAREER: Advancing Fair Data Mining via New Robust and Explainable Algorithms and Human-Centered Approaches
  • Award number:
    2146091
  • Fiscal year:
    2022
  • Funding amount:
    $450,000
  • Project type:
    Standard Grant
CAREER: New Frontiers In Large-Scale Spatiotemporal Data Analysis
  • Award number:
    2146343
  • Fiscal year:
    2022
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: Developing New Computational Methods to Address the Missing Data Problem in Population Genomics
  • Award number:
    2147812
  • Fiscal year:
    2021
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: Developing New Computational Methods to Address the Missing Data Problem in Population Genomics
  • Award number:
    2042516
  • Fiscal year:
    2021
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: New Frontiers in Computing on Private Data
  • Award number:
    1942789
  • Fiscal year:
    2020
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
CAREER: New Change-Point Problems in Analyzing High-Dimensional and Non-Euclidean Data
  • Award number:
    1848579
  • Fiscal year:
    2019
  • Funding amount:
    $450,000
  • Project type:
    Continuing Grant
Mentoring to Support Designing and Launching of New Data Science Career Pathways at Community Colleges
  • Award number:
    1902568
  • Fiscal year:
    2019
  • Funding amount:
    $450,000
  • Project type:
    Standard Grant