CAREER: Mitigating the Lack of Labeled Training Data in Machine Learning Based on Multi-level Optimization

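Basic Information

  • Grant number:
    2339216
  • Principal investigator:
  • Amount:
    $500,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Period:
    2024-09-01 to 2029-08-31
  • Project status:
    Ongoing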

Project Abstract

Machine learning has demonstrated great success in numerous applications such as autonomous driving, early detection of diseases, and drug design. The accuracy of machine learning models depends heavily on access to large-scale, human-labeled training data. However, such data is often very challenging to acquire in specialized domains such as healthcare, legislation, and environmental sciences, because of the high cost of obtaining high-quality human labels and because of data privacy concerns. This project will advance science by providing algorithms, software, and systems that can automatically generate high-quality labeled data to mitigate the lack of labeled training data in specific domains and allow the training of highly accurate machine learning models. The project will significantly broaden the applicability of machine learning across various application areas by lowering data barriers and will substantially reduce the labor costs of manual data annotation. For example, it will promote scientific discovery in structural biology and high-energy physics and streamline engineering design in wireless communication. It will facilitate the early detection of sepsis, lung cancer, Parkinson's disease, and sleep apnea, improving patient outcomes and quality of life. Applied to compound design and cement production, the developed technologies have the potential to expedite drug discovery and reduce energy consumption.
To achieve the goal of creating high-quality labeled training data, this project will develop three complementary paradigms of novel approaches based on multi-level optimization and large language models, for: 1) end-to-end generation of labeled data; 2) annotation of unlabeled data; and 3) example-specific adaptation/selection of labeled source data, respectively. First, the proposed data generation methods will leverage the worst-case and class-specific performance of downstream models to provide end-to-end and fine-grained guidance for generating data (with complex labels) that is tailored to improve the accuracy and robustness of downstream models and to promote balanced performance across different classes. Second, the proposed data annotation methods will leverage an end-to-end mechanism that capitalizes on large language models, a sequence of verification procedures, and available side information to maximize the accuracy of generated labels. Third, the proposed adaptation/selection methods will distinguish between source examples that are inside or outside of a target domain and subsequently determine an example-specific adaptation/selection action end-to-end to ensure optimal use of source data. In addition, the proposed novel optimization algorithms and distributed systems will tackle new challenges related to multi-level optimization, including non-differentiability, incompatibility with the optimizers of large language models, and scalability. This project is the first to systematically leverage multi-level optimization to create labeled data. It addresses a fundamental knowledge gap: existing methods often lack the capability to execute multiple learning stages end-to-end and therefore fall short in tailoring generated data to improve downstream models’ performance. Another significant innovation of this project is its effective harnessing of large language models for data annotation, which will substantially reduce the costs of manual labeling.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Pengtao Xie

Mechanisms of structural thermal stress influence on hypersonic aerothermoelasticity
  • DOI:
    10.1016/j.ast.2025.110396
  • Published:
    2025-09-01
  • Journal:
  • Impact factor:
    5.800
  • Authors:
    Pengtao Xie;Kun Ye;Zhengyin Ye
  • Corresponding author:
    Zhengyin Ye
Generative AI enables medical image segmentation in ultra low-data regimes
  • DOI:
    10.1038/s41467-025-61754-6
  • Published:
    2025-07-14
  • Journal:
  • Impact factor:
    15.700
  • Authors:
    Li Zhang;Basu Jindal;Ahmed Alaa;Robert Weinreb;David Wilson;Eran Segal;James Zou;Pengtao Xie
  • Corresponding author:
    Pengtao Xie
Supersonic flutter mechanism of “diamond-back” folding wings
  • DOI:
    10.1016/j.ast.2024.109396
  • Published:
    2024-10-01
  • Journal:
  • Impact factor:
    5.800
  • Authors:
    Pengze Xie;Kun Ye;Pengtao Xie;Shubao Chen;Xiaopeng Wang;Zhengyin Ye
  • Corresponding author:
    Zhengyin Ye
Inference of multiple-wave population admixture by modeling decay of linkage disequilibrium with polynomial functions
  • DOI:
    10.1038/hdy.2017.5
  • Published:
    2016-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ying Zhou;Kai Yuan;Yaoliang Yu;Xumin Ni;Pengtao Xie;Eric P. Xing;徐书华
  • Corresponding author:
    徐书华
DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
  • DOI:
    10.48550/arxiv.2309.03907
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Youwei Liang;Ruiyi Zhang;Li Zhang;Pengtao Xie
  • Corresponding author:
    Pengtao Xie

Similar NSFC Grants

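Mechanism by which the molecular chaperone Hsp90/FKBP51 complex alleviates ischemia-reperfusion injury in the diabetic heart by stabilizing PPARγ
  • Grant number:
    JCZRYB202500986
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project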
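Clinical efficacy and mechanism of the Tiaofu Tongqiao Taiji Tuina manipulation for ischemic PSD, based on SIRT1 deacetylation alleviating hippocampal neuroinflammation
  • Grant number:
    JCZRLH202500203
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project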
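Mechanism by which the store-operated calcium entry inhibitor CM4620 alleviates acute lung injury
  • Grant number:
    QN25H010011
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project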
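Chili pepper-derived nanovesicles alleviate fibrosis after myocardial infarction by inhibiting endothelial-to-mesenchymal transition via the AMPK/eNOS axis
  • Grant number:
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project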
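Mechanism by which angelicin, an active component of Imperata cylindrica rhizome (Baimaogen), alleviates renal fibrosis in IgA nephropathy by regulating macrophage plasticity
  • Grant number:
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project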
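Mechanism by which artesunate alleviates post-resuscitation myocardial injury by regulating the FOXO3/FDX1 cuproptosis pathway
  • Grant number:
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project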
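Glucagon-like peptide-1 receptor agonists alleviate blood-brain barrier disruption after ischemic stroke via endothelial UCP2
  • Grant number:
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project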
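Role and mechanism of liver-kidney crosstalk-induced AHSG in alleviating acute kidney injury by inhibiting ferroptosis
  • Grant number:
    2025JJ50605
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project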
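Experimental study of PKCγ alleviating cerebral ischemia-reperfusion injury by regulating the phosphorylation of mitochondrial proteins
  • Grant number:
    2025JJ50618
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project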
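TGR5 regulation of inflammation to alleviate myocardial ischemia-reperfusion injury, and its mechanism
  • Grant number:
    2025JJ70390
  • Year approved:
    2025
  • Funding amount:
    0.0 (unit: 10,000 CNY)
  • Project type:
    Provincial/municipal-level project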

Similar Overseas Grants

Domino - Computational Fluid Dynamics Modelling of Ink Droplet Breakup for Mitigating Mist Formation during inkjet printing
  • Grant number:
    10090067
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Collaborative R&D
Collaborative Research: Leveraging the interactions between carbon nanomaterials and DNA molecules for mitigating antibiotic resistance
  • Grant number:
    2307222
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
Collaborative Research: AF: Medium: Algorithms Meet Machine Learning: Mitigating Uncertainty in Optimization
  • Grant number:
    2422926
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Continuing Grant
Improving females' health and performance by mitigating heat strain
  • Grant number:
    MR/X036235/1
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Fellowship
Mitigating the Influence of Social Bots in Heterogeneous Social Networks
  • Grant number:
    DP240100181
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Discovery Projects
Collaborative Research: Leveraging the interactions between carbon nanomaterials and DNA molecules for mitigating antibiotic resistance
  • Grant number:
    2307223
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
IMPLEMENTATION: Shifting Culture and Mitigating Inequities in Landscape Ecology Through a Collaborative Network of Professional Societies
  • Grant number:
    2335225
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Standard Grant
CAREER: Strengthening the Theoretical Foundations of Federated Learning: Utilizing Underlying Data Statistics in Mitigating Heterogeneity and Client Faults
  • Grant number:
    2340482
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Continuing Grant
Temporary Pacing Leads - Scoping and Mitigating Environmental Impact
  • Grant number:
    10086805
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Collaborative R&D
Mitigating salmon gill disease by integrating genotype-environment studies with host-gill microbiome associations
  • Grant number:
    BB/Y005295/1
  • Fiscal year:
    2024
  • Funding amount:
    $500,000
  • Project type:
    Research Grant