III: Small: Labeling Massive Data from Noisy, Incomplete and Crowdsourced Annotations

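Basic Information

  • Award number:
    2007836
  • Principal investigator:
  • Amount:
    $398,900
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-10-01 to 2024-09-30
  • Project status:
    Completed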

Project Abstract

Alongside the prosperity of deep learning, the demand for reliably labeled data is unprecedentedly high. Label acquisition is a highly nontrivial task: data labeling is tedious, labor-intensive, and prone to mistakes. Crowdsourcing techniques that integrate annotations from multiple annotators to improve accuracy have been essential for labeling large-scale data. However, existing crowdsourcing techniques face pressing challenges such as heavy annotator workload, high computational cost, and a lack of strong theoretical guarantees. This project will develop a series of analytical and computational tools for accurately labeling massive datasets from noisy, incomplete, and crowdsourced annotations, with provable guarantees. Leveraging advanced nonnegative matrix factorization theory, this project will offer solutions that are efficient and effective under critical conditions. The outcomes are expected to have broad and substantial positive impacts on the currently label-hungry artificial intelligence industry and the data annotation workforce. For example, the algorithms designed for handling structured data (e.g., speech) will largely benefit timely applications, e.g., intelligent assistants such as Alexa and Siri. The ability to work reliably under largely incomplete data will help design new data dispatch schemes leading to significantly reduced annotator workload. The project will also offer many training opportunities for undergraduate students, with an emphasis on engaging those from underrepresented groups.
In terms of theory and methods, many aspects of crowdsourced data labeling (e.g., sample complexity, noise robustness, and identifiability of the underlying statistical model) are still poorly understood. This project will provide a suite of theoretical and computational tools that advance these aspects. To be specific, the first thrust will build up a coupled nonnegative matrix factorization (CNMF) framework that bridges the classic Dawid-Skene model for crowdsourcing and advanced nonnegative factor analysis theories. This will establish firm theoretical foundations for crowdsourcing under critical conditions, and lead to theory-backed algorithms to attain substantially improved sample complexity and noise/incomplete data robustness. The second thrust exploits domain-dependent knowledge, e.g., data structure and annotator dependence, to come up with situation-aware crowdsourcing techniques for enhanced performance. The third thrust designs stochastic optimization strategies to provide scalable implementations for the CNMF framework, and evaluates the proposed methods over a variety of real-world applications. The analytical and computational tools developed in this project will provide strong provable guarantees and refreshing algorithmic solutions for long-standing challenges in crowdsourced data labeling. In addition, the CNMF theory and algorithms are exciting new directions for computational linear algebra, whose impacts can go well beyond this project.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
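The abstract does not spell out the CNMF construction, so the following is only a rough illustration of why the Dawid-Skene model lends itself to nonnegative factorization, not the project's algorithm. Under Dawid-Skene, annotators answer independently given the true label, so the joint response distribution of annotators m and j factors as R = A_m diag(d) A_j^T, where A_m is annotator m's confusion matrix and d is the label prior; all factors are nonnegative. The sketch below simulates responses and compares the empirical co-occurrence matrix with this product. The function names, the two-class prior, and the confusion matrices are made-up assumptions, and every annotator labels every item (no missing annotations).

```python
import random


def simulate_dawid_skene(n_items, prior, confusions, seed=0):
    """Sample annotator responses under the Dawid-Skene model.

    prior[c]            = P(true label = c)
    confusions[m][k][c] = P(annotator m answers k | true label = c)
    """
    rng = random.Random(seed)
    classes = list(range(len(prior)))
    responses = []
    for _ in range(n_items):
        c = rng.choices(classes, weights=prior)[0]  # hidden true label
        # Given c, each annotator answers independently from column c
        # of their own confusion matrix.
        responses.append([
            rng.choices(classes, weights=[A[k][c] for k in classes])[0]
            for A in confusions
        ])
    return responses


def empirical_cooccurrence(responses, m, j, n_classes):
    """Estimate R[k][l] = P(annotator m answers k, annotator j answers l)."""
    R = [[0.0] * n_classes for _ in range(n_classes)]
    for row in responses:
        R[row[m]][row[j]] += 1.0 / len(responses)
    return R


def model_cooccurrence(prior, A_m, A_j):
    """Compute A_m diag(prior) A_j^T entry by entry."""
    K = len(prior)
    return [[sum(A_m[k][c] * prior[c] * A_j[l][c] for c in range(K))
             for l in range(K)] for k in range(K)]


# Toy, hypothetical parameters: two classes, two annotators.
prior = [0.6, 0.4]
confusions = [
    [[0.9, 0.2], [0.1, 0.8]],  # annotator 0 (columns sum to 1)
    [[0.7, 0.3], [0.3, 0.7]],  # annotator 1 (columns sum to 1)
]

responses = simulate_dawid_skene(20000, prior, confusions, seed=0)
R_hat = empirical_cooccurrence(responses, 0, 1, 2)
R = model_cooccurrence(prior, confusions[0], confusions[1])
max_gap = max(abs(R_hat[k][l] - R[k][l]) for k in range(2) for l in range(2))
print("largest entrywise gap:", max_gap)
```

With enough items the empirical matrix approaches the nonnegative product, which is the kind of structure a factorization-based method could try to invert to recover the confusion matrices and the prior.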

Project Outcomes

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Deep Clustering with Incomplete Noisy Pairwise Annotations: A Geometric Regularization Approach
  • DOI:
    10.48550/arxiv.2305.19391
  • Published:
    2023-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tri Nguyen; Shahana Ibrahim; Xiao Fu
  • Corresponding authors:
    Tri Nguyen; Shahana Ibrahim; Xiao Fu
Mixed Membership Graph Clustering via Systematic Edge Query
Crowdsourcing via Annotator Co-occurrence Imputation and Provable Symmetric Nonnegative Matrix Factorization
Deep Learning From Crowdsourced Labels: Coupled Cross-entropy Minimization, Identifiability, and Regularization
  • DOI:
    10.48550/arxiv.2306.03288
  • Published:
    2023-06
  • Journal:
  • Impact factor:
    0
  • Authors:
    Shahana Ibrahim; Tri Nguyen; Xiao Fu
  • Corresponding authors:
    Shahana Ibrahim; Tri Nguyen; Xiao Fu
Learning Mixed Membership from Adjacency Graph Via Systematic Edge Query: Identifiability and Algorithm

Other publications by Xiao Fu

Fast algorithm based on the Hilbert transform for high-speed absolute distance measurement using a frequency scanning interferometry method
  • DOI:
    10.1364/ao.447750
  • Published:
    2022
  • Journal:
  • Impact factor:
    1.9
  • Authors:
    Xiuming Li; Fajie Duan; Xiao Fu; Ruijia Bao; Jiajia Jiang; Cong Zhang
  • Corresponding author:
    Cong Zhang
Localization algorithm based on minimum condition number for wireless sensor networks
  • DOI:
    10.1007/s11767-013-2115-5
  • Published:
    2013-01
  • Journal:
  • Impact factor:
    0
  • Authors:
    Du Xiaoyu; Sun Lijuan; Xiao Fu; Wang Ruchuan
  • Corresponding author:
    Wang Ruchuan
Measurement of acoustic properties for passive-material samples using multichannel inverse filter
Tensor-Based Parameter Estimation of Double Directional Massive Mimo Channel with Dual-Polarized Antennas

Other grants by Xiao Fu

CIF: Small: Latent Neural Factor Models for Radio Cartography From Bits
  • Award number:
    2210004
  • Fiscal year:
    2022
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant
CAREER: Nonlinear Factor Analysis for Sensing and Learning
  • Award number:
    2144889
  • Fiscal year:
    2022
  • Funding amount:
    $398,900
  • Project type:
    Continuing Grant
CCSS: Block-term Tensor Tools for Multi-aspect Sensing and Analysis
  • Award number:
    2024058
  • Fiscal year:
    2020
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant
Collaborative Research: MLWiNS: ANN for Interference Limited Wireless Networks
  • Award number:
    2003082
  • Fiscal year:
    2020
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant
Collaborative Research: Multimodal Sensing and Analytics at Scale: Algorithms and Applications
  • Award number:
    1808159
  • Fiscal year:
    2018
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant

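Similar NSFC (National Natural Science Foundation of China) grants

Forensic application of circadian-rhythmic small RNAs to estimating the time of bloodstain deposition
  • Award number:
  • Year approved:
    2024
  • Funding amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
  • Award number:
  • Year approved:
    2022
  • Funding amount:
    CNY 100,000
  • Project type:
    Provincial/municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
  • Award number:
    32000033
  • Year approved:
    2020
  • Funding amount:
    CNY 240,000
  • Project type:
    Young Scientists Fund
Mechanism of small RNA regulation of biocontrol function in Bacillus amyloliquefaciens FZB42
  • Award number:
    31972324
  • Year approved:
    2019
  • Funding amount:
    CNY 580,000
  • Project type:
    General Program
Mechanism by which Streptococcus mutans small RNAs link LuxS quorum sensing to biofilm formation
  • Award number:
    81900988
  • Year approved:
    2019
  • Funding amount:
    CNY 210,000
  • Project type:
    Young Scientists Fund
Function and mechanism of key gut bacterial small RNAs in the onset and progression of Crohn's disease
  • Award number:
    31870821
  • Year approved:
    2018
  • Funding amount:
    CNY 560,000
  • Project type:
    General Program
Molecular mechanism of pigeon milk secretion elucidated by small RNA sequencing
  • Award number:
    31802058
  • Year approved:
    2018
  • Funding amount:
    CNY 260,000
  • Project type:
    Young Scientists Fund
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
  • Award number:
    31772128
  • Year approved:
    2017
  • Funding amount:
    CNY 600,000
  • Project type:
    General Program
Immune regulatory mechanism of acupuncture treatment for Hashimoto's thyroiditis based on small RNA-seq
  • Award number:
    81704176
  • Year approved:
    2017
  • Funding amount:
    CNY 200,000
  • Project type:
    Young Scientists Fund
Regulation of small RNA biogenesis by rice OsSGS3 and OsHEN1 and its modulation of disease resistance
  • Award number:
    91640114
  • Year approved:
    2016
  • Funding amount:
    CNY 850,000
  • Project type:
    Major Research Plan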

Similar overseas grants

CAREER: Reaction Development and Advancing Spectroscopic Analysis for Selective Labeling and Radiolabeling of Small Molecules
  • Award number:
    2237610
  • Fiscal year:
    2023
  • Funding amount:
    $398,900
  • Project type:
    Continuing Grant
Development of Small Molecule Probes for the Selective Modification and Labeling of the Mycobacterial Cell Wall
  • Award number:
    10394129
  • Fiscal year:
    2021
  • Funding amount:
    $398,900
  • Project type:
SaTC: CORE: Small: GOALI: Predicting and Labeling Email Phishing from Social Influence Cues and User Characteristics.
  • Award number:
    2028734
  • Fiscal year:
    2020
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant
Development of a small-sized labeling method for membrane proteins and its application for heterooligomer analysis
  • Award number:
    18H02561
  • Fiscal year:
    2018
  • Funding amount:
    $398,900
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Small Bioorthogonal Gold Nanoparticles (AuNP) for In Vivo Labeling of Biosystems
  • Award number:
    452380-2013
  • Fiscal year:
    2015
  • Funding amount:
    $398,900
  • Project type:
    Vanier Canada Graduate Scholarship Tri-Council - Doctoral 3 years
Target identification by small-molecule labeling in live cells
  • Award number:
    8680550
  • Fiscal year:
    2014
  • Funding amount:
    $398,900
  • Project type:
Target identification by small-molecule labeling in live cells
  • Award number:
    8808761
  • Fiscal year:
    2014
  • Funding amount:
    $398,900
  • Project type:
Small Bioorthogonal Gold Nanoparticles (AuNP) for In Vivo Labeling of Biosystems
  • Award number:
    452380-2013
  • Fiscal year:
    2014
  • Funding amount:
    $398,900
  • Project type:
    Vanier Canada Graduate Scholarship Tri-Council - Doctoral 3 years
Small Bioorthogonal Gold Nanoparticles (AuNP) for In Vivo Labeling of Biosystems
  • Award number:
    452380-2013
  • Fiscal year:
    2013
  • Funding amount:
    $398,900
  • Project type:
    Vanier Canada Graduate Scholarship Tri-Council - Doctoral 3 years
BIGDATA: Small: DA: DCM: Labeling the World
  • Award number:
    1250793
  • Fiscal year:
    2013
  • Funding amount:
    $398,900
  • Project type:
    Standard Grant