
CAREER: Data Valuation in the Wild: Theories, Algorithms, and Applications


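Basic Information

Award number:
2239622
Principal investigator:
Ruoxi Jia
Amount:
$500,000
Host institution:
Virginia Polytechnic Institute and State University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Project period:
2023-02-01 to 2028-01-31
Project status:
Ongoing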

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2239622&HistoricalAwards=false
Keywords:
CAREER Data Valuation Wild Theories

Project Abstract

Data are essential ingredients for building machine learning (ML) applications. The ability to quantify and measure the value of data is critical to the entire ML lifecycle: from identifying useful data sources, to setting priority over samples during training, to interpreting why certain model behaviors emerge during deployment. The potential of data valuation has been observed in many applications over the past few years. However, intermixed with these positive results is a vast array of applications for which existing data valuation techniques are not yet applicable, are too expensive to execute, or produce valuation results with substantial uncertainty. This project aims to enable data valuation to overcome applicability, scalability, and reproducibility challenges and to transition into a practical and reliable tool for a data-centric future. This work will have a broad impact on society by facilitating automated data quality management, designing incentives for data sharing, and improving the robustness of ML applications. This project will train undergraduate students to solve ML problems from both an algorithmic and a data quality perspective, while also creating useful school-age learning modules implemented at local, regional, and global scales. The project consists of four research tasks that advance data valuation along different dimensions: 1) designing data valuation techniques that are robust to the randomness in modern ML training algorithms; 2) developing new frameworks to determine the value of data samples given limited information about downstream learning tasks; 3) investigating principled methods to value heterogeneous and streaming data; and 4) creating and open-sourcing a unified, multi-faceted evaluation platform to spur future advances in more complex data valuation. The proposed techniques are implemented and validated on a variety of high-impact real-world applications, including autonomous driving, energy-efficient buildings, and conversational artificial intelligence.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.


Project Outputs

Journal papers (3)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

2D-Shapley: A Framework for Fragmented Data Valuation

DOI:
10.48550/arxiv.2306.10473
Published:
2023-06
Journal:
Impact factor:
0
Authors:
Zhihong Liu; H. Just; Xiangyu Chang; X. Chen; R. Jia
Corresponding authors:
Zhihong Liu; H. Just; Xiangyu Chang; X. Chen; R. Jia

LAVA: Data Valuation without Pre-Specified Learning Algorithms

DOI:
10.48550/arxiv.2305.00054
Published:
2023-04
Journal:
ArXiv
Impact factor:
0
Authors:
H. Just; Feiyang Kang; Jiachen T. Wang; Yi Zeng; Myeongseob Ko; Ming Jin; R. Jia
Corresponding authors:
H. Just; Feiyang Kang; Jiachen T. Wang; Yi Zeng; Myeongseob Ko; Ming Jin; R. Jia

Data Banzhaf: A Robust Data Valuation Framework for Machine Learning