Doctoral Dissertation Research: Evaluating the Promise and Pitfalls of Benchmarking in Machine Learning Research

Basic Information

Award number:
2124685
Principal investigator:
Jacob Foster
Amount:
$20,000
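Host institution:
University of California-Los Angeles
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Project period:
2021-08-01 to 2023-07-31
Project status:
Completed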

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2124685&HistoricalAwards=false
Keywords:
Doctoral Dissertation Research Evaluating Promise

Project Abstract

This award is funded in whole or in part under the American Rescue Plan Act of 2021 (Public Law 117-2). The scientific and commercial success of machine learning (ML) has spurred government and corporate sponsors to invest billions of dollars in machine learning research. Despite this massive investment, there is limited quantitative research on how the ML field measures progress: a process called “benchmarking.” Benchmarking is the act of comparing algorithms on a quantitative metric after training them on the same benchmark dataset. Benchmarks organize ML researchers around common tasks. Achieving “state of the art” performance on an important benchmark can spark new research trajectories and advance careers: consider the 2012 success of “AlexNet” in a prominent computer vision task, which helped to launch current interest in deep learning. However, the practice of benchmarking has already engendered criticism that this near-ubiquitous research culture does not push the field towards socially beneficial outcomes, and leads to overinvestment in methods that maximize performance on academic datasets but are environmentally unsustainable or harm the public when used in the real world. This dissertation research will provide a comprehensive analysis of the strengths and weaknesses of benchmarking practices with respect to several public aims: accelerating innovation in science, increasing equity within the field, and promoting ethical research (i.e., an orientation toward research that benefits society and avoids harms). By blending sociological analysis, computational methods for extracting and analyzing benchmarking data from thousands of papers, and in-depth qualitative interviews, this research will produce an understanding of benchmarking culture in ML research that combines breadth and quantitative rigor with depth and interpretive nuance. This project has significant implications for government and corporate funders, researchers, and society more broadly.
The dissertation consists of three subprojects. The first subproject explores evidence that benchmarking culture has stymied innovation by favoring utilization of the same datasets across multiple tasks and by incentivizing researchers to underinvest in nascent benchmarks and overinvest in mature ones. The second subproject explores how patterns in the adoption of benchmarks and rewards for state-of-the-art performance interact with status and resources to create inequities in the field. It tests the hypothesis that high-status researchers and institutions have disproportionate power to set the field’s research agenda by introducing benchmarks, while garnering disproportionate citations for state-of-the-art achievements. Both of these phenomena have the potential to create a “Matthew Effect” that disadvantages under-represented and under-resourced researchers/institutions. These subprojects use network science, natural language processing, and manual coding to create a large dataset of benchmarks and progress on those benchmarks across multiple ML task communities. The third subproject consists of qualitative interviews with ML researchers across career stages and expertise to gain first-hand perspectives on benchmarking culture and assess reforms to improve research ethics and societal outcomes. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
