
EAGER: Developing data and evaluation methods to assess the generality and robustness of AI systems for abstraction and analogy-making

Basic Information

Award number:
2139983
Principal investigator:
Melanie Mitchell
Amount:
$199,700
Institution:
Santa Fe Institute
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Period:
2021-09-01 to 2024-02-29
Status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2139983&HistoricalAwards=false
Keywords:
EAGER Developing data evaluation methods

Abstract

The ability of humans to make conceptual abstractions and analogies is at the root of many of our most important cognitive capabilities, such as learning new concepts from small numbers of examples, flexibly adapting our prior knowledge and experience to new situations, and communicating our knowledge to others. While AI has made dramatic progress over the last decade in areas such as vision, natural language processing, and robotics, current AI systems almost entirely lack the ability to form humanlike abstractions and analogies. The lack of such abilities is in part responsible for the lack of robustness in current AI systems, as well as their difficulties with extrapolating what they have learned to diverse situations. While there have been many efforts in past AI research on this topic, each individual effort has generally focused on a specific problem domain, without careful evaluation of the AI system’s robustness within its domain or its generality across different domains. In this project we will promote progress in AI by creating a web-based platform that offers a diverse set of abstraction and analogy-making challenges for the research community as well as new evaluation methods that test for generality and robustness within and across different challenge domains. We will use our platform to evaluate selected existing AI approaches and to measure human performance on our challenges in order to compare with AI systems’ performance. Our work will contribute to the AI research community by spurring new approaches and evaluation methods for abstraction and analogy-making in machines, and will contribute more broadly via the development of methods for robust and generalizable AI systems. 
Our specific research plan is to (1) curate an initial suite of idealized challenge domains inspired by Hofstadter’s letter-string analogies, Raven’s progressive matrices, Bongard problems, and Chollet’s Abstraction and Reasoning Corpus; (2) develop evaluation methods along dimensions such as robustness to variations on a particular concept, generality across domains, and scalability to more complex instances of a problem; (3) evaluate selected AI methods for abstraction and analogy using our evaluation methods; and (4) measure human benchmarks on our challenge suite using paid participants on the Amazon Mechanical Turk platform. At the end of the project period, we will have demonstrated the utility and promise of our challenge problems and evaluations, and will have gained insight into their limitations. This work will set the stage for future efforts on expanding our challenge suite, improving our evaluation metrics, and developing and evaluating novel AI approaches to abstraction and analogy. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
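
The plan names robustness to variations on a concept and generality across domains as evaluation dimensions but does not define metrics for them. The Python sketch below shows one hypothetical way such scores could be computed from per-task results. The record format, the function names, the example domain and concept labels, and the use of the worst domain's score as a single generality number are all assumptions for illustration, not the project's actual method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record format (an assumption, not taken from the project):
# (domain, concept, variant_id, solved)
sample = [
    ("letter-string", "successor", 1, True),
    ("letter-string", "successor", 2, True),
    ("letter-string", "successor", 3, False),
    ("arc", "inside-outside", 1, True),
    ("arc", "inside-outside", 2, False),
]


def robustness_by_concept(results):
    """Fraction of variants solved for each (domain, concept) pair."""
    outcomes = defaultdict(list)
    for domain, concept, _variant, solved in results:
        outcomes[(domain, concept)].append(solved)
    # True counts as 1 and False as 0, so sum() gives the number solved.
    return {key: sum(vals) / len(vals) for key, vals in outcomes.items()}


def generality_across_domains(results):
    """Mean concept-level robustness per domain, plus the worst domain's score."""
    per_domain = defaultdict(list)
    for (domain, _concept), score in robustness_by_concept(results).items():
        per_domain[domain].append(score)
    domain_scores = {d: mean(scores) for d, scores in per_domain.items()}
    # Assumption: a general system should do well in every domain,
    # so the minimum over domains is used as a single generality score.
    return domain_scores, min(domain_scores.values())


if __name__ == "__main__":
    print(robustness_by_concept(sample))
    print(generality_across_domains(sample))
```

Reporting a score per concept rather than one pooled accuracy keeps a failure on one concept's variants from being masked by successes elsewhere, which is one possible reading of the "robustness to variations" dimension in the plan.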

Project Outputs

Journal articles (4)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Evaluating Understanding on Conceptual Abstraction Benchmarks

DOI:
10.48550/arxiv.2206.14187
Published:
2022-06
Journal:
arXiv
Impact factor:
0
Authors:
Victor Vikram Odouard;M. Mitchell
Corresponding author:
Victor Vikram Odouard;M. Mitchell

How do we know how smart AI systems are?

DOI:
10.1126/science.adj5957
Published:
2023
Journal:
Science
Impact factor:
56.9
Authors:
Mitchell, Melanie
Corresponding author:
Mitchell, Melanie

Rethink reporting of evaluation results in AI

DOI:
10.1126/science.adf6369
Published:
2023
Journal:
Science
Impact factor:
56.9
Authors:
Burnell, Ryan;Schellaert, Wout;Burden, John;Ullman, Tomer D.;Martinez-Plumed, Fernando;Tenenbaum, Joshua B.;Rutar, Danaja;Cheke, Lucy G.;Sohl-Dickstein, Jascha;Mitchell, Melanie
Corresponding author:
Mitchell, Melanie

The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain