
Collaborative Research: NCS-FO: Studying language in the brain in the modern machine learning era


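Basic information

Award number:
2124052
Principal investigator:
Boris Katz
Amount:
$0.5 million
Host institution:
Massachusetts Institute of Technology
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Project period:
2021-09-15 to 2025-08-31
Project status:
Not yet concluded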

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2124052&HistoricalAwards=false
Keywords:
Collaborative Research NCS FO Studying

Project abstract

The project will investigate how the brain processes language, one of the most consequential questions we can ask. Language skills significantly affect lifetime income and social disparities. The loss of language or a halt in its development can be devastating. At the same time, insights from the brain that could improve machines’ understanding of language open up new applications, from web search to voice assistants to, one day, robots that can help us in our daily lives. Using neuroscience to understand what happens during language use, what goes right and wrong, and what linguistic structures and theories are used by the brain would be revolutionary. To do this, neuroscientists use many of the same tools as those created for machine learning. Those machine learning tools have improved tremendously using large datasets, changing what machines are capable of; yet the neuroscience of language has been largely unable to reap these rewards. We will provide the data, the new methods, and the metrics required to enable neuroscience to scale up and take advantage of modern machine learning. At the same time, scale in machine learning has democratized access to tools; scientific communities can investigate questions that pertain to them. Today, only a few groups have the resources to collect data and investigate questions around the neuroscience of language, leaving many communities in the dark. A large-scale central repository of data, tools, and benchmarks will democratize access to the study of language in the brain, one of the core aspects of what makes us human.
Our technical goal is to produce the largest dataset, by a factor of 1000, for investigating the neuroscience of language, along with new types of models that exploit this data and benchmarks that formalize linguistic questions to derive insights about the language network and the structure of language.
Thus far, investigations in the neuroscience of language have only been able to provide small snapshots of the language network on different datasets, making it hard to build a coherent picture. A single large-scale dataset with precise benchmarks that formally define what hypotheses in linguistics mean in terms of neural data will enable the community to ask many questions of the same data, allowing for a synthesis of the structure and operation of the language network. At the same time, large-scale data is known to be required to probe the understanding of artificial language models. It is likely that if tens of thousands of sentences are required to probe an artificial language neural network and derive meaningful insight, the same scale of data will be required per subject to probe biological language neural networks. Formalizing questions around benchmarks on a common dataset has resulted in astronomical progress in many fields, from parsing (Penn Treebank) to image recognition (ImageNet); we will apply this same methodology to the neuroscience of language. This process is so efficient, in part, because it casts questions in a way that non-domain experts can access: machine learning experts need not concern themselves with linguistic minutiae, and they will be able to improve decoding of language from the brain and thereby drive insights by following existing protocols. By putting forward linguistic questions in a precisely defined manner, we will also enable cross-disciplinary collaboration: linguists will be able to propose benchmarks that pose questions about the performance of classifiers or the mapping between networks and neural activity. These benchmarks will provide a common mathematical language by which different fields can express their key questions, in a way that has not been possible before because no dataset existed that could even support such work.
We see a future where neuroscience, linguistics, natural language processing, and machine learning act as an integrated whole to ask the right questions about language in the brain, to develop new tools that support answering those questions, and to probe a large-scale resource that supports building a coherent picture of the language system.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.


Project outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Boris Katz

Annotating the World Wide Web using Natural Language

DOI:
10.5555/2856695.2856709
Published:
1997-06
Journal:
Impact factor:
0
Authors:
Boris Katz
Corresponding author:
Boris Katz

Learning to Answer Questions from Wikipedia Infoboxes

DOI:
Published:
2016
Journal:
Conference on Empirical Methods in Natural Language Processing
Impact factor:
0
Authors:
Alvaro Morales; Varot Premtoon; Cordelia Avery; Sue Felshin; Boris Katz
Corresponding author:
Boris Katz

2 Semantic Annotation of Discourse Structure

DOI:
Published:
Journal:
Impact factor:
0
Authors:
Boris Katz; K. Nagao
Corresponding author:
K. Nagao

REXTOR: A System for Generating Relations from Natural Language

DOI:
10.3115/1117755.1117764
Published:
2000
Journal:
Impact factor:
0
Authors:
Boris Katz; Jimmy J. Lin
Corresponding author:
Jimmy J. Lin

The role of context in question answering systems

DOI:
Published:
2003
Journal:
CHI Extended Abstracts
Impact factor:
0
Authors:
Jimmy J. Lin; Dennis Quan; Vineet Sinha; Karun Bakshi; David Huynh; Boris Katz; David R Karger
Corresponding author:
David R Karger