
RR: CompCog: A challenge suite for statistical word segmentation

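Basic Information

Award number:
1918813
Principal investigator:
Joshua Hartshorne
Amount:
$706,500 (rounded)
Host institution:
Boston College
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2019
Funding country:
United States
Project period:
2019-09-01 to 2024-08-31
Project status:
Completed

Source: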
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1918813&HistoricalAwards=false
Keywords:
RR CompCog challenge suite statistical

Project Abstract

A central scientific puzzle is how children manage to acquire language despite limited and inconsistent explicit feedback. Numerous mathematical results seem to suggest that acquiring a language should be impossible; the fact that children do it every day reveals a deep gap in the science of learning. Some research suggests that children make considerable headway by detecting patterns in what they hear, without any explicit teaching and even without knowing what is being talked about ("statistical" or "unsupervised" learning). Indeed, much of the recent progress in "teaching" computers to understand language has made use of just this strategy. Even more compelling, numerous experiments have shown that both adults and infants are able to learn at least a little about language this way. How much they can learn remains unclear. A central difficulty is that, mathematically, there are many different methods for pattern detection, and it is unclear which one(s) humans use. This matters because some methods work better than others, so whether unsupervised pattern detection can help solve the mystery of language learning depends on which method is used. The purpose of this project is to put together a "challenge suite": a dataset that can be used to systematically evaluate and compare the possibilities. Such challenge suites have been instrumental in advancing artificial intelligence. This project also serves as a proof of concept to determine whether challenge suites are similarly beneficial for the science of learning, while providing valuable resources and training to the research community.

To develop the challenge suite, the investigators will first conduct a comprehensive, quantitative literature review (meta-analysis) focusing on the largest body of work on unsupervised pattern detection: adult statistical word segmentation. With the help of outside experimenters, the meta-analysis will be used to identify 10-15 key experiments. As a group, these experiments will establish a basic set of facts about adult statistical word segmentation that any theory must account for. For this reason, the project will focus particularly on theoretically central phenomena that distinguish different theories. To measure different aspects of linguistic pattern detection, each experiment will involve a large number of subjects (approx. 1,200 each), and a subset of 3-5 experiments will involve an even larger number (approx. 24,000 each). A tool will be developed to enable researchers to compare any mathematical theory of learning against these data, determining how well it matches human performance. To determine how a given mathematical theory could learn language, a database of transcripts of child-directed speech in 3-5 languages will be developed. Each theory will also be tested/trained on the database to see how much it could learn about those languages. The challenge suite will be made available to all researchers as a download and also through a website where researchers can submit their models and compare results against those of other models. This work will be publicized to the scientific community through a closing workshop focused on models of unsupervised word segmentation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
