Bootstrapping a Corpus of Endangered Languages

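Basic information

  • Award number:
    2319296
  • Principal investigator:
  • Amount:
    $436,400
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Project period:
    2023-09-01 to 2026-08-31
  • Project status:
    Ongoing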

Project abstract

Understanding how language works requires a much better understanding of languages other than English. The vast bulk of research to date has focused on English and a few closely related languages. Many of the outstanding questions, even about English itself, cannot be answered with data only from English and related languages. This project greatly expands the information available to scientists about 16 theoretically important languages, chosen because they appear to be impossible according to leading theories. Because language is central to much of human activity, the potential Broader Impacts of a better understanding of language are vast, including impact on second language education, language technologies such as speech recognition and AI, and rehabilitation of language-related disorders such as aphasia and dyslexia. The Broader Impacts of this project include supporting language preservation and revival as well as related goals of the communities that speak the 16 languages. Specifically, this project produces 'mid-scale' corpora on the order of one million words for each language. While much recent focus is on massive corpora with billions of words, mid-scale corpora played a critical role in computational, psycholinguistic, and acquisition studies of English and other high-resource languages. They are also more feasible. This project takes a 'bootstrapping' approach, first compiling, formatting, and redistributing existing materials for all 16 languages, including both text-only resources and audio paired with transcriptions. It then uses cutting-edge machine learning to develop Automatic Speech Recognition for two of the languages and assess its usefulness in speeding up transcription of new corpus materials. These new materials are then used to refine the Automatic Speech Recognition, building a 'virtuous cycle' that speeds further work. The method can also be expanded to other languages.
All materials and code are distributed for free in order to stimulate research and industry. This award is made as part of a funding partnership between the National Science Foundation and the National Endowment for the Humanities for the NSF Dynamic Language Infrastructure – NEH Documenting Endangered Languages Program. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Emily Prud'hommeaux

Collaborative Research: Deep learning speech recognition for documenting Seneca, a Native American language, and other acutely under-resourced languages
  • Award number:
    1761562
  • Fiscal year:
    2018
  • Amount:
    $436,400
  • Project type:
    Continuing Grant

Similar overseas grants

From corpus to target data as steps for automatic assessment of L2 speech: L2 French phonological lexicon of Japanese learners
  • Award number:
    23K20100
  • Fiscal year:
    2024
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Centre for Corpus Approaches to Social Science
  • Award number:
    ES/Z000025/1
  • Fiscal year:
    2024
  • Amount:
    $436,400
  • Project type:
    Research Grant
Collaborative Research: The Individual Differences Corpus: A resource for testing and refining hypotheses about individual differences in speech production
  • Award number:
    2234096
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Standard Grant
Building an Error-Annotated Corpus of Learner Indonesian and Developing an Automated Writing Support for Japanese Students Using Deep Linguistic Indonesian Parsers
  • Award number:
    23K12235
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for Early-Career Scientists
A novel system to analyze the vascular-dynamic erectile responses of corpus cavernosum
  • Award number:
    23KJ1859
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for JSPS Fellows
Creating the first oral and written corpus of Japanese learners of Spanish as a foreign language
  • Award number:
    23K00698
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for Scientific Research (C)
Acceptability corpus development for investigating the difficulty of grammar acquisition in Malay/Indonesian
  • Award number:
    23H00639
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Constructing the Knife Crime Epidemic: A Corpus-Assisted Critical Study of the Media Reporting
  • Award number:
    2885937
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Studentship
A corpus-based computational analysis of Hungarian negative emotive elements from the viewpoint of semantic changes
  • Award number:
    23KF0028
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for JSPS Fellows
A Critical Edition of Giovan Battista Strozzi the Elder's Poetic Corpus: Reassessing Strozzi's Importance in Late Renaissance Italy
  • Award number:
    23K00418
  • Fiscal year:
    2023
  • Amount:
    $436,400
  • Project type:
    Grant-in-Aid for Scientific Research (C)