Collaborative Research: RI: Medium: From Acoustic Signal to Morphosyntactic Analysis in One End-to-End Neural System
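Basic information
- Award number: 2211952
- Principal investigator:
- Amount: $301,900
- Host institution:
- Host institution country: United States
- Project category: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Project period: 2022-08-01 to 2026-07-31
- Project status: Active (not yet concluded)
- Source:
- Keywords:
Project abstract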
There are approximately 7,000 languages in the world today, but this number is declining precipitously. Even many languages that currently have thousands upon thousands of speakers are likely to fall out of use within a generation. For the speakers of these languages, this represents a tragic loss of cultural and linguistic heritage, which are important anchors of their social identity. Each language also carries irreplaceable data about language as a phenomenon of human behavior: the limits of its variation and the patterns in its structure and development. Linguists and language activists are currently working to quickly and comprehensively document as many languages as possible. In the unfortunate event that a language fades from use, documentation ensures that its data will remain available for future cultural or scientific analysis. This project partially automates the process of language documentation using tools from Natural Language Processing and Machine Learning. It differs from similar projects in using one integrated system to process the sounds of speech and the structure of words, instead of using two or more separate components. With the collaboration of native speaker scholars, the researchers are applying their methodology to four languages: Highland Puebla Nahuatl, Yoloxóchitl Mixtec, San Pedro Amuzgos Amuzgo, and North Slope Iñupiaq.
The proposed research will dramatically transform the landscape of automatic morphosyntactic and morphophonological analysis by introducing an end-to-end system that consumes speech as an input and produces interlinear annotations as an output. The research team proposes to build an end-to-end system, a single neural net that, with small amounts of labeled data produced by native speaker linguists, can directly convert recorded speech to analyzed text, producing four outputs: (1) surface transcription, (2) morphological segmentation of surface forms, (3) an underlying or canonical form for each morpheme, and (4) a gloss or standardized label for each morpheme. The proposed single end-to-end neural network represents the first attempt to integrate the four aforementioned tasks into a single neural network, avoiding the error-propagation problems that have plagued earlier attempts at creating a pipeline and mitigating the complexity of the technology for end-users. The researchers also propose innovative ways to incorporate linguistic knowledge into neural networks, including the use of differentiable weighted finite-state transducers, which are independently motivated by an iterative self-training architecture. This approach to iterative self-training, in its own right, will represent an advance in machine learning: a new algorithm for upweighting words and morphemes. The research also makes significant contributions to computational morphology. It includes a simple but expressive modification to existing schemes for segmentation and glossing, specifically for the representation of discontinuous morphemes. Furthermore, the proposal extends popular approaches to morphological analysis (e.g., UniMorph) by systematically addressing derivation as well as inflection. This proposal addresses glossing of reduplication and noun-incorporation, which earlier work has not.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
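The four outputs named in the abstract can be pictured as tiers of one interlinear record. The following is only a minimal sketch of such a record, not the project's actual data format: the class names `Morpheme` and `AnalyzedWord`, the method names, and the English example word are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    surface: str    # output (2): the segment as it appears in the surface form
    canonical: str  # output (3): underlying or canonical form (hypothetical values)
    gloss: str      # output (4): gloss or standardized label


@dataclass
class AnalyzedWord:
    transcription: str        # output (1): surface transcription of the word
    morphemes: List[Morpheme]

    def is_consistent(self) -> bool:
        # Assumption: purely concatenative morphology. Discontinuous
        # morphemes, which the abstract mentions, would need a richer check.
        return "".join(m.surface for m in self.morphemes) == self.transcription

    def interlinear(self) -> str:
        # Render the four tiers as four aligned lines of text.
        return "\n".join([
            self.transcription,
            "-".join(m.surface for m in self.morphemes),
            "-".join(m.canonical for m in self.morphemes),
            "-".join(m.gloss for m in self.morphemes),
        ])


# Illustrative English example, not data from any of the project languages.
word = AnalyzedWord(
    "wolves",
    [Morpheme("wolv", "wolf", "wolf"), Morpheme("es", "s", "PL")],
)
print(word.interlinear())
```

Keeping the surface segment and the canonical form as separate fields reflects the distinction the abstract draws between outputs (2) and (3): the stem surfaces as "wolv" while its canonical form is "wolf".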
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Generalized Glossing Guidelines: An Explicit, Human- and Machine-Readable, Item-and-Process Convention for Morphological Annotation
- DOI:10.18653/v1/2023.sigmorphon-1.7
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Mortensen, David R.; Gulsen, Ela; He, Taiqi; Robinson, Nathaniel; Amith, Jonathan; Tjuatja, Lindia; Levin, Lori
- Corresponding author: Levin, Lori
Other grants by Jonathan Amith
A comparative database for biologists, botanists, and linguists
- Award number: 2109821
- Fiscal year: 2021
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora
- Award number: 2123578
- Fiscal year: 2021
- Award amount: $301,900
- Project category: Standard Grant
Documentation of discourse and cultural activities to advance scientific knowledge of an endangered tonal language
- Award number: 1761421
- Fiscal year: 2018
- Award amount: $301,900
- Project category: Continuing Grant
Collaborative Research: Contributions of Endangered Language Data for Advances in Technology-enhanced Speech Annotation
- Award number: 1500595
- Fiscal year: 2015
- Award amount: $301,900
- Project category: Standard Grant
Documenting Traditional Ecological Knowledge in the Sierra Nororiental de Puebla, Mexico, in Synchronic and Diachronic Perspectives
- Award number: 1401178
- Fiscal year: 2014
- Award amount: $301,900
- Project category: Standard Grant
Corpus and lexicon development: Endangered genres of discourse and domains of cultural knowledge in Tu'un isavi (Mixtec) of Yoloxochitl, Guerrero
- Award number: 0966462
- Fiscal year: 2010
- Award amount: $301,900
- Project category: Standard Grant
Nahuatl Language Documentation Project: Sierra Norte de Puebla [ISO 639 azz]
- Award number: 0756536
- Fiscal year: 2008
- Award amount: $301,900
- Project category: Standard Grant
Guerrero Nahuatl Language Documentation and Lexicon Enrichment Project
- Award number: 0504164
- Fiscal year: 2005
- Award amount: $301,900
- Project category: Standard Grant
Similar National Natural Science Foundation of China (NSFC) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Award amount: CNY 0
- Project category: Provincial/municipal-level project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Award amount: CNY 240,000
- Project category: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Award amount: CNY 240,000
- Project category: Special fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Award amount: CNY 240,000
- Project category: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Award amount: CNY 450,000
- Project category: General Program
Similar overseas grants
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312841
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312842
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313131
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Medium: Lie group representation learning for vision
- Award number: 2313151
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Continuing Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312840
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Small: Deep Constrained Learning for Power Systems
- Award number: 2345528
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232298
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232055
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant
Collaborative Research: RI: Medium: Lie group representation learning for vision
- Award number: 2313149
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Continuing Grant
Collaborative Research: CompCog: RI: Medium: Understanding human planning through AI-assisted analysis of a massive chess dataset
- Award number: 2312374
- Fiscal year: 2023
- Award amount: $301,900
- Project category: Standard Grant