Collaborative Research: Deep learning speech recognition for documenting Seneca, a Native American language, and other acutely under-resourced languages
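Basic information
- Award number: 1761477
- Principal investigator:
- Amount: $90,400
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2018
- Funding country: United States
- Project period: 2018-06-01 to 2022-05-31
- Project status: Completed
- Source:
- Keywords:
Project abstract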
The Iroquoian languages of the Six Nations Confederacy, or the Haudenosaunee people, were first encountered when the explorer Jacques Cartier sailed up the St. Lawrence River in the 1530s. Dictionaries and grammars exist, but the basic elements of documentation for every language should also include annotated texts of diverse genres. As currently recognized in The Native American Languages Act, passed by the U.S. Congress in 1990, languages spoken by the indigenous peoples of North America have a unique status and importance. This project will bring together members of the Seneca Nation of Indians with a team of linguistics and computing researchers to record elders speaking Seneca, an Iroquoian language that is particularly endangered. The team will develop software to accurately and efficiently transcribe these recordings using automatic speech recognition (ASR), the technology behind digital personal assistants like Siri or Alexa. Seneca has an exceedingly complex word structure, known as polysynthesis, in which a word is equivalent to a clause or sentence. Such languages challenge ASR systems, which are generally designed to recognize words over a constrained vocabulary. This project will advance scientific knowledge by developing novel methods for generating synthetic text data to augment the existing written resources required to model this complexity. Broader impacts include the availability of the newly documented materials for language revitalization and scientific investigation. The project will provide undergraduates, graduate students, and young adults from the Seneca Nation with valuable STEM experience and will broaden the participation of Native Americans in the language and computing sciences, including by supporting a Seneca doctoral student in computer science. The computational tools and methodologies developed will be accessible to others who are working to document and analyze low-resource languages, many spoken in regions of critical importance for national security.
Spontaneous speech in Seneca contains long, complex words but also many short particles that are essential to understanding the discourse. Crucial for segmenting and annotating spoken Seneca are the prosodic patterns that occur in longer utterances, involving both metrical and tonal components. Most ASR frameworks would be challenged by the large vocabulary size that a polysynthetic morphological system tends to yield. In addition, ASR systems do not typically model high-level prosodic information. Seneca has little available text data derived from spontaneous speech, which is needed to build the predictive language models used in ASR and is invaluable to Seneca learners. Augmenting the available text data will require novel techniques for generating synthetic but plausible text, with a particular focus on neural sequence-to-sequence models. The ability of neural nets to model long-distance and hierarchical relationships will also be exploited to capture utterance-level prosodic patterns required for accurate segmentation of spontaneous speech in Seneca. By bringing together a range of expertise and by involving Seneca community members, key stakeholders in the language, the project bridges traditional linguistic methodology and computational approaches. Each new Seneca recording that is transcribed and annotated through this collaboration across disciplines will support the revitalization of the Seneca language and help to further the state of the art in low-resource language technology.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Automatic Speech Recognition for Supporting Endangered Language Documentation
- DOI:
- Publication year: 2021
- Journal:
- Impact factor: 1.8
- Authors: Prud'hommeaux, Emily; Jimerson, Robbie; Hatcher, Richard; Michelson, Karin
- Corresponding author: Michelson, Karin
Other publications by Karin Michelson
What does being a noun or a verb mean?
- DOI: 10.21248/hpsg.2020.6
- Publication year: 2020
- Journal:
- Impact factor: 0
- Authors: Jean; Karin Michelson
- Corresponding author: Karin Michelson
Argument Structure of Oneida Kinship Terms
- DOI: 10.1086/652265
- Publication year: 2010
- Journal:
- Impact factor: 0.1
- Authors: Jean; Karin Michelson
- Corresponding author: Karin Michelson
Articulation without acoustics: "Soundless" vowels in Oneida and Blackfoot
- DOI:
- Publication year: 2012
- Journal:
- Impact factor: 0
- Authors: B. Gick; Heather Bliss; Karin Michelson; Bosko Radanov
- Corresponding author: Bosko Radanov
Rules, constraints, and lexical phonology in Glenoe Scots James Myers
- DOI:
- Publication year: 2000
- Journal:
- Impact factor: 0
- Authors: Glenoe Scots; J. Myers; D. Kemmerer; Karin Michelson; Jane S. Tsay
- Corresponding author: Jane S. Tsay
Specialized-domain grammars and the architecture of grammars: Possession in Oneida
- DOI:
- Publication year: 2020
- Journal:
- Impact factor: 1.1
- Authors: Jean; Karin Michelson
- Corresponding author: Karin Michelson
Other grants held by Karin Michelson
Oneida Prosodic Categories Above the Word
- Award number: 9222382
- Fiscal year: 1993
- Amount: $90,400
- Project type: Standard Grant
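Similar National Natural Science Foundation of China (NSFC) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Amount: CNY 450,000
- Project type: General Program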
Similar overseas grants
Collaborative Research: Geophysical and geochemical investigation of links between the deep and shallow volatile cycles of the Earth
- Award number: 2333102
- Fiscal year: 2024
- Amount: $90,400
- Project type: Continuing Grant
Collaborative Research: Resolving the LGM ventilation age conundrum: New radiocarbon records from high sedimentation rate sites in the deep western Pacific
- Award number: 2341426
- Fiscal year: 2024
- Amount: $90,400
- Project type: Continuing Grant
Collaborative Research: Resolving the LGM ventilation age conundrum: New radiocarbon records from high sedimentation rate sites in the deep western Pacific
- Award number: 2341424
- Fiscal year: 2024
- Amount: $90,400
- Project type: Continuing Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Amount: $90,400
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Amount: $90,400
- Project type: Standard Grant
Collaborative Research: Resolving the LGM ventilation age conundrum: New radiocarbon records from high sedimentation rate sites in the deep western Pacific
- Award number: 2341425
- Fiscal year: 2024
- Amount: $90,400
- Project type: Continuing Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Amount: $90,400
- Project type: Standard Grant
Collaborative Research: Geophysical and geochemical investigation of links between the deep and shallow volatile cycles of the Earth
- Award number: 2333101
- Fiscal year: 2024
- Amount: $90,400
- Project type: Standard Grant
Collaborative Research: FET: Medium: Compact and Energy-Efficient Compute-in-Memory Accelerator for Deep Learning Leveraging Ferroelectric Vertical NAND Memory
- Award number: 2312886
- Fiscal year: 2023
- Amount: $90,400
- Project type: Standard Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312841
- Fiscal year: 2023
- Amount: $90,400
- Project type: Standard Grant