STREAMLInED: Shared Tasks for Rapid, Efficient Analysis of Many Languages in Emerging Documentation
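Basic information
- Award number: 1760475
- Principal investigator:
- Amount: $125,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2018
- Funding country: United States
- Start and end dates: 2018-06-15 to 2024-09-30
- Project status: Completed
- Source:
- Keywords:
Project abstract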
This project aligns the research interests of two separate scientific and engineering communities in order to push the boundaries of automatic speech processing technology and bring its benefits to the urgent task of endangered language documentation. Automatic speech processing technology has become familiar in the everyday lives of many speakers of English and other widely spoken languages through tools such as automatic captioning and voice-driven personal assistants. Meanwhile, linguists are rushing to document and analyze the thousands of languages that, by the end of this century, will no longer be acquired by children. Such work would be greatly assisted by automatic processing of recorded spoken endangered language data. Modern automatic speech processing tools, however, require training data sets orders of magnitude larger than what is available for endangered languages. This project will advance scientific knowledge on this problem by structuring a "shared task evaluation challenge" around language documentation-based data sets. Better language documentation puts communities in a better position to undertake language revitalization, which in turn can be a key component of community development for marginalized populations. Broader impacts also include the benefits of bringing speech technology that works with small data sets to widely spoken but understudied languages, often languages of communication in regions of geopolitical and economic importance to national interests. Language documentation projects typically begin with large quantities of recorded speech. Turning that spoken signal into a transcribed form is a major bottleneck in the language documentation process. Similarly, language archives house recorded, unanalyzed data from many languages with no living fluent speaker, but which have communities interested in revitalizing their heritage languages.
At the same time, the development of technology that can work effectively with very small training data sets is an open and interesting challenge for speech researchers. The shared task evaluation challenge framework provides the structure of a friendly competition in which different research groups can explore and compare approaches that are evaluated with standardized data and metrics. This strategy for focusing research effort has advanced the frontiers of language technology for decades. This project will apply it for the first time to the specific challenges of endangered language documentation: working with truly low-resource languages, often under noisy or otherwise imperfect recording conditions. The specific tasks the challenge will focus on include: identifying the language and speaker of each segment of a recording, identifying the genre (e.g. storytelling vs. dialogue) of segments of recordings, and aligning short partial transcriptions to the spoken recordings. The researchers will prepare the data (based on existing data sets identified in language archives), set up functioning baseline systems that task participants can use for comparison and/or build on further, establish evaluation metrics, and execute the shared task. The shared task structure will encourage and support participants in making their contributions open source, with an eye towards ensuring they are available to language documentation researchers. The project will also include outreach to the language documentation community in order to train such researchers in the use of the technology developed. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Investigating Speaker Diarization of Endangered Language Data
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Developing a Shared Task for Speech Processing on Endangered Languages
- DOI: 10.33011/computel.v1i.967
- Publication date: 2021-03
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow; Emily Ahn; Emily M. Bender
- Corresponding authors: Gina-Anne Levow; Emily Ahn; Emily M. Bender
Other publications by Gina-Anne Levow
The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition
- DOI:
- Publication date: 2006-07
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Identifying local corrections in human-computer dialogue
- DOI: 10.21437/interspeech.2004-146
- Publication date: 2004
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words
- DOI: 10.3115/1118935.1118945
- Publication date: 2003-07
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Learning to Speak to a Spoken Language System: Vocabulary Convergence in Novice Users
- DOI:
- Publication date: 2003
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
- DOI:
- Publication date: 2004
- Journal:
- Impact factor: 0
- Authors: Gina-Anne Levow
- Corresponding author: Gina-Anne Levow
Other grants held by Gina-Anne Levow
EL-STEC: Shared Task Evaluation Campaigns with Endangered Language Data
- Award number: 1500157
- Fiscal year: 2015
- Amount: $125,000
- Project type: Standard Grant
EAGER: ATAROS: Automatic Tagging and Recognition of Stance
- Award number: 1351034
- Fiscal year: 2013
- Amount: $125,000
- Project type: Standard Grant
Similar overseas grants
AUC-GRANTED: Advancing Transformation of the Research Enterprise through Shared Resource Support Model for Collective Impact and Synergistic Effect.
- Award number: 2341110
- Fiscal year: 2024
- Amount: $125,000
- Project type: Cooperative Agreement
Haptic Shared Control Systems And A Neuroergonomic Approach To Measuring System Trust
- Award number: EP/Y00194X/1
- Fiscal year: 2024
- Amount: $125,000
- Project type: Research Grant
Shared Spaces: The How, When, and Why of Adolescent Intergroup Interactions
- Award number: ES/T014709/2
- Fiscal year: 2024
- Amount: $125,000
- Project type: Research Grant
I(eye)-SCREEN: A real-world AI-based infrastructure for screening and prediction of progression in age-related macular degeneration (AMD) providing accessible shared care
- Award number: 10102692
- Fiscal year: 2024
- Amount: $125,000
- Project type: EU-Funded
OpenBioMAPS: shared tools for accelerating UK bio-manufacturing
- Award number: BB/Y007808/1
- Fiscal year: 2024
- Amount: $125,000
- Project type: Research Grant
A Secure Hub for Access, Reliability, and Exchange of Data (SHARED)
- Award number: 2346746
- Fiscal year: 2024
- Amount: $125,000
- Project type: Standard Grant
Shared and distinct genetic architecture of autoimmune and hormonal alopecias
- Award number: MR/X030466/1
- Fiscal year: 2024
- Amount: $125,000
- Project type: Research Grant
Shared Post-Human Imagination: Human-AI Collaboration in Media Creation
- Award number: AH/Z50564X/1
- Fiscal year: 2024
- Amount: $125,000
- Project type: Research Grant
Dynamic Shared Control for Soft Robots
- Award number: 2349067
- Fiscal year: 2024
- Amount: $125,000
- Project type: Standard Grant
CAREER: Learning and Leveraging Conventions in the Design of an Adaptive Haptic Shared Control for Steering a Semi-Automated Vehicle
- Award number: 2238268
- Fiscal year: 2023
- Amount: $125,000
- Project type: Standard Grant


















