Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora

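Basic Information

Award number: 2123624
Principal investigator: Shinji Watanabe
Amount: approx. US$193,500
Host institution: Carnegie-Mellon University
Host institution country: United States
Award type: Standard Grant
Fiscal year: 2021
Funding country: United States
Period: 2021-12-01 to 2025-05-31
Status: Not yet concluded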

Source: https://www.nsf.gov/awardsearch/showAward?AWD_ID=2123624&HistoricalAwards=false
Keywords: Collaborative Research Improving Techniques Automatic

Project Abstract

Computational tools such as automatic speech recognition (that is, the conversion of speech to text) are increasingly used to facilitate and mediate communication. Doctors dictate into computers that transcribe their speech into legible written summaries; online virtual assistants have become ubiquitous in support networks across a wide range of situations; and end users increasingly expect cell phones, navigation devices, and tools such as Alexa to understand, process, and act on their speech. Building such systems, however, currently depends on large amounts of training data (speech and text) that exist only for major languages. Developing a speech recognition system is quite challenging when only 10 hours of transcribed audio are available. One way to address this problem is transfer learning, in which a speech recognizer trained on a relatively large amount of data for one endangered language (50 hours of transcribed audio) is then extended to related languages for which only a small corpus will be developed (10 hours of transcribed audio and 90 hours of untranscribed audio). The project's objectives are both theoretical and substantive. Theoretically, the project advances natural language processing for low-resource languages and establishes a protocol for extending it to other related languages. Substantively, the project produces an unprecedented corpus of transcribed audio for five related languages, facilitating the comparative study of these languages by theoretical and descriptive linguists.
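The transfer-learning idea described above (train on a better-resourced language, then adapt to a related language with little data) can be illustrated with a deliberately tiny stand-in. The sketch below is not the project's method and involves no speech at all: it replaces the speech recognizer with a three-weight linear regression, treats the "related language" as a task whose true weights are close to the source task's, and compares fine-tuning from the source-trained weights against training from scratch under the same small budget. All function names, weight values, and sample sizes are hypothetical, and the 90 hours of untranscribed audio are not modeled.

```python
import random


def make_data(weights, n, rng):
    """Generate n noise-free (x, y) pairs for a linear 'task' with the given weights."""
    xs = [[rng.uniform(-1.0, 1.0) for _ in weights] for _ in range(n)]
    ys = [sum(w * xi for w, xi in zip(weights, x)) for x in xs]
    return xs, ys


def predict(w, x):
    """Linear model output for one input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, xs, ys):
    """Mean squared error of the linear model w on (xs, ys)."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, xs, ys, lr, steps):
    """Batch gradient descent on MSE, starting from a copy of w."""
    w = list(w)
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


rng = random.Random(0)
source_true = [2.0, -3.0, 1.5]  # hypothetical better-resourced "language"
target_true = [2.2, -2.8, 1.4]  # hypothetical related "language": similar weights

# Larger source set (stand-in for ~50 h), small target set (stand-in for ~10 h).
src_x, src_y = make_data(source_true, 200, rng)
tgt_x, tgt_y = make_data(target_true, 20, rng)
test_x, test_y = make_data(target_true, 200, rng)

# Pretrain on the source task until (near) convergence.
w_source = train([0.0, 0.0, 0.0], src_x, src_y, lr=0.2, steps=500)

# Same small fine-tuning budget for both target models.
w_transfer = train(w_source, tgt_x, tgt_y, lr=0.05, steps=20)
w_scratch = train([0.0, 0.0, 0.0], tgt_x, tgt_y, lr=0.05, steps=20)

print(f"transfer test MSE: {mse(w_transfer, test_x, test_y):.4f}")
print(f"scratch  test MSE: {mse(w_scratch, test_x, test_y):.4f}")
```

Because the target weights lie close to the source weights, starting from `w_source` leaves only a small error to correct, so the transferred model should end with a much lower test error than the scratch model after the same 20 steps. Real ASR transfer involves far larger models and typically adapts some layers or output units rather than all parameters, which this toy does not capture.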
The data and findings will be made available through the Linguistic Data Consortium at the University of Pennsylvania and the Sam Noble Oklahoma Museum of Natural History at the University of Oklahoma.

State-of-the-art automatic speech recognition (ASR) depends on a corpus of material (audio recordings with time-coded transcriptions) and on artificial intelligence systems that use neural networks to emulate human learning by interpreting raw data. This project employs what is called an "end-to-end neural network": the network is presented with input data (the acoustic speech signal) together with the prepared end result (a transcription) and learns to produce that result from the input. To accomplish this, the original corpus is divided into training (~80%), validation (~10%), and test (~10%) sets. For endangered-language documentation, the goal is not only ASR accuracy but also a reduction in the human effort needed to produce highly accurate time-coded transcriptions, which will be archived as a permanent record of the target language. The project team has already developed a highly accurate system for one phonologically difficult tonal language (character error rate of 8%) and reduced the human effort required to produce an accurate time-coded transcription by 75% (from 40 hours for a human starting from scratch to 9 hours for a human proofreading an ASR-generated transcription). In this project, the same team will explore ASR strategies for a morphologically complex agglutinative language, aiming for the same degree of accuracy and reduction in human effort. The project will also address another challenge for state-of-the-art ASR: transferring an effective system developed for one language to low-resource, virtually undocumented related languages.
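Three quantities in the paragraph above can be made concrete: the ~80/10/10 corpus split, the character error rate (CER), and the saving in human effort. The Python sketch below is illustrative only; `split_corpus`, `edit_distance`, `character_error_rate`, and `effort_reduction` are hypothetical helper names, not the project's code. It assumes utterance-level splitting with a fixed shuffle seed, and computes CER in the usual way, as character-level Levenshtein distance divided by the reference length. Note that going from 40 hours to 9 hours works out to 1 - 9/40 = 77.5%, slightly above the stated 75%.

```python
import random


def split_corpus(utterance_ids, seed=0, train_frac=0.8, valid_frac=0.1):
    """Shuffle utterance IDs and split them into train/validation/test (~80/10/10)."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_valid = int(len(ids) * valid_frac)
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]


def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions + deletions + insertions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = character-level edit distance / number of reference characters."""
    return edit_distance(ref, hyp) / len(ref)


def effort_reduction(hours_from_scratch, hours_proofing):
    """Fraction of human transcription time saved by proofreading ASR output."""
    return 1.0 - hours_proofing / hours_from_scratch


train_ids, valid_ids, test_ids = split_corpus(range(1000))
print(len(train_ids), len(valid_ids), len(test_ids))  # 800 100 100
print(character_error_rate("kitten", "sitting"))       # 0.5
print(f"{effort_reduction(40, 9):.1%}")                # 77.5%
```

Splitting by individual utterance is the simplest choice; for documentary corpora one might instead split by recording session or speaker so that test material is never heard during training, but the abstract does not say which approach the project uses.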
Should the project succeed, it will serve as a model for similar efforts with other languages and language groups.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

DOI: 10.21437/interspeech.2023-1316
Year: 2023
Venue: Interspeech 2023 (ISCA)
Impact factor: 0
Authors: Shi, Jiatong; Berrebbi, Dan; Chen, William; Hu, En-Pei; Huang, Wei-Ping; Chung, Ho-Lam; Chang, Xuankai; Li, Shang-Wen; Mohamed, Abdelrahman; Lee, Hung-yi
Corresponding author: Lee, Hung-yi

Other Publications by Shinji Watanabe

Discriminative feature transforms using differenced maximum mutual information

DOI: 10.1109/icassp.2012.6288981
Year: 2012
Venue: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Impact factor: 0
Authors: Marc Delcroix; A. Ogawa; Shinji Watanabe; T. Nakatani; Atsushi Nakamura
Corresponding author: Atsushi Nakamura

CMU’s IWSLT 2022 Dialect Speech Translation System

Year: 2022
Venue: International Workshop on Spoken Language Translation
Impact factor: 0
Authors: Brian Yan; Patrick Fernandes; Siddharth Dalmia; Jiatong Shi; Yifan Peng; Dan Berrebbi; Xinyi Wang; Graham Neubig; Shinji Watanabe
Corresponding author: Shinji Watanabe

Molecular-Level Chemistry of Materials Conversion for Effective Use of Sunlight

Year: 2019
Impact factor: 0
Authors: Hiroshi Tanimura; Shinji Watanabe; Tetsu Ichitsubo; 山本雅納
Corresponding author: 山本雅納

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

DOI: 10.21437/interspeech.2021-1860
Year: 2021
Venue: Interspeech 2021
Authors: Patrick K. O'Neill; Vitaly Lavrukhin; Somshubra Majumdar; V. Noroozi; Yuekai Zhang; Oleksii Kuchaiev; Jagadeesh Balam; Yuliya Dovzhenko; Keenan Freyberg; Michael D. Shulman; Boris Ginsburg; Shinji Watanabe; G. Kucsko
Corresponding author: G. Kucsko