CAREER: Modeling Spoken Language Without Parallel Text Annotations
Basic Information
- Award Number: 2238605
- Principal Investigator:
- Amount: $600,000
- Institution:
- Institution Country: United States
- Award Type: Continuing Grant
- Fiscal Year: 2023
- Funding Country: United States
- Project Period: 2023-02-01 to 2028-01-31
- Project Status: Active
- Source:
- Keywords:
Project Abstract
Automatic speech recognition and understanding technology has been widely adopted in personal digital assistants, automatic transcription of videos and meetings, and many other applications. Building these systems requires massive datasets of speech audio that has been human-transcribed into text. When sufficient data is available for a particular domain, modern models based on deep neural networks are capable of highly accurate speech recognition and downstream language understanding. However, for the vast majority of the world's 7,000 languages, and their even more numerous dialects, large-scale annotated datasets simply do not exist, preventing speech technology from serving these languages and their speakers. Inspired by the fact that humans learn to speak long before they can read or write, this CAREER project explores a new paradigm for speech processing that does not rely on transcribed speech. Instead, it develops new models that learn spoken language directly from speech audio, and applies these models to tasks including building speech recognizers without transcribed speech and automatically translating speech from one language into another. These advances fit within a larger movement in the research community to dramatically reduce the cost and increase the availability of speech recognition and understanding technology for many more languages and users than are served today. This project leverages self-supervised and multimodal learning approaches to automatically discover linguistic structure (phones, words, phrases, etc.) in the raw speech signal, which can be treated as "pseudo-text" and used in place of conventional text for downstream tasks. It develops new neural network layers for attention-based segmentation of speech, applied in a hierarchical fashion to discover speech units at multiple levels of abstraction.
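To make the segmentation idea concrete, the following is a deliberately simplified sketch: frames whose representations differ sharply from their neighbors are treated as unit boundaries. The function names, the cosine-dissimilarity criterion, the threshold, and the toy data are all illustrative assumptions, not the project's actual attention-based architecture.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_by_dissimilarity(frames, threshold=0.5):
    """frames: list of per-frame embedding vectors.
    Returns (start, end) spans of discovered segments."""
    # place a boundary wherever consecutive-frame similarity dips
    boundaries = [0]
    for t in range(len(frames) - 1):
        if cosine(frames[t], frames[t + 1]) < threshold:
            boundaries.append(t + 1)
    boundaries.append(len(frames))
    # segments are the spans between successive boundaries
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

# toy utterance: two steady regions with an abrupt change at frame 5
frames = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5
print(segment_by_dissimilarity(frames))  # → [(0, 5), (5, 10)]
```

Applying the same boundary detector to the resulting segment-level representations would yield coarser units, which is the hierarchical application the abstract describes.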
A second novel technique involves adding self-prediction layers and training objectives to a model using the segmentation layers, where the higher layers that would capture word-like structure attempt to predict the tokenization of lower layers that capture sub-word structure. In this way, the model can automatically learn a pronunciation lexicon that captures the compositional relationship between the different tiers of discovered speech units. The project applies these techniques to three downstream applications that are steadily growing in importance in the speech field: unsupervised speech recognition, textless speech-to-speech translation, and textless generation speech for dialog and image captioning.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
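The compositional-lexicon idea can be illustrated with a minimal sketch: each discovered word-level unit is associated with the sequence of sub-word units it spans, so the model in effect accumulates a pronunciation lexicon. All IDs, names, and data here are invented for illustration and do not reflect the project's actual training objective.

```python
from collections import defaultdict

def build_lexicon(subword_tokens, word_segments, word_ids):
    """subword_tokens: per-frame sub-word unit IDs.
    word_segments: (start, end) frame spans of word-level units.
    word_ids: the word-level unit ID assigned to each span.
    Returns a mapping from word unit to its observed 'pronunciations'."""
    lexicon = defaultdict(set)
    for (start, end), wid in zip(word_segments, word_ids):
        # collapse consecutive repeated frames into one token per unit
        pron = []
        for tok in subword_tokens[start:end]:
            if not pron or pron[-1] != tok:
                pron.append(tok)
        lexicon[wid].add(tuple(pron))
    return dict(lexicon)

# toy example: two word-level units over a 6-frame utterance
subwords = [3, 3, 7, 7, 2, 2]
segments = [(0, 4), (4, 6)]
ids = [10, 11]
print(build_lexicon(subwords, segments, ids))  # {10: {(3, 7)}, 11: {(2,)}}
```

In the project's framing, the higher layers would learn to predict these sub-word sequences rather than simply tabulate them, but the table makes the tier-to-tier composition explicit.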
Project Outcomes
- Journal articles: 3
- Monographs: 0
- Research awards: 0
- Conference papers: 0
- Patents: 0
Syllable Discovery and Cross Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Peng, Puyuan; Li, Shang-Wen; Rasanen, Okko; Mohamed, Abdelrahman; Harwath, David
- Corresponding author: Harwath, David
Audio-Visual Neural Syntax Acquisition
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Lai, Cheng-I Jeff; Shi, Freda; Peng, Puyuan; Kim, Yoon; Gimpel, Kevin; Chang, Shiyu; Chuang, Yung-Sung; Bhati, Saurabhchand; Cox, David; Harwath, David
- Corresponding author: Harwath, David
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
- DOI: 10.48550/arxiv.2305.11095
- Publication date: 2023-05
- Journal:
- Impact factor: 0
- Authors: Puyuan Peng; Brian Yan; Shinji Watanabe; David F. Harwath
- Corresponding author: Puyuan Peng; Brian Yan; Shinji Watanabe; David F. Harwath
Other Publications by David Harwath
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Puyuan Peng; Shang-Wen Li; Okko Rasanen; Abdelrahman Mohamed; David Harwath
- Corresponding author: David Harwath
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
- DOI:
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Yi-Jen Shih; Hsuan-Fu Wang; Heng-Jui Chang; Layne Berry; Hung-yi Lee; David Harwath
- Corresponding author: David Harwath
Learning to Map Efficiently by Active Echolocation
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Xixi Hu; Senthil Purushwalkam; David Harwath; Kristen Grauman
- Corresponding author: Kristen Grauman
Interface Design for Self-Supervised Speech Models
- DOI:
- Publication date: 2024
- Journal:
- Impact factor: 0
- Authors: Yi-Jen Shih; David Harwath
- Corresponding author: David Harwath
United States Patent US 7,288,319 B2
- DOI:
- Publication date:
- Journal:
- Impact factor: 0
- Authors: Layne Berry; Yi; Hsuan; Heng; Hung; David Harwath
- Corresponding author: David Harwath
Similar NSFC Grants
Galaxy Analytical Modeling Evolution (GAME) and cosmological hydrodynamic simulations
- Award Number:
- Award Year: 2025
- Amount: RMB 100,000
- Award Type: Provincial/municipal project
Similar Overseas Grants
Modeling the processing mechanisms of temporal structures in spoken Japanese
- Award Number: 18K11366
- Fiscal Year: 2018
- Amount: $600,000
- Award Type: Grant-in-Aid for Scientific Research (C)

RI: Small: Modeling Idiosyncrasies of Speech for Automatic Spoken Language Processing
- Award Number: 1617176
- Fiscal Year: 2016
- Amount: $600,000
- Award Type: Standard Grant

EAGER: Collaborative Research: Modeling Distinctive Partners in Adaptive Spoken Dialog
- Award Number: 1044693
- Fiscal Year: 2010
- Amount: $600,000
- Award Type: Standard Grant

EAGER: Collaborative Research: Modeling Distinctive Partners in Adaptive Spoken Dialog
- Award Number: 1043665
- Fiscal Year: 2010
- Amount: $600,000
- Award Type: Standard Grant

Study of automatic style transformation of spoken transcripts based on statistical modeling of spontaneous speech
- Award Number: 21700193
- Fiscal Year: 2009
- Amount: $600,000
- Award Type: Grant-in-Aid for Young Scientists (B)

User Modeling on Temporal Changes in User Behaviors for Spoken Dialogue Systems
- Award Number: 21700164
- Fiscal Year: 2009
- Amount: $600,000
- Award Type: Grant-in-Aid for Young Scientists (B)

Eye Gaze in Salience Modeling for Robust Spoken Language Understanding
- Award Number: 0535112
- Fiscal Year: 2005
- Amount: $600,000
- Award Type: Standard Grant

Modeling Real-Time Interpersonal Interaction in Spoken Communication
- Award Number: 0415150
- Fiscal Year: 2004
- Amount: $600,000
- Award Type: Continuing Grant

CAREER: Modeling and Optimizing User-Centric Mixed-Initiative Spoken Dialog Systems
- Award Number: 0238514
- Fiscal Year: 2003
- Amount: $600,000
- Award Type: Continuing Grant

Co-Operative Study on Modeling and Machine Implementation of Spoken Language Conversation
- Award Number: 02305010
- Fiscal Year: 1990
- Amount: $600,000
- Award Type: Grant-in-Aid for Co-operative Research (A)