CAREER: Modeling Spoken Language Without Parallel Text Annotations
Basic Information
- Award Number: 2238605
- Principal Investigator:
- Amount: $600,000
- Institution:
- Institution Country: United States
- Award Type: Continuing Grant
- Fiscal Year: 2023
- Funding Country: United States
- Project Period: 2023-02-01 to 2028-01-31
- Project Status: Active
- Source:
- Keywords:
Project Abstract
Automatic speech recognition and understanding technology has been widely adopted in personal digital assistants, automatic transcription of videos and meetings, and many other applications. Building these systems requires massive datasets of speech audio that has been human-transcribed into text. When sufficient data is available for a particular domain, modern models based on deep neural networks are capable of highly accurate speech recognition and downstream language understanding. However, for the vast majority of the world's 7,000 languages, and their even more numerous dialects, large-scale annotated datasets simply do not exist, preventing speech technology from serving these languages and their speakers. Inspired by the fact that humans learn to speak long before they can read or write, this CAREER project explores a new paradigm for speech processing that does not rely on transcribed speech. Instead, it develops new models that learn spoken language directly from speech audio, and applies these models to tasks including building speech recognizers without transcribed speech and automatically translating speech from one language into another. These advances fit within a larger movement in the research community to dramatically reduce the cost and increase the availability of speech recognition and understanding technology for many more languages and users than are served today. This project leverages self-supervised and multimodal learning approaches to automatically discover linguistic structure (phones, words, phrases, etc.) in the raw speech signal, which can be treated as "pseudo-text" and used in place of conventional text for downstream tasks. It develops new neural network layers for attention-based segmentation of speech, applied in a hierarchical fashion to discover speech units at multiple levels of abstraction.
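To make the segmentation idea concrete, the following is a deliberately simplified sketch: frames whose representations differ sharply from their neighbors are treated as unit boundaries. The function names, the cosine-dissimilarity criterion, the threshold, and the toy data are all illustrative assumptions, not the project's actual attention-based architecture.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_by_dissimilarity(frames, threshold=0.5):
    """frames: list of per-frame embedding vectors.
    Returns (start, end) spans of discovered segments."""
    # place a boundary wherever consecutive-frame similarity dips
    boundaries = [0]
    for t in range(len(frames) - 1):
        if cosine(frames[t], frames[t + 1]) < threshold:
            boundaries.append(t + 1)
    boundaries.append(len(frames))
    # segments are the spans between successive boundaries
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]

# toy utterance: two steady regions with an abrupt change at frame 5
frames = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5
print(segment_by_dissimilarity(frames))  # → [(0, 5), (5, 10)]
```

Applying the same boundary detector to the resulting segment-level representations would yield coarser units, which is the hierarchical application the abstract describes.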
A second novel technique involves adding self-prediction layers and training objectives to a model using the segmentation layers, where the higher layers that would capture word-like structure attempt to predict the tokenization of lower layers that capture sub-word structure. In this way, the model can automatically learn a pronunciation lexicon that captures the compositional relationship between the different tiers of discovered speech units. The project applies these techniques to three downstream applications that are steadily growing in importance in the speech field: unsupervised speech recognition, textless speech-to-speech translation, and textless generation speech for dialog and image captioning.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
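The compositional-lexicon idea can be illustrated with a minimal sketch: each discovered word-level unit is associated with the sequence of sub-word units it spans, so the model in effect accumulates a pronunciation lexicon. All IDs, names, and data here are invented for illustration and do not reflect the project's actual training objective.

```python
from collections import defaultdict

def build_lexicon(subword_tokens, word_segments, word_ids):
    """subword_tokens: per-frame sub-word unit IDs.
    word_segments: (start, end) frame spans of word-level units.
    word_ids: the word-level unit ID assigned to each span.
    Returns a mapping from word unit to its observed 'pronunciations'."""
    lexicon = defaultdict(set)
    for (start, end), wid in zip(word_segments, word_ids):
        # collapse consecutive repeated frames into one token per unit
        pron = []
        for tok in subword_tokens[start:end]:
            if not pron or pron[-1] != tok:
                pron.append(tok)
        lexicon[wid].add(tuple(pron))
    return dict(lexicon)

# toy example: two word-level units over a 6-frame utterance
subwords = [3, 3, 7, 7, 2, 2]
segments = [(0, 4), (4, 6)]
ids = [10, 11]
print(build_lexicon(subwords, segments, ids))  # {10: {(3, 7)}, 11: {(2,)}}
```

In the project's framing, the higher layers would learn to predict these sub-word sequences rather than simply tabulate them, but the table makes the tier-to-tier composition explicit.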
Project Outcomes
- Journal articles: 3
- Monographs: 0
- Research awards: 0
- Conference papers: 0
- Patents: 0
Syllable Discovery and Cross Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Peng, Puyuan; Li, Shang-Wen; Rasanen, Okko; Mohamed, Abdelrahman; Harwath, David
- Corresponding author: Harwath, David
Audio-Visual Neural Syntax Acquisition
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Lai, Cheng-I Jeff; Shi, Freda; Peng, Puyuan; Kim, Yoon; Gimpel, Kevin; Chang, Shiyu; Chuang, Yung-Sung; Bhati, Saurabhchand; Cox, David; Harwath, David
- Corresponding author: Harwath, David
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
- DOI: 10.48550/arxiv.2305.11095
- Publication date: 2023-05
- Journal:
- Impact factor: 0
- Authors: Puyuan Peng; Brian Yan; Shinji Watanabe; David F. Harwath
- Corresponding author: Puyuan Peng; Brian Yan; Shinji Watanabe; David F. Harwath
Other Publications by David Harwath
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Puyuan Peng; Shang-Wen Li; Okko Rasanen; Abdelrahman Mohamed; David Harwath
- Corresponding author: David Harwath
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
- DOI:
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Yi-Jen Shih; Hsuan-Fu Wang; Heng-Jui Chang; Layne Berry; Hung-yi Lee; David Harwath
- Corresponding author: David Harwath
Learning to Map Efficiently by Active Echolocation
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Xixi Hu; Senthil Purushwalkam; David Harwath; Kristen Grauman
- Corresponding author: Kristen Grauman
Interface Design for Self-Supervised Speech Models
- DOI:
- Publication date: 2024
- Journal:
- Impact factor: 0
- Authors: Yi-Jen Shih; David Harwath
- Corresponding author: David Harwath
United States Patent US 7,288,319 B2
- DOI:
- Publication date:
- Journal:
- Impact factor: 0
- Authors: Layne Berry; Yi; Hsuan; Heng; Hung; David Harwath
- Corresponding author: David Harwath
Similar NSFC Grants
Galaxy Analytical Modeling Evolution (GAME) and cosmological hydrodynamic simulations
- Award Number:
- Award Year: 2025
- Amount: RMB 100,000
- Award Type: Provincial/municipal project
Similar Overseas Grants
Modeling the processing mechanisms of temporal structures in spoken Japanese
- Award Number: 18K11366
- Fiscal Year: 2018
- Amount: $600,000
- Award Type: Grant-in-Aid for Scientific Research (C)

RI: Small: Modeling Idiosyncrasies of Speech for Automatic Spoken Language Processing
- Award Number: 1617176
- Fiscal Year: 2016
- Amount: $600,000
- Award Type: Standard Grant

EAGER: Collaborative Research: Modeling Distinctive Partners in Adaptive Spoken Dialog
- Award Number: 1044693
- Fiscal Year: 2010
- Amount: $600,000
- Award Type: Standard Grant

EAGER: Collaborative Research: Modeling Distinctive Partners in Adaptive Spoken Dialog
- Award Number: 1043665
- Fiscal Year: 2010
- Amount: $600,000
- Award Type: Standard Grant

Study of automatic style transformation of spoken transcripts based on statistical modeling of spontaneous speech
- Award Number: 21700193
- Fiscal Year: 2009
- Amount: $600,000
- Award Type: Grant-in-Aid for Young Scientists (B)

User Modeling on Temporal Changes in User Behaviors for Spoken Dialogue Systems
- Award Number: 21700164
- Fiscal Year: 2009
- Amount: $600,000
- Award Type: Grant-in-Aid for Young Scientists (B)

Eye Gaze in Salience Modeling for Robust Spoken Language Understanding
- Award Number: 0535112
- Fiscal Year: 2005
- Amount: $600,000
- Award Type: Standard Grant

Modeling Real-Time Interpersonal Interaction in Spoken Communication
- Award Number: 0415150
- Fiscal Year: 2004
- Amount: $600,000
- Award Type: Continuing Grant

CAREER: Modeling and Optimizing User-Centric Mixed-Initiative Spoken Dialog Systems
- Award Number: 0238514
- Fiscal Year: 2003
- Amount: $600,000
- Award Type: Continuing Grant

Co-Operative Study on Modeling and Machine Implementation of Spoken Language Conversation
- Award Number: 02305010
- Fiscal Year: 1990
- Amount: $600,000
- Award Type: Grant-in-Aid for Co-operative Research (A)