CHS: Small: Compounding Dividends on Voice Banking

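Basic Information

Award number:
1816726
Principal investigator:
H. Timothy Bunnell
Amount:
About $104,100
Host institution:
Alfred I du Pont Hospital for Children
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2019
Funding country:
United States
Start and end dates:
2019-03-01 to 2022-12-31
Project status:
Completed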

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1816726&HistoricalAwards=false
Keywords:
CHS Small Compounding Dividends Voice

Project Abstract

Text to speech (TTS) synthesis has become a successful and ubiquitous technology. The area of application for TTS technology that motivates this research is its use for Augmentative and Alternative Communication (AAC). According to the American Speech-Language-Hearing Association (ASHA), more than two million people in the United States have severe communication disorders that impair their ability to talk. AAC devices that use TTS to create spoken output are used by many of these people to support communication. Historically, AAC users have had access to a relatively small family of generic TTS voices that are neither unique to them nor typically age- or dialect-appropriate. However, advances in TTS technology make it possible to create personalized synthetic voices that capture the unique vocal identity of AAC device users if they are able to record enough speech. This allows patients with neurodegenerative diseases such as ALS to "bank" their voice - that is, to record examples of their speech that can later be used to create a personal TTS voice - before the disease progresses to a point that they can no longer speak. Unfortunately, one major barrier to voice banking, especially for patients who may already be experiencing some difficulty speaking, is the amount of speech needed to create a natural-sounding TTS voice that fully captures the vocal identity of the voice banker. To reduce this barrier, this research will combine a type of speech synthesis called parallel formant synthesis, developed several decades ago, with deep learning computational techniques that allow a computer to learn how to control the parameters of the parallel formant synthesizer to reproduce the speech of a target speaker given examples of the target speaker's speech.
A parallel formant synthesizer will be implemented and trained to model speech recorded by voice bankers, and its output will be compared with that of other synthesizers that have been trained with the same speech data. Objective measures of similarity between synthetic and natural utterances, and subjective measures of voice quality and similarity using human listeners, will be used. This will be the first step toward building a parallel formant synthesis-based voice conversion system capable of creating TTS voices from a small number of natural speech samples, and also better able to model the expressive nature of natural speech.
Despite advances in TTS technology, there are multiple challenges to the application of this technology for voice banking. Specifically: (a) the amount of speech required (several hours) to create the most natural-sounding TTS voices using unit selection or hybrid DNN/unit selection is prohibitive for most voice bankers; (b) existing voice conversion techniques that do not require large amounts of parallel speech from the target talker generally produce speech sounding less natural and less like the target speaker when compared to concatenative synthesis; and (c) both concatenative and statistical parametric techniques produce speech that is only as expressive as the data within the speech corpus from which they have been constructed or trained. Parallel formant synthesis, because it is based explicitly on the perceptually most salient features of natural speech and lends itself to independently modeling laryngeal, suprasegmental, and segmental features, should be better able to address all three of these challenges. As proof of concept, a parallel formant synthesis (PFS) vocoder with DNN-based parameter estimation will be implemented. The vocoder will be implemented within the Merlin DNN synthesis framework so that speech output of the PFS system can be directly compared to output generated by the World and MagPhase vocoders.
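To make the idea of a parallel formant synthesizer concrete, the following is a minimal sketch and not the project's vocoder. It assumes a textbook two-pole digital resonator per formant, an impulse-train source, and hand-picked formant values; the function names, the sample rate, and the example formant table are all hypothetical. In a DNN-controlled system, the per-frame frequencies, bandwidths, and amplitudes would presumably be predicted by the network rather than typed in by hand.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def resonate(source, freq_hz, bw_hz, fs):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in source:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, fs, n_samples):
    """Crude voiced excitation: one unit impulse per pitch period (an assumption)."""
    period = int(round(fs / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def parallel_formant_frame(f0_hz, formants, fs=16000, n_samples=800):
    """Sum independently scaled resonator branches that share one source.

    formants: list of (freq_hz, bandwidth_hz, amplitude) tuples (hypothetical values).
    """
    source = impulse_train(f0_hz, fs, n_samples)
    mix = [0.0] * n_samples
    for freq_hz, bw_hz, amp in formants:
        branch = resonate(source, freq_hz, bw_hz, fs)
        for n in range(n_samples):
            mix[n] += amp * branch[n]
    return mix

# Hypothetical vowel-like settings for a 50 ms frame at 16 kHz.
frame = parallel_formant_frame(
    120.0, [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]
)
```

Because each branch has its own amplitude, the spectral level of every formant can be set independently, which is the property that separates a parallel configuration from a cascade one. Practical parallel synthesizers add refinements omitted here, such as a more realistic glottal source, a noise source, and alternating branch polarity.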
Training will be based on corpora drawn from the same set of 1600 utterances recorded by multiple individuals who have contributed their recordings to the ModelTalker project. The selected target talkers will be balanced for gender and span a wide range of English dialects, but use of speakers with noticeable levels of dysarthria will be avoided. Objective comparisons will be based on Mel-Cepstral Difference (MCD) between synthetic and natural sentence tokens that were not used in training the synthesizers. Subjective measures (Mean Opinion Scores) will be obtained from human listeners via Amazon Mechanical Turk.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
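The objective comparison named in the abstract, MCD (more commonly expanded as mel-cepstral distortion), can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's evaluation code: it assumes the natural and synthetic utterances have already been converted to mel-cepstral vectors and time-aligned frame by frame (for example by dynamic time warping), and it follows the common convention of excluding the energy coefficient c0. The function names are hypothetical.

```python
import math

def mcd_frame(ref, syn):
    """MCD in dB for one pair of mel-cepstral vectors; index 0 (c0) is skipped."""
    squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)

def mcd_utterance(ref_frames, syn_frames):
    """Average frame-level MCD over two already-aligned frame sequences."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and aligned")
    total = sum(mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)

# Toy vectors (made up for illustration): identical frames give a distortion of zero.
natural = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
print(mcd_utterance(natural, natural))
```

Lower values indicate that the synthetic token is spectrally closer to the natural one. Scoring only held-out sentences, as the abstract specifies, keeps the comparison from rewarding a synthesizer for memorizing its training data.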

Project Outcomes

Journal articles (1)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Unsupervised Training of a DNN-Based Formant Tracker

DOI:
10.21437/interspeech.2021-1690
Published:
2021
Journal:
Proceedings of InterSpeech 2021
Impact factor:
0
Authors:
Lilley, Jason; Bunnell, H. Timothy
Corresponding author:
Bunnell, H. Timothy

Other publications by H. Timothy Bunnell

Reliable prediction of childhood obesity using only routinely collected EHRs may be possible

DOI:
10.1016/j.obpill.2024.100128
Published:
2024-12-01
Journal:
Research article
Authors:
Mehak Gupta; Daniel Eckrich; H. Timothy Bunnell; Thao-Ly T. Phan; Rahmatollah Beheshti
Corresponding author:
Rahmatollah Beheshti