VocaliD SBIR Phase II: Optimized Speech Corpora for Personalized Speech Synthesis

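Basic Information

Grant number:
9408604
Principal investigator:
RUPAL PATEL
Amount:
$608,300
Host institution:
VOCALID, INC.
Host institution country:
United States
Project category:
Fiscal year:
2015
Funding country:
United States
Project period:
2015-06-01 to 2019-06-30
Project status:
Completed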

Source:
https://reporter.nih.gov/project-details/9408604
Keywords:
Acoustics Address Adoption Age Augmentative and Alternative Communication Award Collection Communication Complex Computer Simulation Computer software Computers Custom DNA Data Data Collection Data Set Development Dreams Ensure Environment Ethnic Origin Family member Gender Generations Generic Drugs Human Impairment Individual Length Limb Prosthesis Linguistics Manuals Mediating Methods Modeling Noise Outcome Output Penetration Personality Phase Process Prosthesis Protocols documentation Reading Residual state Safety Sampling Seeds Series Signal Transduction Small Business Innovation Research Grant Speech Speech Intelligibility Technology Text Time Training Transcript Variant Voice Voice Quality base communication device cost cost effective crowdsourcing design experimental study girls improved literacy man model building phrases social integration socioeconomics sound speech processing success vocalization

Project Abstract

Our voices are not identical; they are our identities. The human voice is a powerful signal that conveys one's age, gender, size, ethnicity, and personality, among other attributes. Yet, until now, users of augmentative and alternative communication (AAC) devices, screen-reading technologies, and other text-to-speech (TTS) applications have relied on a limited set of mass-produced, generic-sounding synthetic voices. This mismatch in vocal identity impacts educational outcomes, infringes on personal safety, and hinders social integration. Conventional methods for building a synthetic voice require a voice actor to record an extensive dataset of studio-quality recordings, which are used to train a computational model and generate the output voice. The process is time- and labor-intensive and thus inaccessible to everyday consumers, let alone those with speech impairment. VocaliD Inc.'s award-winning technology offers an unprecedented means to build custom-crafted synthetic voices that reflect the recipient by combining his/her own residual vocalizations with recordings of a matched speaker from our Human Voicebank. We have discovered that even a single vowel contains enough "vocal DNA" to seed the personalization process. VocaliD's custom voice sounds like the recipient in age, personality, and vocal identity but is as clear and understandable as the donor's recordings. To create an affordable and efficient method of voice personalization, we leverage the penetration of high-quality microphones and recording software on consumer-grade computers and increased technological literacy to crowdsource the collection of speech and voice recordings. This enables engagement across broad age, socioeconomic, cultural, and linguistic groups in order to truly sample the diversity of the human voice. The challenges, however, are to ensure high-quality recordings and to sufficiently engage speech donors to complete the recording corpus.
This Phase II project builds upon our success in Phase I to reduce the length of the donor corpus and to streamline and automate the recipient protocol. Results of our perceptual experiments indicated that while we were able to reduce the length of the donor corpus by 70%, it came at the cost of reduced intelligibility and naturalness. Since voice quality is vital to acceptance and adoption of our voices, this Phase II proposal is aimed at improving the clarity and expressiveness of our voices while maintaining the optimized corpus length. We propose to improve TTS intelligibility by developing methods to mitigate the effects of background noise and reverberation during donor and recipient recordings and aligning expected and actual spoken transcripts to reduce errors in TTS model building (Aim 1). To address the issue of TTS naturalness, we propose to modify the donor corpus to include more prosodically diverse contrasts and adapt the donor protocol to elicit natural melodic intonation and phrasing (Aim 2). These advances will yield a scalable and cost-effective method of personalized voice creation that will humanize speech-enabled technologies for AAC and beyond.
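
As an illustration of the transcript-alignment idea in Aim 1, the following is a minimal sketch, not VocaliD's actual method. It assumes that the prompt text shown to a speaker and a transcript of what was actually spoken (for example, from a speech recognizer) are both available as plain strings. The function name `align_transcripts` is hypothetical, and the word-level comparison uses Python's standard `difflib` module rather than any technique described in the abstract.

```python
from difflib import SequenceMatcher


def align_transcripts(expected: str, actual: str):
    """Return word-level differences between prompt text and spoken text.

    Hypothetical helper: both inputs are assumed to be plain strings.
    Each result is (operation, expected_words, actual_words), where
    operation is 'replace', 'delete', or 'insert'.
    """
    exp_words = expected.lower().split()
    act_words = actual.lower().split()
    matcher = SequenceMatcher(a=exp_words, b=act_words, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, exp_words[i1:i2], act_words[j1:j2]))
    return mismatches


# A skipped word shows up as a 'delete'; a misread word as a 'replace'.
skipped = align_transcripts("the quick brown fox", "the quick fox")
misread = align_transcripts("she sells sea shells", "she sells see shells")
print(skipped)
print(misread)
```

A recording whose mismatch list is non-empty could be flagged for re-recording or have its transcript corrected before model building. Which of those is appropriate is a design choice the abstract does not specify.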
