Speech Processing Based on Deep Gaussian Process With Stochastic Differential Equation Layers

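Basic Information

Grant number:
21K11955
Principal investigator:
郡山知樹 (Tomoki Koriyama)
Funding amount:
$26,600
Host institution:
CyberAgent, Inc. AI tech studio AI Lab (2022); The University of Tokyo (2021)
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (C)
Fiscal year:
2021
Funding country:
Japan
Project period:
2021-04-01 to 2024-03-31
Project status:
Completed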

Project Abstract

Speech information processing based on deep neural networks (DNNs), which has become mainstream in recent years, learns a large number of parameters from a large amount of speech data. However, speech is extremely diverse in language and dialect, speaker, speaking style, and surrounding environment, so recording every kind of speech is very difficult. As a result, there are many speech processing tasks for which a large number of parameters is unsuitable, such as one-shot speech synthesis, which generates the voice of a speaker for whom sufficient recordings cannot be prepared. The purpose of this research is therefore to investigate the effectiveness, in speech information processing and in speech synthesis in particular, of so-called infinite-layer deep learning, which uses a differential-equation representation of layers and can express complex functions even with a small number of parameters. This fiscal year, we demonstrated the effectiveness of convolutional layers in speech synthesis based on deep Gaussian processes (DGPs). This showed that layers with the same functionality as those in DNNs can also be realized in DGPs, which offer higher performance. In addition, as a preliminary step toward representing temporal continuity, we built a foundation for long-form speech synthesis. Specifically, we proposed a method that uses a pre-trained language model to predict pauses, which have a large perceptual impact in long texts, and succeeded in achieving more natural long-form speech synthesis. This result clarified the importance of appropriately stretching and compressing the time axes of text and speech, and provided guidelines for modeling in both the layer-depth direction and the time-axis direction.

Project Outputs

Journal papers (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

More differentiated pause insertion for phoneme-based multi-speaker TTS models

DOI:
Publication date:
2023
Journal:
Impact factor:
0
Authors:
Dong Yang; Tomoki Koriyama; Yuki Saito; Takaaki Saeki; Detai Xin; Hiroshi Saruwatari
Corresponding author:
Hiroshi Saruwatari

Duration-Aware Pause Insertion Using Pre-Trained Language Model for Multi-Speaker Text-To-Speech

DOI:
10.1109/icassp49357.2023.10096402
Publication date:
2023-02
Journal:
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Impact factor:
0
Authors:
D. Yang; Tomoki Koriyama; Yuki Saito; Takaaki Saeki; Detai Xin; H. Saruwatari
Corresponding author:
D. Yang; Tomoki Koriyama; Yuki Saito; Takaaki Saeki; Detai Xin; H. Saruwatari

Pause Prediction Using BERT-based Features for Long-form Text-to-speech Synthesis
