
End-to-End音声合成とEnd-to-End音声認識の統合システム

Integrated System of End-to-End Speech Synthesis and End-to-End Speech Recognition

Basic information

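Grant number:
19J21031
Principal investigator:
上乃聖
Amount:
$19,800
Host institution:
Kyoto University
Host institution country:
Japan
Project category:
Grant-in-Aid for JSPS Fellows
Fiscal year:
2019
Funding country:
Japan
Project period:
2019-04-25 to 2022-03-31
Project status:
Completed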

Source:
https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-19J21031/
Keywords:
speech recognition; speech synthesis

Project abstract

The goal of this research is to integrate end-to-end speech synthesis and end-to-end speech recognition so that, even when only text is available for the target task or domain, paired text-speech data can be constructed and the system can be trained jointly. This fiscal year, we studied how to construct a representation that allows speech recognition and speech synthesis to be integrated efficiently with little degradation in speech recognition performance. One cause of this degradation is the gap between speech actually spoken by humans (natural speech) and speech generated by a speech synthesis system (synthesized speech). Speech synthesis usually proceeds in two stages: a first model predicts, from the text, the frequency-spectrum features needed to produce an audible speech waveform, and a second model converts those features into the waveform. Frequency-spectrum features are also used as training data for speech recognition, so the generated waveform is converted back into frequency-spectrum features before it is used for speech recognition. The waveform-generation model helps close the gap between natural and synthesized speech, but generating waveforms is very time-consuming. This year we therefore built a network that closes the gap directly on the frequency-spectrum features, without using a waveform-generation model. The proposed method uses not only the generated frequency-spectrum features but also the phoneme sequence of the utterance, which is available in the speech synthesis task. Evaluation experiments showed that the proposed method augments speech recognition more effectively than converting to waveforms while requiring less processing time, and that using the phoneme sequence information is also important for the improvement.
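
As a rough illustration of the feature-domain idea in the abstract, the sketch below shows only the data flow: synthesized spectral frames are combined with phoneme information and passed to an enhancer, and no waveform is generated. Everything here is an assumption for illustration, not the project's implementation. The abstract does not say how the phoneme sequence is fed to the network, so frame-aligned phoneme ids appended as one-hot vectors are a guess, and the names `build_enhancer_input`, `enhance_in_feature_domain`, and `identity_enhancer` are hypothetical. The enhancer is a toy stand-in for a trained network.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one spectral feature vector (e.g. one spectrogram frame)


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"phoneme id {index} is outside [0, {size})")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_enhancer_input(
    frames: Sequence[Frame],
    phoneme_ids: Sequence[int],
    num_phonemes: int,
) -> List[Frame]:
    """Append a one-hot phoneme vector to every synthesized frame.

    Assumption: `phoneme_ids` is already aligned to frames (one id per frame).
    """
    if len(frames) != len(phoneme_ids):
        raise ValueError("need exactly one phoneme id per frame")
    return [
        list(frame) + one_hot(pid, num_phonemes)
        for frame, pid in zip(frames, phoneme_ids)
    ]


def enhance_in_feature_domain(
    frames: Sequence[Frame],
    phoneme_ids: Sequence[int],
    num_phonemes: int,
    enhancer: Callable[[Frame], Frame],
) -> List[Frame]:
    """Pass phoneme-conditioned frames through the enhancer, skipping the vocoder."""
    conditioned = build_enhancer_input(frames, phoneme_ids, num_phonemes)
    return [enhancer(x) for x in conditioned]


def identity_enhancer(feature_dim: int) -> Callable[[Frame], Frame]:
    """Toy stand-in for a trained network: returns the spectral part unchanged."""
    return lambda x: x[:feature_dim]


synthesized = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 frames, 3 features each
phonemes = [0, 2]                                  # hypothetical phoneme ids
enhanced = enhance_in_feature_domain(
    synthesized, phonemes, num_phonemes=4, enhancer=identity_enhancer(3)
)
```

In a real system the enhancer would be a trained model whose output frames are used directly as speech recognition training features. The point of the sketch is that this path avoids the round trip from features to waveform and back to features.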