Development of a mutual conversion method between face image and voice during speech


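Basic information

Grant number:
22K12916
Principal investigator:
鈴木基之
Amount:
$23,300
Host institution:
Osaka Institute of Technology
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (C)
Fiscal year:
2022
Funding country:
Japan
Project period:
2022-04-01 to 2026-03-31
Project status:
Ongoing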

Project abstract

This fiscal year, to establish a method for generating speech from lip video, we examined how the type of input image affects performance and how robust the method is to differences between speakers. Research on estimating utterance content from lip video generally uses video cropped around the lips as input. Such images, however, also carry speaker-specific information such as skin color and lip size, so performance is likely to degrade, particularly for speakers who differ from those used to train the model. We therefore examined performance when the input image is simplified and speaker-specific information is removed.

From each lip image, 20 feature points were extracted along the lip contour. Performance was evaluated in two cases: feeding the coordinate values of the points directly as input, and connecting the feature points with straight lines so that the lips are represented as a simple figure before input. The neural network structure and the speech features used for speech generation were the same as in the model examined before this project began. For evaluation we used STOI (Short-Time Objective Intelligibility measure), one of the indices that measure the intelligibility of degraded speech.

When the model was trained and evaluated on utterance data from one speaker, STOI was 0.496 with lip video as input, compared with 0.441 for the coordinate values and 0.431 for the simple figure representation, so performance degraded. The likely cause is that simplifying the input data also discards information needed for speech generation.

When the model was trained on utterance data from three speakers and evaluated separately on speakers used in training (known speakers) and speakers not used in training (unknown speakers), performance with lip video was about 24% worse for unknown speakers than for known speakers. With feature point coordinates or the simple figure representation as input, the degradation was limited to about 17%, indicating greater robustness to speaker differences.
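The two simplified inputs described in the abstract can be illustrated with a minimal sketch for a single video frame. This is not the project's implementation: the function names (`coords_to_vector`, `draw_contour`, `synthetic_lip_points`), the 64x64 canvas, the ellipse used as stand-in data, and the choice to close the outline by joining the last point back to the first are all assumptions made for illustration. The abstract only states that 20 contour points are extracted and either used as coordinates or connected with straight lines.

```python
import numpy as np


def synthetic_lip_points(n=20, cx=32.0, cy=32.0, rx=24.0, ry=10.0):
    """Stand-in data: n points on an ellipse, as (x, y) pixel coordinates.

    A real system would obtain these from a facial landmark detector.
    """
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    xs = cx + rx * np.cos(angles)
    ys = cy + ry * np.sin(angles)
    return np.stack([xs, ys], axis=1)


def coords_to_vector(points):
    """Input variant 1: flatten (N, 2) coordinates into a 2N-dimensional vector."""
    pts = np.asarray(points, dtype=np.float32)
    return pts.reshape(-1)


def draw_contour(points, height, width):
    """Input variant 2: join consecutive points with straight lines.

    Returns a binary (height, width) image. Assumes the points are ordered
    along one outline and that the outline is closed (an assumption, the
    abstract does not say how the points are connected).
    """
    pts = np.asarray(points, dtype=np.float64)
    canvas = np.zeros((height, width), dtype=np.uint8)
    n = len(pts)
    for i in range(n):
        x0, y0 = pts[i]
        x1, y1 = pts[(i + 1) % n]  # wrap around to close the outline
        # Sample densely enough that adjacent samples are at most 1 px apart.
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, steps + 1)).astype(int)
        ys = np.rint(np.linspace(y0, y1, steps + 1)).astype(int)
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[inside], xs[inside]] = 1
    return canvas


points = synthetic_lip_points()
coord_input = coords_to_vector(points)      # 40 values per frame
figure_input = draw_contour(points, 64, 64)  # 64x64 binary image per frame
```

For a video, either function would be applied frame by frame and the results stacked along a time axis before being passed to the speech generation network. Both variants drop texture and color, which matches the abstract's motivation, but the raw coordinates still encode lip size and position unless they are normalized separately.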

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

唇動画像からの音声生成法における入力特徴量の単純化に関する検討

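A study on simplifying input features in a method for generating speech from lip video

DOI:
Publication year:
2023
Journal:
日本音響学会音声研究会資料
Impact factor:
0
Authors:
金澤尚希;鈴木基之
Corresponding author:
鈴木基之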

Other publications by 鈴木基之

Spotify音楽データを用いたユーザの感情に基づく音楽推薦手法の提案

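Proposal of a music recommendation method based on user emotion using Spotify music data

DOI:
Publication year:
2023
Journal:
Impact factor:
0
Authors:
Yukonhiatou Chaxiong;Yoshihisa Tomoki;Kawakami Tomoya;Teranishi Yuuichi;Shimojo Shinji;撫佐昭裕;鈴木基之;鈴木基之;曽田円香，志風美雨，辻愛美紗，中野美由紀
Corresponding author:
曽田円香，志風美雨，辻愛美紗，中野美由紀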