
言語情報とパラ言語情報を統合した音声の構造的表象の提案とその音声合成への応用

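Proposal of a structural representation of speech that integrates linguistic and paralinguistic information, and its application to speech synthesis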

Basic information

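Grant number:
19650036
Principal investigator:
峯松信明
Amount:
$21,100
Host institution:
The University of Tokyo
Host institution country:
Japan
Project category:
Grant-in-Aid for Exploratory Research
Fiscal year:
2007
Funding country:
Japan
Project period:
2007 to 2008
Project status:
Completed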

Project abstract

The information carried by speech falls broadly into three kinds: linguistic, paralinguistic, and non-linguistic. We have proposed a method that separates from speech only the acoustic features corresponding to non-linguistic information. Acoustic deformations of speech caused by age and gender, and those caused by recording and transmission equipment, can all be modeled mathematically as static mappings of the acoustic space. Representing and modeling speech with quantities that are invariant under such mappings therefore makes possible speech information processing that is invariant to static deformations (transformations). We have proved that f-divergence, a distance measure between distributions, is invariant under any transformation. On that basis we have proposed a method that treats every acoustic event in an utterance as a distribution, measures the distance between every pair of distributions (events), and represents the speech, in a speaker-invariant way, as a distance matrix. Because a distance matrix determines a geometric shape, we call this the structural representation of speech. Stripping away the non-linguistic information means that this representation carries only linguistic and paralinguistic information.

In this study we examined a framework that generates speech by restoring to the structural representation the non-linguistic information: the speaker's gender, age, and body size (that is, vocal tract shape). Linguistic and paralinguistic information is given as a structure, and the structure is converted into sound by supplying information about the length and shape of the vocal tract (non-linguistic information). Specifically, we adopted a method in which several already-realized sound events are given as initial conditions and, with the structural representation as the constraint, the subsequent sound events are located one after another in the acoustic space. If n events have already been located, a hyper-ellipse is drawn around each of these n events as its center, and the intersection of the n hyper-ellipses is the location of the next sound to be generated. We implemented this search problem on a computer and examined several speed-up algorithms, which made speech generation from a structure possible at a practical computational cost. This speech generation scheme differs greatly from conventional ones in that it derives sound starting from a speech representation (the structural representation) in which linguistic and paralinguistic information are mixed.
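The invariance claim in the abstract can be checked numerically on a toy case. The sketch below is illustrative only and is not the project's implementation. It assumes each acoustic event is a one-dimensional Gaussian and uses the Bhattacharyya distance as the distance measure; the abstract specifies neither. It also uses an invertible affine map x -> a*x + b as the static transformation. The names `bhattacharyya_1d`, `structure_matrix`, and `affine` are hypothetical.

```python
import math

def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between 1-D Gaussians N(m1, v1) and N(m2, v2)."""
    mean_term = (m1 - m2) ** 2 / (4.0 * (v1 + v2))
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term

def structure_matrix(events):
    """Pairwise distance matrix over a list of (mean, variance) events."""
    n = len(events)
    return [[bhattacharyya_1d(*events[i], *events[j]) for j in range(n)]
            for i in range(n)]

def affine(events, a, b):
    """Apply x -> a*x + b (a != 0) to every event distribution."""
    return [(a * m + b, a * a * v) for m, v in events]

# Assumed toy events: (mean, variance) pairs, not real speech data.
events = [(0.0, 1.0), (1.5, 0.5), (-2.0, 2.0), (3.0, 1.2)]

D_original = structure_matrix(events)
D_mapped = structure_matrix(affine(events, a=-1.7, b=4.0))

# Largest entry-wise difference between the two distance matrices.
max_diff = max(abs(x - y)
               for row1, row2 in zip(D_original, D_mapped)
               for x, y in zip(row1, row2))
print(max_diff < 1e-12)
```

The individual means and variances change under the map, but every pairwise distance stays the same up to rounding, so the distance matrix (the "structure") is unchanged.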
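The sequential localization step described in the abstract can be sketched in a simplified setting. This is a deliberately reduced stand-in and not the project's algorithm: events are plain points rather than distributions, the distance is Euclidean, and the hyper-ellipses become spheres, whose intersection can be found by a linear least-squares fit. The names `solve_linear`, `locate_next`, and `generate_from_structure`, and the toy coordinates, are hypothetical.

```python
import math

def solve_linear(M, y):
    """Solve M x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def locate_next(placed, dists):
    """Least-squares point whose distance to placed[i] is dists[i]."""
    p0, d0 = placed[0], dists[0]
    dim = len(p0)
    # Subtracting the first sphere equation from each of the others
    # cancels |x|^2 and leaves one linear equation per remaining point.
    rows = [[2.0 * (p[k] - p0[k]) for k in range(dim)] for p in placed[1:]]
    rhs = [sum(c * c for c in p) - sum(c * c for c in p0) - d * d + d0 * d0
           for p, d in zip(placed[1:], dists[1:])]
    # Normal equations (A^T A) x = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(dim)]
           for i in range(dim)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(dim)]
    return solve_linear(AtA, Atb)

def generate_from_structure(D, initial):
    """Place events one by one, constrained by the distance matrix D."""
    placed = [list(p) for p in initial]
    for k in range(len(placed), len(D)):
        placed.append(locate_next(placed, D[k][:len(placed)]))
    return placed

# Assumed toy data: five points in 2-D; the first three are the initial events.
truth = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.5], [-1.0, 0.5]]
D = [[math.dist(p, q) for q in truth] for p in truth]

recovered = generate_from_structure(D, truth[:3])
error = max(math.dist(p, q) for p, q in zip(recovered, truth))
print(error < 1e-9)
```

The sphere case reduces to a linear solve, which is why it fits in a short example. With distances between distributions the constraint surfaces are no longer spheres, which is consistent with the abstract's description of the problem as a search that needed speed-up algorithms.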