Audio-Visual Speech Enhancement and Speaker Separation

Basic information

Grant number:
2243852
Principal investigator:
Amount:
--
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2019
Funding country:
United Kingdom
Start and end dates:
2019 to (no data)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=studentship-2243852
Keywords:
Audio Visual Speech Enhancement Speaker

Project abstract

The central difficulty in audio perception is that individual sounds are mixed together with unknown acoustic reverberations, which makes it impossible to extract them without prior knowledge of the source characteristics. Audio-source separation is therefore a fundamental problem in audio perception. Humans are able to understand speech even when it is mixed with other types of sound and noise, by isolating one voice from a multitude and focusing attention on it. This research aims to reproduce or model this accomplishment of the brain by computational and algorithmic means. Speech enhancement increases speech intelligibility by using algorithms to separate the original speech source from other sources and enhance it. Automating speech enhancement has many real-world applications, such as making assistive technology for the hearing impaired more effective, creating virtual reality with high clarity, and improving the transcription of speech in noisy audio tracks. Additionally, with the ever-increasing use of audio-visual and voice-controlled technologies, the ability to capture and enhance a speaker's voice is becoming essential to the robustness of automatic speech recognition (ASR) systems. These systems tend to infer speech well in quiet environments, but they struggle when background noise is present.
Although deep learning methods have recently brought significant advances in speech separation, it is still considered a difficult problem because of time-variant input signals and the high variability of reverberant sound fields. Traditionally, speech enhancement is performed either on audio-only tracks or on a combination of audio and video inputs. Deep learning techniques have been applied to challenging tasks such as removing background noise from speech, separating a speaker from multiple speech signals, or, more generally, separating arbitrary classes of sound from each other.
This work will address the shortcomings of current methods and will explore conditioning speech separation on complementary information, such as visual cues from the speaker's lip motions.
