Deep neural networks for multi-channel speaker localization and speech separation

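Basic Information

Award number:
1808932
Principal investigator:
DeLiang Wang
Amount:
$300,000
Host institution:
Ohio State University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2018
Funding country:
United States
Project period:
2018-12-01 to 2022-11-30
Project status:
Completed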

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1808932&HistoricalAwards=false
Keywords:
Deep neural networks multi channel

Project Abstract

In recent years, there has been a dramatic increase in the deployment of voice-based interfaces for human-machine communication. Such devices typically have multiple microphones (or channels), and because they are used in homes, cars, and similar settings, a major technical challenge is how to reliably localize a target speaker and recognize his or her speech in everyday environments with multiple sound sources and room reverberation. The performance of traditional approaches to localization and separation degrades significantly in the presence of interfering sounds and room reverberation. This project investigates multi-channel speaker localization and speech separation from a deep learning perspective. The innovative approach in this project is to train deep neural networks to perform single-channel speech separation in order to identify the time-frequency regions dominated by the target speaker. Such regions across microphone pairs provide the basis for robust speaker localization and separation.

Building on this perspective, the proposed research seeks to achieve robust speaker localization and speech separation. For robust speaker localization, time-frequency (T-F) masks will be generated by deep neural networks (DNNs) from single-channel noisy speech signals. For each pair of microphones, an integrated mask will be calculated from the two corresponding single-channel masks and then used to weight a generalized cross-correlation function, from which the direction of the target speaker will be estimated. An alternative method for localization will be based on mask-weighted steered responses. For robust speech separation, masking-based beamforming will be performed first, where T-F masking and accurate speaker localization are expected to enhance beamforming results substantially. To overcome the limitation of spatial filtering in multi-source reverberant conditions, spectral (monaural) and spatial information will be integrated as DNN input features in order to separate only the target signal that has speech characteristics and originates from a specific direction. The proposed approach will be evaluated using automatic speech recognition rate, as well as localization and separation accuracy, on multi-channel noisy and reverberant datasets recorded in real-world environments. This will ensure a broader impact not only in advancing speech processing technology but also, in the long run, in facilitating the design of next-generation hearing aids.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
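
The mask-weighted localization step described in the abstract can be illustrated with a small sketch. This is not the project's implementation. It is a toy example under stated assumptions: the two single-channel masks are combined by elementwise product (the abstract does not specify the combination rule), an oracle binary mask stands in for the DNN output, the two-microphone spectrograms are synthetic, and all names, the 16 kHz sample rate, and the 0.2 m microphone spacing are hypothetical.

```python
import cmath
import math
import random

SAMPLE_RATE = 16000                             # Hz, illustrative value
FREQS_HZ = [k * 62.5 for k in range(1, 129)]    # 62.5 Hz to 8 kHz
CANDIDATE_LAGS = list(range(-8, 9))             # candidate delays, in samples


def integrated_mask(mask_a, mask_b):
    """Combine two single-channel T-F masks (assumed rule: elementwise product)."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


def weighted_gcc_phat(spec_a, spec_b, weights):
    """Score every candidate lag with a T-F-weighted GCC-PHAT sum."""
    scores = []
    for lag in CANDIDATE_LAGS:
        tau = lag / SAMPLE_RATE
        total = 0.0
        for frame_a, frame_b, frame_w in zip(spec_a, spec_b, weights):
            for xa, xb, w, f in zip(frame_a, frame_b, frame_w, FREQS_HZ):
                cross = xa * xb.conjugate()
                mag = abs(cross)
                if w == 0.0 or mag < 1e-12:
                    continue
                # PHAT keeps only the phase; the steering term cancels a delay of tau.
                total += w * (cross / mag * cmath.exp(-2j * math.pi * f * tau)).real
        scores.append(total)
    return scores


def best_lag(scores):
    """Return the candidate lag with the highest score."""
    return CANDIDATE_LAGS[scores.index(max(scores))]


def lag_to_angle_deg(lag, mic_distance_m=0.2, speed_of_sound=343.0):
    """Far-field angle from broadside for a lag in samples (hypothetical spacing)."""
    ratio = speed_of_sound * lag / SAMPLE_RATE / mic_distance_m
    return math.degrees(math.asin(max(-1.0, min(1.0, ratio))))


def simulate_pair(target_lag, interferer_lag, n_frames=20, seed=0):
    """Synthetic two-microphone STFTs plus an oracle mask standing in for a DNN."""
    rng = random.Random(seed)
    spec_a, spec_b, mask = [], [], []
    for _ in range(n_frames):
        row_a, row_b, row_m = [], [], []
        for f in FREQS_HZ:
            # The target dominates only 30% of the T-F units in this toy mixture.
            target_dominant = rng.random() < 0.3
            s_amp, n_amp = (1.0, 0.1) if target_dominant else (0.1, 1.0)
            s = cmath.rect(s_amp, rng.uniform(-math.pi, math.pi))
            n = cmath.rect(n_amp, rng.uniform(-math.pi, math.pi))
            delay_s = cmath.exp(-2j * math.pi * f * target_lag / SAMPLE_RATE)
            delay_n = cmath.exp(-2j * math.pi * f * interferer_lag / SAMPLE_RATE)
            row_a.append(s + n)
            row_b.append(s * delay_s + n * delay_n)
            row_m.append(1.0 if target_dominant else 0.0)
        spec_a.append(row_a)
        spec_b.append(row_b)
        mask.append(row_m)
    return spec_a, spec_b, mask


spec_a, spec_b, mask = simulate_pair(target_lag=3, interferer_lag=-4)
# Source magnitudes are identical at both microphones here, so one oracle
# mask serves as the single-channel mask for each channel.
pair_mask = integrated_mask(mask, mask)
all_ones = [[1.0] * len(FREQS_HZ) for _ in spec_a]

plain_lag = best_lag(weighted_gcc_phat(spec_a, spec_b, all_ones))
masked_lag = best_lag(weighted_gcc_phat(spec_a, spec_b, pair_mask))
angle = lag_to_angle_deg(masked_lag)
print("unweighted lag:", plain_lag)
print("mask-weighted lag:", masked_lag, "angle (deg):", round(angle, 1))
```

In this toy mixture the interferer dominates most T-F units, so the unweighted sum is pulled toward the interferer's delay, while the mask restricts the sum to target-dominated units. A real system would replace the oracle mask with DNN estimates and the synthetic spectrograms with STFTs of recorded signals.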

Project Outcomes

Journal papers (14)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement

DOI:
10.1109/taslp.2020.2986896
Publication date:
2020
Journal:
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Impact factor:
0
Authors:
H. Taherian;Zhong-Qiu Wang;Jorge Chang;Deliang Wang
Corresponding authors:
H. Taherian;Zhong-Qiu Wang;Jorge Chang;Deliang Wang

Location-based training for multi-channel talker-independent speaker separation

DOI:
Publication date:
2022
Journal:
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing
Impact factor:
0
Authors:
Taherian, H.;Tan, K.;Wang, D.L.
Corresponding author:
Wang, D.L.

Localization based Sequential Grouping for Continuous Speech Separation

DOI:
10.1109/icassp43922.2022.9746896
Publication date:
2021-07
Journal:
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Impact factor:
0
Authors:
Zhong-Qiu Wang;Deliang Wang
Corresponding authors:
Zhong-Qiu Wang;Deliang Wang

Neural Cascade Architecture for Multi-Channel Acoustic Echo Suppression

DOI:
10.1109/taslp.2022.3192104
Publication date:
2022
Journal:
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Impact factor:
0
Authors:
H. Zhang;Deliang Wang
Corresponding authors:
H. Zhang;Deliang Wang

Count and separate: incorporating speaker counting for continuous speaker separation

DOI:
10.1109/icassp39728.2021.9414677
Publication date:
2021
Journal:
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing
Impact factor:
0
Authors:
Wang, Z.-Q.;Wang, D.L.
Corresponding author:
Wang, D.L.