Explore new approaches to distant microphone speech recognition that combine information across multiple microphone array devices

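Basic information

Grant number:
2112956
Principal investigator:
Funding amount:
--
Host institution:
University of Sheffield
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2018
Funding country:
United Kingdom
Start and end dates:
2018 to (no data)
Project status:
Completed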

Source:
https://gtr.ukri.org/projects?ref=studentship-2112956
Keywords:
Explore new approaches distant microphone

Project abstract

It is becoming common for speech to be used to communicate with digital devices. In the last few years, devices such as Google Home and Amazon Alexa have arrived in millions of homes. Getting speech recognition to work well in home environments is very challenging. The home is often a very noisy place: for example, if the device is placed in the kitchen, the washing machine may be running and people could be talking in the background. Also, the person speaking is often several metres away from the device (the 'distant microphone' scenario). This is a problem because the speech signal may easily be dominated by other sound sources which may be closer to the microphones.

This project will develop novel solutions to the distant microphone speech recognition problem. It will be conducted within the Speech and Hearing Research Group under the supervision of Prof. Jon Barker. It will take advantage of a new data set ('CHiME-5') that has been acquired by Prof. Barker's research team with support from Google (http://spandh.dcs.shef.ac.uk/chime_challenge/). CHiME-5 is a set of recordings of parties taking place in real homes. The data is captured with multiple recording devices, each of which captures video and four synchronised microphone channels. This unique data provides an opportunity to address new research questions lying outside the scope of current speech technology.

Research questions

Two key research directions will be prioritised:

Visually-driven beamforming algorithms: The most successful approach to distant microphone speech recognition is to use multiple microphones and apply techniques that enhance the signals coming from some directions while suppressing the signals coming from others. This requires detecting and tracking which directions are important. The project will look at how this information might be extracted from the video signal (e.g., using person tracking techniques).

Speech recognition with multiple microphone arrays: The 'beamforming' described above requires synchronised microphones with known positions with respect to each other. It therefore cannot be easily applied across multiple devices whose relative location is uncertain (e.g., combining the outputs of two Google Homes in the same room). The CHiME-5 data has up to six devices within the same acoustic area and therefore provides a unique opportunity to find new solutions to this problem. A starting place would be to explore techniques for weighting and fusing the outputs of independent recognition systems.

Methodology

Speech recognition systems have evolved into hugely complex pieces of software. Fortunately, speech research has been effectively open-sourced, with the community now focused around the Kaldi speech recognition toolkit. The CHiME-5 data set will be published with an open-source Kaldi 'baseline' that will represent a state-of-the-art single-device, audio-only system. It will also provide a set of 'rules' for training systems that allows fair comparison between research groups. This will provide a robust reference against which to compare the performance of audio-visual and multi-device extensions.

The research will require a mixture of methods to be employed: video face and person tracking and beamforming algorithms, speech recognition fusion strategies, and signal quality assessment techniques. In addition, it will be necessary to have a fuller understanding of the state-of-the-art techniques employed in the baseline recogniser, including convolutional neural networks, i-vector analysis, speaker-adaptive training, neural network language modelling, etc. Fortunately, there are many excellent textbooks, tutorial papers and review papers that cover these areas.

CHiME-5 is a complex 'conversational' speech recognition task. Training and testing the recognition systems will be computationally demanding. Modern speech recognisers use 'deep learning', which requires specialist GPU hardware.
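
As an illustration of the beamforming idea described in the abstract (enhancing sound from one direction by combining synchronised microphones), the following is a minimal delay-and-sum sketch in Python with NumPy. It is not the project's method or the CHiME-5 baseline. It assumes the per-microphone arrival delays are already known and are whole numbers of samples, and it uses a synthetic sine wave as the target. All function names (delay_and_sum, simulate, snr_db) and parameter values are hypothetical and chosen only for this example.

```python
import numpy as np


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: array of shape (num_mics, num_samples)
    delays:   arrival delay of the target at each microphone, in samples
              (assumed known here; in practice it has to be estimated)
    """
    channels = np.asarray(channels, dtype=float)
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for signal, delay in zip(channels, delays):
        # Advance the channel by its delay so the target lines up across mics.
        aligned = np.zeros(num_samples)
        aligned[: num_samples - delay] = signal[delay:]
        out += aligned
    return out / num_mics


def simulate(rng, num_samples=4000, delays=(0, 3, 6, 9), noise_std=1.0):
    """Toy 4-microphone recording: one delayed target plus independent noise."""
    target = np.sin(2 * np.pi * 0.01 * np.arange(num_samples))
    channels = []
    for delay in delays:
        delayed = np.zeros(num_samples)
        delayed[delay:] = target[: num_samples - delay]
        channels.append(delayed + noise_std * rng.standard_normal(num_samples))
    return target, np.array(channels), list(delays)


def snr_db(reference, estimate):
    """Signal-to-noise ratio of `estimate` against a clean `reference`, in dB."""
    noise = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))


rng = np.random.default_rng(0)
target, channels, delays = simulate(rng)
enhanced = delay_and_sum(channels, delays)
single_snr = snr_db(target, channels[0])
enhanced_snr = snr_db(target, enhanced)
print(f"single microphone SNR: {single_snr:.1f} dB")
print(f"delay-and-sum SNR:     {enhanced_snr:.1f} dB")
```

Because the target adds coherently after alignment while the independent noise does not, averaging the aligned channels raises the signal-to-noise ratio relative to any single microphone. The sketch also shows why known, synchronised delays matter: with wrong delays the target would no longer line up, which is the difficulty the abstract raises for devices whose relative location is uncertain.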
