Immersive Audio-Visual 3D Scene Reproduction Using a Single 360 Camera

Basic Information

  • Grant number:
    EP/V03538X/1
  • Principal investigator:
  • Amount:
    $340,800
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2021
  • Funding country:
    United Kingdom
  • Duration:
    2021 to (no data)
  • Project status:
    Completed

Project Summary

The COVID-19 pandemic has changed our lifestyles and created high demand for remote communication and remote experiences. Many organisations have had to set up remote working systems based on video conferencing platforms. However, current video conferencing systems do not meet basic requirements for remote collaboration because they lack eye contact, gaze awareness and spatial audio synchronisation. Reproducing a real space as an audio-visual 3D model allows users to remotely experience real-time interaction in real environments, so it can be widely used in applications such as healthcare, teleconferencing, education and entertainment. The goal of this project is to develop a simple and practical solution for estimating the geometric structure and acoustic properties of general scenes, allowing spatial audio to be adapted to the environment and listener location so that an immersive rendering of the scene improves the user experience.
Existing 3D scene reproduction systems have two problems. (i) Audio and vision systems have been researched separately. Computer vision research has mainly focused on improving the visual side of scene reconstruction, yet in an immersive display such as a VR system the experience is not perceived as "realistic" if the sound does not match the visual cues. Audio research, on the other hand, has relied only on audio sensors to measure acoustic properties, without considering the complementary information provided by visual sensors. (ii) Current capture and recording systems for 3D scene reproduction require a setup that is too invasive, and a process too specialised, for ordinary users to deploy in their own private spaces: a LiDAR sensor is expensive and requires a long scanning time, and perspective images require a large number of photos to cover the whole scene.
The objective of this research is to develop an end-to-end audio-visual 3D scene reproduction pipeline using a single shot from a consumer 360 (panoramic) camera. To make the system easily accessible to ordinary users in their own private spaces, the back-end should include an automatic solution based on computer vision and artificial intelligence algorithms. A deep neural network (DNN) jointly trained for semantic scene reconstruction and acoustic property prediction of the captured environment will be developed; this includes inferring regions that are not visible to the camera. Impulse responses (IRs), which characterise the acoustic attributes of an environment, make it possible to reproduce the acoustics of the space with any sound source. They also allow the original (dry) sound to be extracted from a recording by removing the acoustic effects, so that the source can be re-rendered in new environments with different acoustics. A simple and efficient method for estimating acoustic IRs from a single captured 360 photo will be investigated.
This semantic scene data will be used to provide an immersive audio-visual experience to users. Two display scenarios will be considered: a personalised display system, such as a VR headset with headphones, and a communal display system (e.g., a TV or projector) with loudspeakers. Real-time 3D human pose tracking using a single 360 camera will be developed to render the 3D audio-visual scene accurately at each user's location. Delivering binaural sound to listeners through loudspeakers is a challenging task, so audio beamforming techniques aligned with human pose tracking for multiple loudspeakers will be investigated in collaboration with the project partners in audio processing.
The resulting system would have a significant impact on innovation in VR and multimedia systems, and open up new and interesting applications for their deployment. This award should provide the foundation for the PI to establish and lead a group with a unique research direction that is aligned with national priorities and addresses a major long-term research challenge.
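The abstract's point that an impulse response characterises a space's acoustics can be made concrete: rendering a dry source in a room amounts to convolving it with the room's IR. Below is a minimal NumPy/SciPy sketch of that idea with synthetic data; the sample rate, test tone and toy decaying-noise IR are invented for illustration and are not part of the project's pipeline.

```python
# Minimal sketch: re-rendering a dry (anechoic) source in a room by
# convolving it with the room's impulse response (IR).
# All data below is synthetic; a real IR would be measured or, as this
# project proposes, estimated from a single 360 photo.
import numpy as np
from scipy.signal import fftconvolve

def render_in_room(dry: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Apply a room's acoustics to a dry signal via convolution with its IR."""
    wet = fftconvolve(dry, ir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # normalise to avoid clipping

fs = 48_000                                  # sample rate in Hz
t = np.arange(fs) / fs                       # 1 second of samples
dry = np.sin(2 * np.pi * 440.0 * t)          # dry 440 Hz test tone
# Toy IR: exponentially decaying noise, roughly mimicking room reverberation.
ir = np.random.randn(fs // 2) * np.exp(-6.0 * np.arange(fs // 2) / fs)
wet = render_in_room(dry, ir)                # the tone as heard in the "room"
```

Recovering the dry source from a recording (the de-reverberation direction mentioned in the abstract) is the inverse problem and is considerably harder than this forward convolution.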

Project Outcomes

Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Computer Vision, Imaging and Computer Graphics Theory and Applications - 17th International Joint Conference, VISIGRAPP 2022, Virtual Event, February 6-8, 2022, Revised Selected Papers
  • DOI:
    10.1007/978-3-031-45725-8_4
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Heng Y
  • Corresponding author:
    Heng Y
Material Recognition for Immersive Interactions in Virtual/Augmented Reality
  • DOI:
    10.1109/vrw58643.2023.00131
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Heng Y
  • Corresponding author:
    Heng Y
CAM-SegNet: A Context-Aware Dense Material Segmentation Network for Sparsely Labelled Datasets
  • DOI:
    10.5220/0010853200003124
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yuwen Heng;Yihong Wu;S. Dasmahapatra;Hansung Kim
  • Corresponding author:
    Yuwen Heng;Yihong Wu;S. Dasmahapatra;Hansung Kim
DBAT: Dynamic Backward Attention Transformer for Material Segmentation with Cross-Resolution Patches
  • DOI:
    10.48550/arxiv.2305.03919
  • Publication date:
    2023-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yuwen Heng;S. Dasmahapatra;Hansung Kim
  • Corresponding author:
    Yuwen Heng;S. Dasmahapatra;Hansung Kim
Spatial Audio Reconstruction for VR Applications Using a Combined Method Based on SIRR and RSAO Approaches
  • DOI:
    10.1109/mmsp59012.2023.10337683
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Alinaghi A
  • Corresponding author:
    Alinaghi A

Similar Overseas Grants

EduSay™ - developing a digital, audio-visual and kinesthetic English pronunciation training programme for international students and professionals; upskilling communications for education, employability, UK productivity and integration
  • Grant number:
    10063001
  • Fiscal year:
    2023
  • Funding amount:
    $340,800
  • Project type:
    Collaborative R&D
Empowering Archivists: Applying New Tools and Approaches for Better Representation of Women in Audio-Visual Collections
  • Grant number:
    AH/Y007328/1
  • Fiscal year:
    2023
  • Funding amount:
    $340,800
  • Project type:
    Research Grant
User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild
  • Grant number:
    23K16912
  • Fiscal year:
    2023
  • Funding amount:
    $340,800
  • Project type:
    Grant-in-Aid for Early-Career Scientists
Audio-visual poetics for the environmental pollutions: A research on the documentaries and expressions of "Kogai" films
  • Grant number:
    22H00613
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Using eye tracking to examine audio-visual rhythm perception in infants
  • Grant number:
    572614-2022
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    University Undergraduate Student Research Awards
Emotional McGurk: Developing a novel tool to examine audio-visual integration of affective signals
  • Grant number:
    574638-2022
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    University Undergraduate Student Research Awards
Neural Rendering of object-based audio-visual scenes
  • Grant number:
    2644080
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    Studentship
Ghosts amongst us: an audio-visual exploration of haunting in Palestine
  • Grant number:
    2733997
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    Studentship
Audio-visual object-based dynamic scene representation from monocular video
  • Grant number:
    2701695
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    Studentship
Towards in-vehicle situation awareness using visual and audio sensors
  • Grant number:
    LP210200931
  • Fiscal year:
    2022
  • Funding amount:
    $340,800
  • Project type:
    Linkage Projects