ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach

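Basic information

Grant number:
EP/S016260/1
Principal investigator:
Andrew Schofield
Amount:
$522,300
Host institution:
Aston University
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2019
Funding country:
United Kingdom
Start and end dates:
2019 to (no data)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=EP%2FS016260%2F1
Keywords:
ROSSINI Reconstructing 3D structure single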

Project abstract

Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the two eyes of the viewer. A common but expensive alternative is to film a single view, then use video artists to create the left and right eyes' views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content: 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has multiple uses, such as route planning for robots, obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.

Estimating 3D structure from a 2D image is difficult because in principle, the image could have been created from an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of any scene enables us to navigate through and interact with any environment.

Computer scientists have built machine vision systems for lifting to 3D by incorporating scene constraints. A popular technique is to train a deep neural network with a collection of 2D images and associated 3D range data. However, to be successful, this approach requires a very large dataset, which can be expensive to acquire. Furthermore, performance is only as good as the dataset is complete: if the system encounters a type of scene or geometry that does not conform to the training dataset, it will fail. Most methods have been trained for specific situations - e.g. indoor, or street scenes - and these systems are typically less effective for rural scenes and less flexible and robust than humans. Finally, such systems provide a single reconstructed output, without any measure of uncertainty. The user must assume that the 3D reconstruction is correct, which will be a costly assumption in many cases.

Computer systems are designed and evaluated based upon their accuracy with respect to the real world. However, the ultimate goal of lifting into 3D is not perfect accuracy - rather it is to deliver a 3D representation that provides a useful and compelling visual experience for a human observer, or to guide a robot whilst avoiding obstacles. Importantly, humans are expert at interacting with 3D environments, even though our perception can deviate substantially from true metric depth. This suggests that human-like representations are both achievable and sufficient, in any and all environments.

ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct (ii) do so for all types of scene and (iii) express the uncertainty inherent in each reconstruction. To this end we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from humans and existing ground truth datasets; the training algorithm selecting the most useful human tasks (i.e. judge depth within a particular image) to maximise learning. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of the depth estimates.

Project outputs

Journal articles (9)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

The Monocular Depth Estimation Challenge

DOI:
10.1109/wacvw58289.2023.00069
Publication date:
2022-11
Journal:
2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
Impact factor:
0
Authors:
Jaime Spencer;C. Qian;Chris Russell;Simon Hadfield;E. Graf;W. Adams;A. Schofield;J. Elder;R. Bowden;Heng Cong;S. Mattoccia;Matteo Poggi;Zeeshan Khan Suri;Yang Tang;Fabio Tosi;Hao Wang;Youming Zhang;Yusheng Zhang;Chaoqiang Zhao
Corresponding authors:
Jaime Spencer;C. Qian;Chris Russell;Simon Hadfield;E. Graf;W. Adams;A. Schofield;J. Elder;R. Bowden;Heng Cong;S. Mattoccia;Matteo Poggi;Zeeshan Khan Suri;Yang Tang;Fabio Tosi;Hao Wang;Youming Zhang;Yusheng Zhang;Chaoqiang Zhao

Surface Attitude Judgements in monocular and stereo textures: a method evaluation

DOI:
Publication date:
2021
Journal:
Impact factor:
0
Authors:
Qian CS
Corresponding authors:
Qian CS

What surprises the Mona Lisa? The relative importance of the eyes and eyebrows for detecting surprise in briefly presented face stimuli.

DOI:
10.1016/j.visres.2023.108275
Publication date:
2023
Journal:
Vision Research
Impact factor:
1.8
Authors:
Skog E
Corresponding authors:
Skog E

The Second Monocular Depth Estimation Challenge

DOI:
10.1109/cvprw59228.2023.00308
Publication date:
2023-04
Journal:
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Impact factor:
0
Authors:
Jaime Spencer;C. Qian;Michaela Trescakova;Chris Russell;Simon Hadfield;E. Graf;W. Adams;A. Schofield;J. Elder;R. Bowden;Ali Anwar;Hao Chen;Xiaozhi Chen;Kai Cheng;Yuchao Dai;Huynh Thai Hoa;Sadat Hossain;Jian-qiang Huang;Mohan Jing;Bo Li;Chao Li;Baojun Li;Zhiwen Liu;S. Mattoccia;Siegfried Mercelis;Myungwoo Nam;Matteo Poggi;Xiaohua Qi;Jiahui Ren;Yang Tang;Fabio Tosi;L. Trinh;S M Nadim Uddin;Khan Muhammad Umair;Kaixuan Wang;Yufei Wang;Yixing Wang;Mochu Xiang;Guangkai Xu;Wei Yin;Jun Yu;Qi Zhang;Chaoqiang Zhao
Corresponding authors:
Jaime Spencer;C. Qian;Michaela Trescakova;Chris Russell;Simon Hadfield;E. Graf;W. Adams;A. Schofield;J. Elder;R. Bowden;Ali Anwar;Hao Chen;Xiaozhi Chen;Kai Cheng;Yuchao Dai;Huynh Thai Hoa;Sadat Hossain;Jian-qiang Huang;Mohan Jing;Bo Li;Chao Li;Baojun Li;Zhiwen Liu;S. Mattoccia;Siegfried Mercelis;Myungwoo Nam;Matteo Poggi;Xiaohua Qi;Jiahui Ren;Yang Tang;Fabio Tosi;L. Trinh;S M Nadim Uddin;Khan Muhammad Umair;Kaixuan Wang;Yufei Wang;Yixing Wang;Mochu Xiang;Guangkai Xu;Wei Yin;Jun Yu;Qi Zhang;Chaoqiang Zhao

Surface Attitude Judgements in synthetic textures and real-world images: a method evaluation
