
BIGDATA: F: Collaborative Research: From Visual Data to Visual Understanding


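Basic information

Award number:
1903222
Principal investigator:
Jia Deng
Amount:
about $199,100
Awardee institution:
Princeton University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2018
Funding country:
United States
Project period:
2018-09-01 to 2020-08-31
Project status:
Completed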

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1903222&HistoricalAwards=false
Keywords:
BIGDATA Collaborative Research Visual Data

Abstract

The field of visual recognition, which focuses on creating computer algorithms for automatically understanding photographs and videos, has made tremendous gains in the past few years. Algorithms can now recognize and localize thousands of objects with reasonable accuracy as well as identify other visual content, such as scenes and activities. For instance, there are now smartphone apps that can automatically sift through a user's photos and find all party pictures, or all pictures of cars, or all sunset photos. However, the type of "visual understanding" done by these methods is still rather superficial, exhibiting mostly rote memorization rather than true reasoning. For example, current algorithms have a hard time telling whether an image is typical (e.g., a car on a road) or unusual (e.g., a car in the sky), or answering simple questions about a photograph, e.g., "what are the people looking at?", "what just happened?", "what might happen next?" A central problem is that current methods lack data about the world outside of the photograph. To achieve true human-like visual understanding, computers will have to reason about the broader spatial, temporal, perceptual, and social context suggested by a given visual input. This project is using big visual data to gather large-scale deep semantic knowledge about events, physical and social interactions, and how people perceive the world and each other. The research focuses on developing methods to capture and represent this knowledge in a way that makes it broadly applicable to a range of visual understanding tasks. This will enable novel computer algorithms that have a deeper, more human-like understanding of the visual world and can function effectively in complex, real-world situations and environments. For example, if a robot can predict what a person might do next in a given situation, then the robot can better aid the person in their task.
Broader impacts will include new publicly available software tools and data that can be used for various visual reasoning tasks. Additionally, the project will have a multi-pronged educational component, including incorporating aspects of the research in the graduate teaching curriculum, undergraduate and K-12 outreach, as well as special mentoring and focused events for the advancement of women in computer science.
The main technical focus of this project is to advance computational recognition efforts toward producing a general human-like visual understanding of images and video that can function on previously unseen data, tasks, and settings. The aim of this project is to develop a new large-scale knowledge base called the visual Memex that extracts and stores a vast set of visual relationships between data items in a multi-graph representation, with nodes corresponding to data items and edges indicating different types of relationships. This large knowledge base will be used in a lambda-calculus-powered reasoning engine to make inferences about visual data on a global scale. Additionally, the project will test computational recognition algorithms on several visual understanding tasks designed to evaluate progress on a variety of aspects of visual understanding, ranging from linguistic (evaluating our understanding of imagery through language tasks such as visual question-answering) to purely visual (evaluating our understanding of spatial context through visual fill-in-the-blanks), temporal (evaluating our temporal understanding by predicting future states), and physical (evaluating our understanding of human-object and human-scene interactions by predicting affordances). Datasets, knowledge base, and evaluation tools will be hosted on the project web site (http://www.tamaraberg.com/grants/bigdata.html).
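The multi-graph representation described in the abstract (nodes as data items, edges labeled by relationship type) can be illustrated with a minimal sketch. This is not the project's implementation: the class name `VisualMultiGraph`, its methods, the image identifiers, and the relation labels `same_scene` and `temporal_next` are all hypothetical, chosen only to show how two items can be linked by more than one kind of edge.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def _nested_sets():
    # adjacency[src][dst] -> set of relation labels (hypothetical layout)
    return defaultdict(lambda: defaultdict(set))


@dataclass
class VisualMultiGraph:
    """Toy multi-graph: nodes are data items, edges carry a relation type."""

    adjacency: dict = field(default_factory=_nested_sets)

    def add_relation(self, src: str, dst: str, relation: str) -> None:
        # Several relation types may connect the same ordered pair of nodes.
        self.adjacency[src][dst].add(relation)

    def relations(self, src: str, dst: str) -> set:
        # .get avoids creating empty entries for unknown nodes.
        return set(self.adjacency.get(src, {}).get(dst, set()))

    def neighbors(self, src: str, relation: str) -> list:
        # All nodes reachable from src through an edge of the given type.
        return sorted(
            dst
            for dst, rels in self.adjacency.get(src, {}).items()
            if relation in rels
        )


# Illustrative data only; the identifiers and labels are assumptions.
graph = VisualMultiGraph()
graph.add_relation("img_001.jpg", "img_002.jpg", "same_scene")
graph.add_relation("img_001.jpg", "img_002.jpg", "temporal_next")
graph.add_relation("img_001.jpg", "img_003.jpg", "same_scene")

print(graph.relations("img_001.jpg", "img_002.jpg"))
print(graph.neighbors("img_001.jpg", "same_scene"))
```

Storing a set of labels per node pair is one simple way to allow parallel edges; a real knowledge base at the scale the abstract describes would presumably need persistent storage and indexing, which this sketch does not attempt.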
