BIGDATA: F: Audio-Visual Scene Understanding


Basic Information

  • Award Number:
    1741472
  • Principal Investigator:
  • Amount:
    $650,000
  • Institution:
  • Institution Country:
    United States
  • Project Type:
    Standard Grant
  • Fiscal Year:
    2017
  • Funding Country:
    United States
  • Project Period:
    2017-09-01 to 2022-08-31
  • Project Status:
    Completed

Project Abstract

Understanding the scenes around us, i.e., recognizing objects, human actions and events, and inferring their spatial, temporal, correlative and causal relations, is a fundamental capability of human intelligence. Similarly, designing computer algorithms that can understand scenes is a fundamental problem in artificial intelligence. Humans consciously or unconsciously use all five senses (vision, audition, taste, smell, and touch) to understand a scene, as different senses provide complementary information. For example, watching a movie with the sound muted makes it very difficult to understand the movie, and walking on a street with one's eyes closed, without other guidance, can be dangerous. Existing machine scene understanding algorithms, however, are designed to rely on just a single modality. Taking the two most commonly used senses, vision and audition, as an example, there are scene understanding algorithms designed to handle each modality on its own, but no systematic investigation has been conducted into integrating the two modalities toward more comprehensive audio-visual scene understanding. Designing algorithms that jointly model the audio and visual modalities toward a complete audio-visual scene understanding is important, not only because this is how humans understand scenes, but also because it will enable novel applications in many fields. These fields include multimedia (video indexing and scene editing), healthcare (assistive devices for visually and aurally impaired people), surveillance and security (comprehensive monitoring of suspicious activities), and virtual and augmented reality (generation and alteration of visuals and/or soundtracks). In addition, the investigators will involve graduate and undergraduate students in the research activities, integrate research results into the teaching curriculum, and conduct outreach activities to local schools and communities with the aim of broadening participation in computer science.

This project aims to achieve human-like audio-visual scene understanding that overcomes the limitations of single-modality approaches through big-data analysis of Internet videos. The core idea is to learn to parse a scene into elements and infer their relations, i.e., to form an audio-visual scene graph. Specifically, an element of the audio-visual scene can be a joint audio-visual component of an event when the event shows correlated audio and visual features; it can also be an audio component or a visual component if the event appears in only one modality. The relations between the elements include spatial and temporal relations at a lower level, as well as correlative and causal relations at a higher level. Through this scene graph, information across the two modalities can be extracted, exchanged and interpreted. The investigators propose three main research thrusts: (1) learning joint audio-visual representations of scene elements; (2) learning a scene graph to organize scene elements; and (3) cross-modality scene completion. Each of the three research thrusts explores a dimension in the space of audio-visual scene understanding, yet they are also interconnected. For example, the audio-visual scene elements are nodes in the scene graph, and the scene graph, in turn, guides the learning of relations among scene elements with structured information; cross-modality scene completion generates missing data in the scene graph and is necessary for a good audio-visual understanding of the scene.
Expected outcomes of this proposal include: a software package for learning joint audio-visual representations of various scene elements; a web-deployed system for audio-visual scene understanding utilizing the learned scene elements and scene graphs, illustrated with text generation; a software package for cross-modality scene completion based on scene understanding; and a large-scale video dataset with annotations for audio-visual association, text generation and scene completion. Datasets, software and demos will be hosted on the project website.
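To make the scene-graph formulation in the abstract concrete, below is a minimal, hypothetical Python sketch of how such an audio-visual scene graph might be represented: nodes are audio, visual, or joint audio-visual scene elements (whose joint representations thrust 1 would learn), typed edges carry the lower-level spatial/temporal and higher-level correlative/causal relations (thrust 2), and single-modality nodes mark where cross-modality completion (thrust 3) would fill in missing data. All class, field, and function names here are illustrative assumptions, not the project's actual software.

```python
# Illustrative sketch only -- names and structure are assumptions, not the
# project's released software package.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    AUDIO = "audio"                # element heard but not seen
    VISUAL = "visual"              # element seen but not heard
    AUDIO_VISUAL = "audio-visual"  # event with correlated audio and visual features


class Relation(Enum):
    SPATIAL = "spatial"            # lower-level relation
    TEMPORAL = "temporal"          # lower-level relation
    CORRELATIVE = "correlative"    # higher-level relation
    CAUSAL = "causal"              # higher-level relation


@dataclass
class SceneElement:
    """A node of the scene graph: one parsed element of the scene."""
    label: str                                # e.g. "dog barking"
    modality: Modality
    embedding: Optional[List[float]] = None   # joint audio-visual representation (thrust 1)


@dataclass
class SceneEdge:
    """A directed, typed relation between two scene elements (thrust 2)."""
    source: SceneElement
    target: SceneElement
    relation: Relation


@dataclass
class SceneGraph:
    """Organizes scene elements and their relations for one video scene."""
    nodes: List[SceneElement] = field(default_factory=list)
    edges: List[SceneEdge] = field(default_factory=list)

    def single_modality_nodes(self) -> List[SceneElement]:
        """Nodes that cross-modality scene completion (thrust 3) would try to complete."""
        return [n for n in self.nodes if n.modality is not Modality.AUDIO_VISUAL]


# Example: a scene with a visible barking dog and an off-screen passing car.
dog = SceneElement("dog barking", Modality.AUDIO_VISUAL)
car = SceneElement("car passing off-screen", Modality.AUDIO)
graph = SceneGraph(
    nodes=[dog, car],
    edges=[SceneEdge(source=car, target=dog, relation=Relation.TEMPORAL)],
)
print([n.label for n in graph.single_modality_nodes()])  # ['car passing off-screen']
```

Keeping the two relation levels as explicit enum members mirrors the abstract's distinction between lower-level spatial/temporal relations and higher-level correlative/causal relations, and makes it easy to query the graph for nodes that still lack one modality.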

Project Outcomes

Journal articles (49)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Audio-Visual Event Localization in the Wild
Speech Driven Talking Face Generation From a Single Image and an Emotion Condition
  • DOI:
    10.1109/tmm.2021.3099900
  • Publication Date:
    2021-07-26
  • Journal:
  • Impact Factor:
    7.3
  • Authors:
    Eskimez, Sefik Emre;Zhang, You;Duan, Zhiyao
  • Corresponding Author:
    Duan, Zhiyao
One-Class Learning Towards Synthetic Voice Spoofing Detection
  • DOI:
    10.1109/lsp.2021.3076358
  • Publication Date:
    2020-10
  • Journal:
  • Impact Factor:
    3.9
  • Authors:
    You Zhang;Fei Jiang;Z. Duan
  • Corresponding Author:
    You Zhang;Fei Jiang;Z. Duan
How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos
Noise-Resilient Training Method for Face Landmark Generation From Speech

Other Publications by Chenliang Xu

Deep Audio Prior
  • DOI:
  • Publication Date:
    2019
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Yapeng Tian;Chenliang Xu;Dingzeyu Li
  • Corresponding Author:
    Dingzeyu Li
Scale-Adaptive Video Understanding
  • DOI:
  • Publication Date:
    2016
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
Audio-Visual Action Prediction with Soft-Boundary in Egocentric Videos
  • DOI:
  • Publication Date:
    2023
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Luchuan Song;Jing Bi;Chao Huang;Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
Audio-Visual Object Localization in Egocentric Videos
  • DOI:
  • Publication Date:
    2022
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chao Huang;Yapeng Tian;Anurag Kumar;Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
A Study of Actor and Action Semantic retention in Video Supervoxel Segmentation
  • DOI:
    10.1142/s1793351x13400114
  • Publication Date:
    2013
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chenliang Xu;Richard F. Doell;S. Hanson;C. Hanson;Jason J. Corso
  • Corresponding Author:
    Jason J. Corso


Other Grants by Chenliang Xu

III: Small: Collaborative Research: Scalable Deep Bayesian Tensor Decomposition
  • Award Number:
    1909912
  • Fiscal Year:
    2019
  • Funding Amount:
    $650,000
  • Project Type:
    Standard Grant
RI: Small: Learning Dynamics and Evolution towards Cognitive Understanding of Videos
  • Award Number:
    1813709
  • Fiscal Year:
    2018
  • Funding Amount:
    $650,000
  • Project Type:
    Standard Grant

Similar International Grants

EduSay™ - developing a digital, audio-visual and kinesthetic English pronunciation training programme for international students and professionals; upskilling communications for education, employability, UK productivity and integration
  • Award Number:
    10063001
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Collaborative R&D
Empowering Archivists: Applying New Tools and Approaches for Better Representation of Women in Audio-Visual Collections
  • Award Number:
    AH/Y007328/1
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Research Grant
User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild
  • Award Number:
    23K16912
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Grant-in-Aid for Early-Career Scientists
Audio-visual poetics for the environmental pollutions: A research on the documentaries and expressions of "Kogai" films
  • Award Number:
    22H00613
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Grant-in-Aid for Scientific Research (B)
Using eye tracking to examine audio-visual rhythm perception in infants
  • Award Number:
    572614-2022
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    University Undergraduate Student Research Awards
Emotional McGurk: Developing a novel tool to examine audio-visual integration of affective signals
  • Award Number:
    574638-2022
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    University Undergraduate Student Research Awards
Neural Rendering of object-based audio-visual scenes
  • Award Number:
    2644080
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Ghosts amongst us: an audio-visual exploration of haunting in Palestine
  • Award Number:
    2733997
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Audio-visual object-based dynamic scene representation from monocular video
  • Award Number:
    2701695
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Towards in-vehicle situation awareness using visual and audio sensors
  • Award Number:
    LP210200931
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Linkage Projects