BIGDATA: F: Audio-Visual Scene Understanding


Basic Information

  • Award Number:
    1741472
  • Principal Investigator:
  • Amount:
    $650,000
  • Institution:
  • Institution Country:
    United States
  • Project Type:
    Standard Grant
  • Fiscal Year:
    2017
  • Funding Country:
    United States
  • Project Period:
    2017-09-01 to 2022-08-31
  • Project Status:
    Completed

Project Abstract

Understanding the scenes around us, i.e., recognizing objects, human actions and events, and inferring their spatial, temporal, correlative and causal relations, is a fundamental capability of human intelligence. Similarly, designing computer algorithms that can understand scenes is a fundamental problem in artificial intelligence. Humans consciously or unconsciously use all five senses (vision, audition, taste, smell, and touch) to understand a scene, as different senses provide complementary information. For example, watching a movie with the sound muted makes it very difficult to understand the movie, and walking on a street with one's eyes closed, without other guidance, can be dangerous. Existing machine scene understanding algorithms, however, are designed to rely on just a single modality. Taking the two most commonly used senses, vision and audition, as an example, there are scene understanding algorithms designed to handle each modality on its own, but no systematic investigation has been conducted into integrating the two modalities toward more comprehensive audio-visual scene understanding. Designing algorithms that jointly model the audio and visual modalities toward a complete audio-visual scene understanding is important, not only because this is how humans understand scenes, but also because it will enable novel applications in many fields. These fields include multimedia (video indexing and scene editing), healthcare (assistive devices for visually and aurally impaired people), surveillance and security (comprehensive monitoring of suspicious activities), and virtual and augmented reality (generation and alteration of visuals and/or soundtracks). In addition, the investigators will involve graduate and undergraduate students in the research activities, integrate research results into the teaching curriculum, and conduct outreach activities to local schools and communities with the aim of broadening participation in computer science.

This project aims to achieve human-like audio-visual scene understanding that overcomes the limitations of single-modality approaches through big-data analysis of Internet videos. The core idea is to learn to parse a scene into elements and infer their relations, i.e., to form an audio-visual scene graph. Specifically, an element of the audio-visual scene can be a joint audio-visual component of an event when the event shows correlated audio and visual features; it can also be an audio component or a visual component if the event appears in only one modality. The relations between the elements include spatial and temporal relations at a lower level, as well as correlative and causal relations at a higher level. Through this scene graph, information across the two modalities can be extracted, exchanged and interpreted. The investigators propose three main research thrusts: (1) learning joint audio-visual representations of scene elements; (2) learning a scene graph to organize scene elements; and (3) cross-modality scene completion. Each of the three research thrusts explores a dimension in the space of audio-visual scene understanding, yet they are also interconnected. For example, the audio-visual scene elements are nodes in the scene graph, and the scene graph, in turn, guides the learning of relations among scene elements with structured information; cross-modality scene completion generates missing data in the scene graph and is necessary for a good audio-visual understanding of the scene.
Expected outcomes of this proposal include: a software package for learning joint audio-visual representations of various scene elements; a web-deployed system for audio-visual scene understanding utilizing the learned scene elements and scene graphs, illustrated with text generation; a software package for cross-modality scene completion based on scene understanding; and a large-scale video dataset with annotations for audio-visual association, text generation and scene completion. Datasets, software and demos will be hosted on the project website.
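To make the scene-graph formulation in the abstract concrete, below is a minimal, hypothetical Python sketch of how such an audio-visual scene graph might be represented: nodes are audio, visual, or joint audio-visual scene elements (whose joint representations thrust 1 would learn), typed edges carry the lower-level spatial/temporal and higher-level correlative/causal relations (thrust 2), and single-modality nodes mark where cross-modality completion (thrust 3) would fill in missing data. All class, field, and function names here are illustrative assumptions, not the project's actual software.

```python
# Illustrative sketch only -- names and structure are assumptions, not the
# project's released software package.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    AUDIO = "audio"                # element heard but not seen
    VISUAL = "visual"              # element seen but not heard
    AUDIO_VISUAL = "audio-visual"  # event with correlated audio and visual features


class Relation(Enum):
    SPATIAL = "spatial"            # lower-level relation
    TEMPORAL = "temporal"          # lower-level relation
    CORRELATIVE = "correlative"    # higher-level relation
    CAUSAL = "causal"              # higher-level relation


@dataclass
class SceneElement:
    """A node of the scene graph: one parsed element of the scene."""
    label: str                                # e.g. "dog barking"
    modality: Modality
    embedding: Optional[List[float]] = None   # joint audio-visual representation (thrust 1)


@dataclass
class SceneEdge:
    """A directed, typed relation between two scene elements (thrust 2)."""
    source: SceneElement
    target: SceneElement
    relation: Relation


@dataclass
class SceneGraph:
    """Organizes scene elements and their relations for one video scene."""
    nodes: List[SceneElement] = field(default_factory=list)
    edges: List[SceneEdge] = field(default_factory=list)

    def single_modality_nodes(self) -> List[SceneElement]:
        """Nodes that cross-modality scene completion (thrust 3) would try to complete."""
        return [n for n in self.nodes if n.modality is not Modality.AUDIO_VISUAL]


# Example: a scene with a visible barking dog and an off-screen passing car.
dog = SceneElement("dog barking", Modality.AUDIO_VISUAL)
car = SceneElement("car passing off-screen", Modality.AUDIO)
graph = SceneGraph(
    nodes=[dog, car],
    edges=[SceneEdge(source=car, target=dog, relation=Relation.TEMPORAL)],
)
print([n.label for n in graph.single_modality_nodes()])  # ['car passing off-screen']
```

Keeping the two relation levels as explicit enum members mirrors the abstract's distinction between lower-level spatial/temporal relations and higher-level correlative/causal relations, and makes it easy to query the graph for nodes that still lack one modality.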

Project Outcomes

Journal articles (49)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Audio-Visual Event Localization in the Wild
Speech Driven Talking Face Generation From a Single Image and an Emotion Condition
  • DOI:
    10.1109/tmm.2021.3099900
  • Publication Date:
    2021-07-26
  • Journal:
  • Impact Factor:
    7.3
  • Authors:
    Eskimez, Sefik Emre;Zhang, You;Duan, Zhiyao
  • Corresponding Author:
    Duan, Zhiyao
One-Class Learning Towards Synthetic Voice Spoofing Detection
  • DOI:
    10.1109/lsp.2021.3076358
  • Publication Date:
    2020-10
  • Journal:
  • Impact Factor:
    3.9
  • Authors:
    You Zhang;Fei Jiang;Z. Duan
  • Corresponding Author:
    You Zhang;Fei Jiang;Z. Duan
How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos
Noise-Resilient Training Method for Face Landmark Generation From Speech

Other Publications by Chenliang Xu

Deep Audio Prior
  • DOI:
  • Publication Date:
    2019
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Yapeng Tian;Chenliang Xu;Dingzeyu Li
  • Corresponding Author:
    Dingzeyu Li
Scale-Adaptive Video Understanding
  • DOI:
  • Publication Date:
    2016
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
Audio-Visual Action Prediction with Soft-Boundary in Egocentric Videos
  • DOI:
  • Publication Date:
    2023
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Luchuan Song;Jing Bi;Chao Huang;Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
Audio-Visual Object Localization in Egocentric Videos
  • DOI:
  • Publication Date:
    2022
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chao Huang;Yapeng Tian;Anurag Kumar;Chenliang Xu
  • Corresponding Author:
    Chenliang Xu
A Study of Actor and Action Semantic retention in Video Supervoxel Segmentation
  • DOI:
    10.1142/s1793351x13400114
  • Publication Date:
    2013
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Chenliang Xu;Richard F. Doell;S. Hanson;C. Hanson;Jason J. Corso
  • Corresponding Author:
    Jason J. Corso


Other Grants by Chenliang Xu

III: Small: Collaborative Research: Scalable Deep Bayesian Tensor Decomposition
  • Award Number:
    1909912
  • Fiscal Year:
    2019
  • Funding Amount:
    $650,000
  • Project Type:
    Standard Grant
RI: Small: Learning Dynamics and Evolution towards Cognitive Understanding of Videos
  • Award Number:
    1813709
  • Fiscal Year:
    2018
  • Funding Amount:
    $650,000
  • Project Type:
    Standard Grant

Similar International Grants

EduSay™ - developing a digital, audio-visual and kinesthetic English pronunciation training programme for international students and professionals; upskilling communications for education, employability, UK productivity and integration
  • Award Number:
    10063001
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Collaborative R&D
Empowering Archivists: Applying New Tools and Approaches for Better Representation of Women in Audio-Visual Collections
  • Award Number:
    AH/Y007328/1
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Research Grant
User-centric Audio-Visual Scene Understanding for Augmented Reality Smart Glasses in the Wild
  • Award Number:
    23K16912
  • Fiscal Year:
    2023
  • Funding Amount:
    $650,000
  • Project Type:
    Grant-in-Aid for Early-Career Scientists
Audio-visual poetics for the environmental pollutions: A research on the documentaries and expressions of "Kogai" films
  • Award Number:
    22H00613
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Grant-in-Aid for Scientific Research (B)
Using eye tracking to examine audio-visual rhythm perception in infants
  • Award Number:
    572614-2022
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    University Undergraduate Student Research Awards
Emotional McGurk: Developing a novel tool to examine audio-visual integration of affective signals
  • Award Number:
    574638-2022
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    University Undergraduate Student Research Awards
Neural Rendering of object-based audio-visual scenes
  • Award Number:
    2644080
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Ghosts amongst us: an audio-visual exploration of haunting in Palestine
  • Award Number:
    2733997
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Audio-visual object-based dynamic scene representation from monocular video
  • Award Number:
    2701695
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Studentship
Towards in-vehicle situation awareness using visual and audio sensors
  • Award Number:
    LP210200931
  • Fiscal Year:
    2022
  • Funding Amount:
    $650,000
  • Project Type:
    Linkage Projects