CAREER: Natural Narratives and Multimodal Context as Weak Supervision for Learning Object Categories

Basic Information

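Award number:
2046853
Principal investigator:
Adriana Kovashka
Amount:
$547,100 (rounded)
Host institution:
University of Pittsburgh
Host institution country:
United States
Project type:
Continuing Grant
Fiscal year:
2021
Funding country:
United States
Project period:
2021-05-01 to 2026-04-30
Project status:
Ongoing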

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2046853&HistoricalAwards=false
Keywords:
CAREER; Natural Narratives; Multimodal Context

Project Abstract

This project develops a framework to train computer vision models for detection of objects from weak, naturally-occurring supervision of language (text or speech) and additional multimodal signals. It considers dynamic settings, where humans interact with their visual environment and refer to the encountered objects, e.g., “Carefully put the tomato plants in the ground” and “Please put the phone down and come set the table,” and captions written for a human audience to complement an image, e.g., news article captions. The challenge of using such language-based supervision for training detection systems is that along with useful signal, the speech contains many irrelevant tokens. The project will benefit society by exploring novel avenues for overcoming this challenge and reducing the need for expensive and potentially unnatural crowdsourced labels for training. It has the potential to make object detection systems more scalable and thus more usable by a broad user base in a variety of settings. The resources and tools developed would allow natural, lightweight learning in different environments, e.g., different languages or types of imagery where the well-known object categories are not useful or where there is a shift in both the pixels as well as the way in which humans refer to objects (different cultures, medicine, art). This project opens possibilities for learning in vivo rather than in vitro; while the focus here is on object categories, multimodal weak supervision is useful for a larger variety of tasks. Research and education are integrated through local community outreach and research mentoring for students from lesser-known universities, new programs for student training including honing graduate students' writing skills, and development of interactive educational modules and demos based on research findings. 
This project creatively connects two domains, vision-and-language, and object detection, and pioneers training of object detection models with weak language supervision and a large vocabulary of potential classes. The impact of noise in the language channel will be mitigated through three complementary techniques that model visual concreteness of words, to what extent the text refers to the visual environment it appears with, and whether the weakly-supervised models that are learned are logically consistent. Two complementary word-region association mechanisms will be used (metric learning and cross-modal transformers), whose application is novel for weakly-supervised detection. Importantly, to make detection feasible, not only the semantics of image-text pairs, but their discourse relationship, will be captured. To facilitate and disambiguate the association of words to a physical environment, the latter will be represented through additional modalities, namely sound, motion, depth and touch, which are either present in the data or estimated. This project advances knowledge of how multimodal cues contextualize the relation between image and text; no prior work has modeled image-text relationships along multiple channels (sound, depth, touch, motion). Finally, to connect the appearance of objects to the purpose and use of these objects, relationships between objects, properties and actions will be semantically organized in a graph, and grammars to represent activities involving objects will be extracted, still maintaining the weakly-supervised setting.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outputs

Journal papers (6)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Weakly-Supervised Action Detection Guided by Audio Narration

DOI:
10.1109/cvprw56347.2022.00159
Published:
2022-05
Journal:
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Impact factor:
0
Authors:
Keren Ye; Adriana Kovashka
Corresponding authors:
Keren Ye; Adriana Kovashka

Improving language-supervised object detection with linguistic structure analysis

DOI:
10.1109/cvprw59228.2023.00588
Published:
2023-06
Journal:
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Impact factor:
0
Authors:
Arushi Rai; Adriana Kovashka
Corresponding authors:
Arushi Rai; Adriana Kovashka

Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth

DOI:
10.1109/wacv57701.2024.00079
Published:
2023-03
Journal:
2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Impact factor:
0
Authors:
Cagri Gungor; Adriana Kovashka
Corresponding authors:
Cagri Gungor; Adriana Kovashka

Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection

DOI:
10.1109/wacv56688.2023.00222
Published:
2023-01
Journal:
2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Impact factor:
0
Authors:
Cagri Gungor; Adriana Kovashka
Corresponding authors:
Cagri Gungor; Adriana Kovashka

Hypernymization of named entity-rich captions for grounding-based multi-modal pretraining

DOI:
10.1145/3591106.3592223
Published:
2023-04
Journal:
Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
Impact factor:
0
Authors:
Giacomo Nebbia; Adriana Kovashka
Corresponding authors:
Giacomo Nebbia; Adriana Kovashka

Other publications by Adriana Kovashka

Detecting Sexually Provocative Images

DOI:
10.1109/wacv.2017.79
Published:
2017
Journal:
2017 IEEE Winter Conference on Applications of Computer Vision (WACV)
Impact factor:
0
Authors:
Debashis Ganguly; Mohammad H. Mofrad; Adriana Kovashka
Corresponding author:
Adriana Kovashka

Syntharch: Interactive Image Search with Attribute-Conditioned Synthesis

DOI:
10.1109/cvprw50498.2020.00093
Published:
2020
Journal:
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Impact factor:
0
Authors:
Zac Yu; Adriana Kovashka
Corresponding author:
Adriana Kovashka

Inferring Visual Persuasion via Body Language, Setting, and Deep Features

DOI:
10.1109/cvprw.2016.102
Published:
2016
Journal:
2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Impact factor:
0
Authors:
Xinyue Huang; Adriana Kovashka
Corresponding author:
Adriana Kovashka