
Web-Scale Semantic Image and Video Understanding


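Basic Information

Grant number:
RGPIN-2018-04657
Principal investigator:
Sigal, Leonid
Amount:
About $29,900
Host institution:
University of British Columbia
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2020
Funding country:
Canada
Start and end dates:
2020-01-01 to 2021-12-31
Project status:
Completed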

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=712121
Keywords:
Web Scale Semantic Image Video

Project Abstract

Visual recognition is a sub-field of computer vision that centers on building algorithms that can automatically and intelligently recognize, catalog and understand image/video content. Significant progress in accuracy and scope has been made on recognition problems in recent years, driven by powerful machine learning algorithms (e.g., deep learning), large labeled datasets and novel problem definitions. However, current approaches lack the capabilities needed for accurate, fine-grained, detailed scene understanding. Most algorithms are limited to coarse image- or video-level interpretations (e.g., choosing among 1,000 noun categories, 200 action classes, or 18 candidate answers for a visual question). Such understanding is useful, but it still limits the breadth of potential applications, which range from media curation to autonomous navigation. It is difficult to quantify what level of performance or scale is necessary for visual recognition to be broadly successful. I believe the next transformative milestone is detailed scene understanding at the accuracy of the current coarse models. This requires developing the capability to recognize fine-grained object categories, numbering on the order of 100,000, and spatio-temporal (predicate) relations among the objects and elements of the scene, numbering on the order of 1,000. The former would give the ability to recognize nearly every object/noun; the latter would enable situated contextual reasoning that is critically important for AI. My long-term research objective is to develop such accurate, detailed models for visual understanding at scale: models that can describe and localize objects and people, and reason about their spatial and functional relationships, their actions and their interactions. This proposal tackles three fundamental sub-challenges to achieving this objective, one per research thread:
1. The ever-growing fine-grained set of classes requires the development of novel data-efficient learning algorithms. As the categories to recognize become more specific, the amount of data per category decreases (e.g., there are millions of car images, but few of a 1957 Jaguar XKSS). We will build on our recent work where we developed the only method to date capable of recognizing up to 310,000 categories.
2. Moving beyond recognition of isolated objects requires reasoning about structures relating objects, people, and scene elements in space and time. Rich, flexible structured models will be developed to enable such reasoning.
3. To alleviate the black-box nature of existing architectures, which makes them unsuitable for decision-critical tasks, we will develop algorithms that enable interpretability and more human-like introspective reasoning.
Importantly, the program will also focus on applying the developed algorithms to specific recognition problems relevant for media search/retrieval, augmented reality and medical imaging.
