INT2-Medium: Understanding the meaning of images


Basic Information

  • Award Number:
    0803603
  • Principal Investigator:
  • Amount:
    $550,000
  • Host Institution:
  • Host Institution Country:
    United States
  • Award Type:
    Standard Grant
  • Fiscal Year:
    2008
  • Funding Country:
    United States
  • Project Period:
    2008-08-15 to 2012-07-31
  • Project Status:
    Completed

Project Abstract

The ability to recognize objects in images is a core problem in computer vision. The last decade has seen astonishing advances in our methods for building object detectors. However, images convey richer information about the objects depicted in them: objects may form a scene ("A view of mountains and meadows"); objects stand in relations to one another ("The cat sits on the mat"); different instances may look different ("The tabby cat sits on the blue mat"); objects may act on others ("The cat is chasing the mouse"). The task of identifying the entities depicted in an image, together with their attributes and relations, is image understanding. It poses a number of new research questions: Which objects should one remark on? What attributes of, and relations between, the objects depicted in the image are important? That is, what is the visually salient information conveyed in an image?

Many images (e.g. a large fraction of those on the web) are accompanied by text that describes or gives additional information about the entities depicted in them. The entities referred to in this text are typically visually salient ones. This correspondence between the information conveyed in the text and in the image can be used in the creation of image understanding systems. Much current work treats image annotations as collections of individual words. The richer representations of meaning required to train image understanding systems can be obtained if annotating text is treated as sentences rather than just bags of words. Sentences provide cues to what is salient in an image, what salient objects are likely to look like (e.g. color, texture and form), and what relations might hold between them. Exposing this information will provide a rich body of training data for the next generation of computer vision systems.

Research in natural language processing has created statistical wide-coverage parsers that can recover the semantic interpretation of sentences. These parsers differ from purely syntactic parsers in that they are based on linguistically expressive grammars that allow such interpretations to be built directly from the syntactic analysis. However, linking sentences with accompanying images requires a level of representation that goes beyond lists of the entities, states and events mentioned in a sentence. The writer of an image caption will typically assume that the reader sees the image, and can therefore refer to the entities depicted in it as known to the reader. There is a need for parsers that are able to uncover the information structure of sentences -- what information is assumed to be shared knowledge between speaker and hearer, and what new information is asserted by the sentence. How information structure is encoded in natural language is well understood, and it can be modeled with the same kinds of grammars used by parsers that return semantic interpretations. Although there are currently no large corpora annotated with information structure, the project will exploit the correspondence between images and their captions to develop novel, partially supervised training regimes for parsers. These training regimes could also enable the bootstrapping of parsers for languages with little or no annotated training data.

This project will build a novel parser that recovers richer linguistic representations, including information structure. It will also build a novel image understanding system that recovers the salient entities depicted in an image together with their attributes and relations.
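To make the "richer than a bag of words" representation concrete, the following is a minimal, hypothetical Python sketch of the kind of interpretation a caption such as "The tabby cat sits on the blue mat" might yield (entities with attributes, linked by relations). The class names and fields are invented for illustration and are not part of the project's actual software.

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        name: str                                       # head noun, e.g. "cat"
        attributes: list = field(default_factory=list)  # e.g. ["tabby"]

    @dataclass
    class Relation:
        predicate: str                                  # e.g. "sit_on"
        subject: Entity
        obj: Entity

    # Hand-built interpretation of "The tabby cat sits on the blue mat".
    cat = Entity("cat", ["tabby"])
    mat = Entity("mat", ["blue"])
    scene = [Relation("sit_on", subject=cat, obj=mat)]

    for r in scene:
        print(f"{' '.join(r.subject.attributes)} {r.subject.name} "
              f"--{r.predicate}--> {' '.join(r.obj.attributes)} {r.obj.name}")
    # tabby cat --sit_on--> blue mat

A bag-of-words annotation ("cat", "mat", "tabby", "blue") would lose exactly the structure this sketch keeps: which attribute belongs to which object, and who is doing what to whom.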
The project will train these systems both separately, on datasets consisting of sentences marked up with correct parses and images marked up with labels attached to objects, and jointly, on a dataset of captioned images.

Intellectual merits: The project goals are ambitious, but within reach, because both object recognition and parsing technology have advanced significantly. The project presents the vision and parsing communities with new goals that are practically important and technically demanding. The aim of integrating natural language processing and computer vision creates a novel impetus to develop parsers that return richer linguistic representations, which will in turn have a deep impact on research within the natural language processing community itself. It will open up key directions in computer vision and natural language processing by demanding and enabling the recovery of richer representations of linguistic and visual information, and by studying how linguistic descriptions are grounded in the visual world.

Broader impact: The project has significant practical implications in a number of areas, such as image search and natural language interfaces for robotics, and will ultimately pave the way for new applications such as automatic captioning systems. The resulting advances in object recognition offer possibilities for safer autonomous vehicles, safer homes with better home care, and efficient management of surveillance data.

URL: http://luthuli.cs.uiuc.edu/~daf/meaningofimages.html
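As a rough illustration of the weak supervision that joint training on captioned images can provide, here is a minimal, hypothetical Python sketch (not the project's actual pipeline): entities and attributes recovered from a caption's parse are matched to detector outputs, yielding attribute-labelled regions that object labels alone would not supply. All names and data below are invented for illustration.

    # Pretend detector output for one image: (category label, bounding box).
    detections = [
        ("cat",   (30, 40, 120, 160)),
        ("mat",   (10, 150, 200, 220)),
        ("plant", (180, 20, 230, 90)),
    ]

    # Entities and attributes recovered from the parsed caption
    # "The tabby cat sits on the blue mat".
    caption_entities = {
        "cat": ["tabby"],
        "mat": ["blue"],
    }

    # Align by label: each match pairs a region with the attributes the caption
    # asserts about that entity -- supervision a detector alone does not give.
    training_pairs = [
        (box, caption_entities[label])
        for label, box in detections
        if label in caption_entities
    ]

    print(training_pairs)
    # [((30, 40, 120, 160), ['tabby']), ((10, 150, 200, 220), ['blue'])]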

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other Publications by David Forsyth

Supplement - Convex Decomposition of Indoor Scenes
  • DOI:
  • Publication Date:
  • Journal:
  • Impact Factor:
    0
  • Authors:
    David Forsyth
  • Corresponding Author:
    David Forsyth
Hidden Markov Models
  • DOI:
    10.1007/978-3-030-18114-7_13
  • Publication Date:
    2019
  • Journal:
  • Impact Factor:
    0
  • Authors:
    David Forsyth
  • Corresponding Author:
    David Forsyth
Preserving Image Properties Through Initializations in Diffusion Models
Scientific report on Modeling and Prediction of Human Intent for Primitive Activation
  • DOI:
  • Publication Date:
    2014
  • Journal:
  • Impact Factor:
    0
  • Authors:
    David Forsyth
  • Corresponding Author:
    David Forsyth
Fully spectrum-sliced four-wave mixing wavelength conversion in a Semiconductor Optical Amplifier

Other Grants by David Forsyth

RI: Medium: Creating Knowledge with All-Novel-Class Computer Vision
  • Award Number:
    2106825
  • Fiscal Year:
    2021
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant
Collaborative Research: Computational Behavioral Science: Modeling, Analysis, and Visualization of Social and Communicative Behavior
  • Award Number:
    1029035
  • Fiscal Year:
    2010
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant
RI: Small: Exploiting Geometric and Illumination Context in Indoor Scenes
  • Award Number:
    0916014
  • Fiscal Year:
    2009
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
Interpreting Human Behaviour in Video using FSA's and Object Context
  • Award Number:
    0534837
  • Fiscal Year:
    2006
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant
Finding and Tracking People from the Bottom Up
  • Award Number:
    0098682
  • Fiscal Year:
    2001
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant
Purchase of a Molecular Modeling System
  • Award Number:
    9974642
  • Fiscal Year:
    1999
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
SGER: MCMC Algorithms for Object Recognition
  • Award Number:
    9979201
  • Fiscal Year:
    1999
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
A Spiral Approach to Chemical Concepts Using GC/MS
  • Award Number:
    9850580
  • Fiscal Year:
    1998
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
Workshop on Shape, Contour and Grouping
  • Award Number:
    9712426
  • Fiscal Year:
    1997
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
Recognising curved surfaces from their outlines
  • Award Number:
    9596025
  • Fiscal Year:
    1994
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant

Similar NSFC Grants

Plasmon-enhanced optical response in composite low-dimensional topological materials
  • Award Number:
    12374288
  • Approval Year:
    2023
  • Funding Amount:
    CNY 520,000
  • Award Type:
    General Program
The disappearing medium-sized enterprises from the perspective of market management and intervention in the division of labor: stylized facts, underlying mechanisms, and optimization paths
  • Award Number:
    72374217
  • Approval Year:
    2023
  • Funding Amount:
    CNY 410,000
  • Award Type:
    General Program
Multiscale algorithms and numerical simulation of plasmas in tokamak divertors
  • Award Number:
    12371432
  • Approval Year:
    2023
  • Funding Amount:
    CNY 435,000
  • Award Type:
    General Program
Dark matter distribution near intermediate-mass black holes and gravitational-wave echo detection in their IMRI systems
  • Award Number:
    12365008
  • Approval Year:
    2023
  • Funding Amount:
    CNY 320,000
  • Award Type:
    Regional Science Fund Program
Physical mechanisms of rapid intensification of asymmetric tropical cyclones under moderate vertical wind shear
  • Award Number:
    42305004
  • Approval Year:
    2023
  • Funding Amount:
    CNY 300,000
  • Award Type:
    Young Scientists Fund

Similar Overseas Grants

Theoretical and empirical research on secondary science curriculum development aimed at understanding "scientific uncertainty"
  • Award Number:
    24KJ1727
  • Fiscal Year:
    2024
  • Funding Amount:
    $550,000
  • Award Type:
    Grant-in-Aid for JSPS Fellows
Collaborative Research: SaTC: CORE: Medium: Understanding the Impact of Privacy Interventions on the Online Publishing Ecosystem
  • Award Number:
    2237329
  • Fiscal Year:
    2023
  • Funding Amount:
    $550,000
  • Award Type:
    Standard Grant
Collaborative Research: SaTC: CORE: Medium: Understanding and Combatting Impersonation Attacks and Data Leakage in Online Advertising
  • Award Number:
    2247516
  • Fiscal Year:
    2023
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant
Postdoctoral Fellowship: AAPF: All Shook Up: Understanding the Chemistry, Dynamics, and Kinematics of the Diffuse Interstellar Medium
  • Award Number:
    2303902
  • Fiscal Year:
    2023
  • Funding Amount:
    $550,000
  • Award Type:
    Fellowship Award
Collaborative Research: SaTC: TTP: Medium: iDRAMA.cloud: A Platform for Measuring and Understanding Information Manipulation
  • Award Number:
    2247867
  • Fiscal Year:
    2023
  • Funding Amount:
    $550,000
  • Award Type:
    Continuing Grant