Visual Sense. Tagging visual data with semantic descriptions

Basic information

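Grant number:
EP/K01904X/2
Principal investigator:
Krystian Mikolajczyk
Amount:
$74,600
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2015
Funding country:
United Kingdom
Start and end dates:
2015 to (no data)
Project status:
Completed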

Source:
https://gtr.ukri.org/projects?ref=EP%2FK01904X%2F2
Keywords:
Visual Sense. Tagging visual data

Project abstract

Recent years have witnessed an unprecedented growth in the number of image and video collections, partially due to the increased popularity of photo and video sharing websites. One such website alone (Flickr) stores billions of images. And this is not the only way in which visual content is present on the Web: in fact most web pages contain some form of visual content. However, while most traditional tools for search and retrieval can successfully handle textual content, they are not prepared to handle heterogeneous documents. This new type of content demands the development of new efficient tools for search and retrieval.

The large number of readily accessible multimedia data collections poses both an opportunity and a challenge. The opportunity lies in the potential to mine this data to automatically discover mappings between visual and textual content. The challenge is to develop tools to classify, filter, browse and search such heterogeneous data. In brief, the data is available, but the tools to make sense of it are missing.

The Visual Sense project aims to automatically mine the semantic content of visual data to enable "machine reading" of images. In recent years, we have witnessed significant advances in the automatic recognition of visual concepts. These advances allowed for the creation of systems that can automatically generate keyword-based image annotations. However, these annotations, e.g. "man" and "pot", fall far short of the more meaningful descriptive captions necessary for indexing and retrieval of images, for example, "Man cooking in kitchen". The goal of this project is to move a step forward and predict semantic image representations that can be used to generate more informative sentence-based image annotations, thus facilitating search and browsing of large multi-modal collections. It will address the following key open research challenges:

1) Develop methods that can derive a semantic representation of visual content. Such representations must go beyond the detection of objects and scenes and also include a wide range of object relations.
2) Extend state-of-the-art natural language techniques to the tasks of mining large collections of multi-modal documents and generating image captions using both semantic representations of visual content and object/scene type models derived from semantic representations of the textual component of multi-modal documents.
3) Develop learning algorithms that can exploit available multi-modal data to discover mappings between visual and textual content. These algorithms should be able to leverage 'weakly' annotated data and be robust to large amounts of noise.

Thus, the main focus of the Visual Sense project is the development of machine learning methods for knowledge and information extraction from large collections of visual and textual content and for the fusion of this information across modalities. The tools and techniques developed in this project will have a variety of applications. To demonstrate them, we will address three case studies: 1) evaluation of generated descriptive image captions in established international image annotation benchmarks, 2) re-ranking for improved image search and 3) automatic illustration of articles with images.

To address these broad challenges, the project will build on expertise from multiple disciplines, including computer vision, machine learning and natural language processing (NLP). It brings together four research groups from University of Surrey (Surrey, UK), Institut de Robotica i Informatica Industrial (IRI, Spain), Ecole Centrale de Lyon (ECL, France), and University of Sheffield (Sheffield, UK), each with well-established and complementary expertise in their respective areas of research.
