
Automatic Alignment of Text-to-Video for Semantic Multimedia Analysis


Basic Information

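Grant number:
252286362
Principal investigator:
Professor Dr.-Ing. Rainer Stiefelhagen
Amount:
--
Host institution:
Institut für Anthropomatik und Robotik (IAR)
Host institution country:
Germany
Project type:
Research Grants
Fiscal year:
2014
Funding country:
Germany
Duration:
2013-12-31 to 2017-12-31
Project status:
Completed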

Source:
https://gepris.dfg.de/gepris/projekt/252286362?language=en
Keywords:
Automatic Alignment Text-to-Video Semantic

Project Abstract

In this project, we aim to explore rich textual descriptions of video data (TV series and movies), which open up many possibilities for multimedia analysis and understanding, and for obtaining weak labels for popular computer vision tasks. We focus on two forms of text: plot synopses and books. Plot synopses are obtained via crowdsourcing and describe the episode or movie in summarized form. In contrast, books (from which the video is adapted) provide detailed descriptions of the story and of the visual world the author wishes to portray.

While text in the form of subtitles and transcripts has been successfully used to automate person identification [Everingham 2006] or to obtain samples for action recognition [Laptev 2008], those text sources are limited in their potential for understanding the story or obtaining rich descriptions of it.

To use the plot synopses, we will first align the sentences of the synopsis to shots in the video (WP2). We propose to use anchors, primarily person identities (person-id), to help guide the alignment. We aim to solve two main challenges associated with this task: the possible non-linearity of the plot synopsis and the skipping of shots.

For books, in contrast, the first step is to align chapters with their corresponding video shots (WP3). We can expect that some dialogues in the books match the ones used in the video adaptation. This allows us to automatically identify characters and learn person models in a second step, and also facilitates fine-grained alignment within a chapter.

The alignment can be improved by knowing more about the scene or objects present in the shots. We will investigate this interconnected behaviour of labels and anchors in WP4, first in an iterative manner, and then by jointly modeling the two tasks of obtaining weak labels and performing alignment.

We divide the applications into two types: (i) obtaining labels from the text sources and (ii) video-related applications. From plot synopses, we will specifically aim to obtain weak labels for places or scenes (WP5-P1). We will also explore tasks such as summarization, indexing and retrieval (WP5-P2). For example, a coherent video summary based on the story (rather than on low-level features) can be generated by first running a text summarizer on the plot and then selecting the set of shots aligned to the retained sentences. Indexing the descriptions by keyword can also make the video easy to browse. From books, we wish to exploit dialogues to obtain supervision for person identification, and the rich descriptions surrounding the dialogues to learn attributes for the characters, scenes and objects (WP5-P1). Another interesting application is to automatically find differences between books and their video adaptations (WP5-P2).
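
As a rough illustration of the sentence-to-shot alignment (WP2) and of the story-based summary mentioned above, the following minimal Python sketch may help. It is not the project's method. It assumes that the characters mentioned in each synopsis sentence and the characters identified in each shot are already available as sets of names, scores each sentence/shot pair by Jaccard overlap (an illustrative choice), and finds the best order-preserving assignment by dynamic programming. The function names `jaccard`, `align_shots_to_sentences` and `summary_shots` are hypothetical. Because the assignment is forced to follow the order of the synopsis, the sketch does not address the two challenges named in the abstract, namely non-linear synopses and skipped shots.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of character names (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_shots_to_sentences(
    sentence_chars: Sequence[Set[str]],
    shot_chars: Sequence[Set[str]],
) -> List[int]:
    """Return, for each shot, the index of the sentence it is assigned to.

    Assumption of this sketch: sentence indices never decrease from one
    shot to the next, i.e. the synopsis follows the video order.
    """
    n, m = len(sentence_chars), len(shot_chars)
    if n == 0 or m == 0:
        return []
    sim = [[jaccard(s, c) for c in shot_chars] for s in sentence_chars]

    # best[j][i]: best total score for shots 0..j with shot j on sentence i.
    best = [[0.0] * n for _ in range(m)]
    back = [[0] * n for _ in range(m)]
    for i in range(n):
        best[0][i] = sim[i][0]
    for j in range(1, m):
        # Running maximum over sentences k <= i for the previous shot.
        run_best, run_arg = best[j - 1][0], 0
        for i in range(n):
            if best[j - 1][i] > run_best:
                run_best, run_arg = best[j - 1][i], i
            best[j][i] = run_best + sim[i][j]
            back[j][i] = run_arg

    # Backtrack from the best final sentence.
    i = max(range(n), key=lambda k: best[m - 1][k])
    assignment = [0] * m
    for j in range(m - 1, -1, -1):
        assignment[j] = i
        i = back[j][i]
    return assignment


def summary_shots(assignment: Sequence[int], retained: Sequence[int]) -> List[int]:
    """Indices of shots aligned to the sentences kept by a text summarizer."""
    keep = set(retained)
    return [j for j, i in enumerate(assignment) if i in keep]


# Toy data with made-up character names.
sentences = [{"Alice", "Bob"}, {"Carol"}, {"Alice", "Carol"}]
shots = [{"Alice"}, {"Alice", "Bob"}, {"Carol"}, {"Carol"}, {"Alice", "Carol"}]
assignment = align_shots_to_sentences(sentences, shots)
print(assignment)                         # sentence index per shot
print(summary_shots(assignment, [0, 2]))  # shots for retained sentences 0 and 2
```

In this sketch a video summary is simply the list of shots whose assigned sentence survives text summarization; scene or object labels, as discussed for WP4, could be added as further terms in the similarity score.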
