Completely Unsupervised Multimodal Character Identification on TV Series and Movies
Basic Information
- Grant number: 316692988
- Principal investigator:
- Funding amount: --
- Host institution:
- Host institution country: Germany
- Project category: Research Grants
- Fiscal year: 2016
- Funding country: Germany
- Period: 2015-12-31 to 2020-12-31
- Project status: Completed
- Source:
- Keywords:
Project Abstract
Automatic character identification in multimedia videos is a broad and challenging problem. Person identification serves as a foundation and building block for many higher-level video analysis tasks, such as semantic indexing, search and retrieval, interaction analysis, and video summarization. The goal of this project is to exploit textual, audio, and video information to automatically identify characters in TV series and movies without requiring any manual annotation for training character models. A fully automatic and unsupervised approach is especially appealing given the huge amount and rapid growth of available multimedia data. Text, audio, and video provide complementary cues to a person's identity and therefore allow a person to be identified more reliably than from any single modality alone.
To this end, we will address three main research questions: unsupervised clustering of speech turns (i.e. speaker diarization) and face tracks, in order to group similar tracks of the same person without prior labels or models; unsupervised identification by propagating automatically generated weak labels from various sources of information (such as subtitles and speech transcripts); and multimodal fusion of acoustic, visual, and textual cues at various levels of the identification pipeline.
While many generic approaches to unsupervised clustering exist, they are not adapted to heterogeneous audiovisual data (face tracks vs. speech turns) and do not perform as well on challenging TV series and movie content as they do on other, more controlled data. Our general approach is therefore to first over-cluster the data, ensuring that clusters remain pure, before assigning names to these clusters in a second step. On top of domain-specific improvements for each modality, we expect joint multimodal clustering to take advantage of all three modalities and to improve clustering performance over each modality alone.
Unsupervised identification then aims at assigning character names to clusters in a completely automatic manner (i.e. using only information already present in the speech and video). In TV series and movies, character names are usually introduced and reiterated throughout the video. We will detect and exploit addresser-addressee relationships in both speech transcripts (using named entity detection techniques) and video (using mouth movements, viewing direction, and focus of attention of faces). This makes it possible to assign names to some clusters, learn discriminative models, and then assign names to the remaining clusters.
For evaluation, we will extend and further annotate a corpus of four TV series (57 episodes) and one movie series (8 movies), a total of about 50 hours of video. This diverse data covers different filming styles, types of stories, and challenges present in both video and audio. We will evaluate the different steps of this project on this corpus and make our annotations publicly available to other researchers in the field.
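As an illustration of the over-cluster-then-name strategy sketched in the abstract, the following minimal Python example (not the project's actual implementation; all function names, thresholds, and the toy data are illustrative assumptions) clusters embedding vectors of face tracks or speech turns with a conservative agglomerative distance threshold and then propagates weak name labels, as could be extracted from subtitles or transcripts, to the resulting clusters by majority vote.

```python
import numpy as np
from collections import Counter, defaultdict
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def over_cluster(embeddings, distance_threshold=0.3):
    """Agglomerative over-clustering: a conservative (low) threshold yields
    many small but pure clusters rather than a few large, mixed ones."""
    condensed = pdist(embeddings, metric="cosine")   # pairwise cosine distances
    tree = linkage(condensed, method="average")      # average-linkage dendrogram
    return fcluster(tree, t=distance_threshold, criterion="distance")


def propagate_weak_labels(cluster_ids, weak_labels, min_purity=0.6):
    """Name each cluster by majority vote over the weak labels attached to
    its tracks (None = no weak label available for that track)."""
    votes = defaultdict(Counter)
    for cid, name in zip(cluster_ids, weak_labels):
        if name is not None:
            votes[cid][name] += 1
    names = {}
    for cid, counter in votes.items():
        name, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_purity:  # keep only confident assignments
            names[cid] = name
    return names  # unnamed clusters are left for a later, model-based step


# Toy example: six track embeddings from two characters, with sparse weak labels.
tracks = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                   [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
weak = ["ALICE", None, "ALICE", None, "BOB", None]

cluster_ids = over_cluster(tracks)
print(propagate_weak_labels(cluster_ids, weak))  # e.g. {1: 'ALICE', 2: 'BOB'}
```

In the pipeline described above, clusters that remain unnamed after this weak-label step would subsequently be labelled by discriminative models trained on the already-named clusters.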
Project Outcomes
Journal articles (8)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Classification-Driven Dynamic Image Enhancement
- DOI: 10.1109/cvpr.2018.00424
- Publication date: 2017-10
- Journal:
- Impact factor: 0
- Authors: Vivek Sharma;Ali Diba;D. Neven;M. S. Brown;L. Gool;R. Stiefelhagen
- Corresponding author: Vivek Sharma;Ali Diba;D. Neven;M. S. Brown;L. Gool;R. Stiefelhagen
Self-supervised Face-Grouping on Graphs
- DOI: 10.1145/3343031.3351071
- Publication date: 2019-10
- Journal:
- Impact factor: 0
- Authors: Veith Röthlingshöfer;Vivek Sharma;R. Stiefelhagen
- Corresponding author: Veith Röthlingshöfer;Vivek Sharma;R. Stiefelhagen
Clustering based Contrastive Learning for Improving Face Representations
- DOI: 10.1109/fg47880.2020.00011
- Publication date: 2020-04
- Journal:
- Impact factor: 0
- Authors: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
- Corresponding author: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
Video Face Clustering With Self-Supervised Representation Learning
- DOI: 10.1109/tbiom.2019.2947264
- Publication date: 2020-04
- Journal:
- Impact factor: 0
- Authors: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
- Corresponding author: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
Self-Supervised Learning of Face Representations for Video Face Clustering
- DOI: 10.1109/fg.2019.8756609
- Publication date: 2019-03
- Journal:
- Impact factor: 0
- Authors: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
- Corresponding author: Vivek Sharma;Makarand Tapaswi;M. Sarfraz;R. Stiefelhagen
Other Publications by Professor Dr.-Ing. Rainer Stiefelhagen
Other Grants by Professor Dr.-Ing. Rainer Stiefelhagen
Automatic Alignment of Text-to-Video for Semantic Multimedia Analysis
- Grant number: 252286362
- Fiscal year: 2014
- Funding amount: --
- Project category: Research Grants
Similar Overseas Grants
Data-driven phenotyping of central disorders of hypersomnolence with unsupervised clustering: toward more reliable diagnostic criteria
- Grant number: 481046
- Fiscal year: 2023
- Funding amount: --
- Project category:
CRCNS Research Proposal: A Unified Framework for Unsupervised Sparse-to-dense Brain Image Generation and Neural Circuit Reconstruction
- Grant number: 2309073
- Fiscal year: 2023
- Funding amount: --
- Project category: Continuing Grant
FRR: Collaborative Research: Unsupervised Active Learning for Aquatic Robot Perception and Control
- Grant number: 2237577
- Fiscal year: 2023
- Funding amount: --
- Project category: Standard Grant
CAREER: Principled Unsupervised Learning via Minimum Volume Polytopic Embedding
- Grant number: 2237640
- Fiscal year: 2023
- Funding amount: --
- Project category: Continuing Grant
Knockoff Feature Selection Techniques for Robust Inference in Supervised and Unsupervised Learning
- Grant number: 2310955
- Fiscal year: 2023
- Funding amount: --
- Project category: Standard Grant
Unsupervised machine learning classification of ADHD subtype by urinary levels of tryptophan and monoamine neurotransmitters
- Grant number: 23K12814
- Fiscal year: 2023
- Funding amount: --
- Project category: Grant-in-Aid for Early-Career Scientists
Study of Human Statistical Biases on Unsupervised Parsing and Language Modeling
- Grant number: 23KJ0565
- Fiscal year: 2023
- Funding amount: --
- Project category: Grant-in-Aid for JSPS Fellows
Unsupervised Annotation of Complex 3D BioMedical Data
- Grant number: 2882348
- Fiscal year: 2023
- Funding amount: --
- Project category: Studentship
Using synthetic data and unsupervised learning methods for malware detection
- Grant number: 10076857
- Fiscal year: 2023
- Funding amount: --
- Project category: Collaborative R&D