Learning to Recognise Dynamic Visual Content from Broadcast Footage

Basic Information

  • Grant Number:
    EP/I011811/1
  • Principal Investigator:
  • Amount:
    $624,100
  • Host Institution:
  • Host Institution Country:
    United Kingdom
  • Project Type:
    Research Grant
  • Fiscal Year:
    2011
  • Funding Country:
    United Kingdom
  • Duration:
    2011 to (no data)
  • Project Status:
    Completed

Project Abstract

This research is in the area of computer vision - making computers which can understand what is happening in photographs and video. As humans we are fascinated by other humans, and we capture endless images of their activities, for example home movies of our family on holiday, video of sports events, or CCTV footage of people in a town centre. A computer capable of understanding what people are doing in such images could do many jobs for us, for example finding clips of our children waving, fast-forwarding to a goal in a football game, or spotting when someone starts a fight in the street. For Deaf people, who use a language combining hand gestures with facial expression and body language, a computer which could visually understand their actions would allow them to communicate in their native language. While humans are very good at understanding what people are doing (and can learn to understand special actions such as sign language), this has proved extremely challenging for computers.

Much work has tried to solve this problem, and works well in particular settings: for example, the computer can tell if a person is walking so long as they do it clearly and face to the side, or it can understand a few sign language gestures as long as the signer cooperates and signs slowly. We will investigate better models for recognising activities by teaching the computer with many example videos. To make sure our method works well for all kinds of settings, we will use real-world video from movies and TV. For each video we have to tell the computer what it represents, for example "throwing a ball" or "a man hugging a woman". It would be expensive to collect and label lots of videos in this way, so instead we will extract approximate labels automatically from the subtitle text and scripts which are available for TV. Our new methods will combine learning from lots of approximately labelled video (cheap, because we get the labels automatically), the use of contextual information such as which actions people do at the same time or how one action leads to another (he hits the man, who falls to the floor), and computer vision methods for understanding the pose of a person (how they are standing), how they are moving, and the objects which they are using.

By having lots of video to learn from, and methods for making use of approximate labels, we will be able to build stronger and more flexible models of human activities. This will lead to recognition methods which work better in the real world and contribute to applications such as interpreting sign language and automatically tagging video with its content.
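As an illustration of the kind of weakly supervised learning the abstract describes, the sketch below trains an action classifier from "bags" of candidate clips, where only the bag (a subtitle-aligned time window) carries an approximate label. This is a minimal multiple-instance-learning sketch on synthetic features, not the project's actual method; the bag construction, the MI-SVM-style alternation, and all function names are illustrative assumptions.

```python
# Minimal sketch (not the project's actual method): multiple-instance learning
# from approximate, subtitle-derived labels. Each subtitle mention of an action
# is assumed to yield a *bag* of candidate clip features from the surrounding
# time window; the action occurs in at least one clip of a positive bag, but we
# do not know which one. Features, bag construction and the alternation below
# are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def make_toy_bags(n_pos=30, n_neg=30, bag_size=5, dim=16):
    """Synthetic bags: each positive bag hides one 'action' clip among background clips."""
    action_dir = rng.normal(size=dim)
    bags, labels = [], []
    for _ in range(n_pos):
        bag = rng.normal(size=(bag_size, dim))
        bag[rng.integers(bag_size)] += 2.0 * action_dir  # the one true action clip
        bags.append(bag)
        labels.append(1)
    for _ in range(n_neg):
        bags.append(rng.normal(size=(bag_size, dim)))
        labels.append(0)
    return bags, np.array(labels)

def train_logistic(X, y, lr=0.1, epochs=200):
    """Plain logistic regression by gradient descent (stand-in for any clip classifier)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def mil_train(bags, bag_labels, iters=5):
    """MI-SVM-style alternation: keep only the most confident clip of each positive
    bag as a positive example, treat every clip of a negative bag as negative,
    retrain, and repeat."""
    X = np.vstack(bags)
    # Initialise instance labels: every clip inherits its bag's approximate label.
    y = np.concatenate([np.full(len(bg), lbl, dtype=float)
                        for bg, lbl in zip(bags, bag_labels)])
    for _ in range(iters):
        w, b = train_logistic(X, y)
        relabelled = []
        for bag, lbl in zip(bags, bag_labels):
            lab = np.zeros(len(bag))
            if lbl == 1:
                lab[np.argmax(bag @ w + b)] = 1.0  # most confident clip stays positive
            relabelled.append(lab)
        y = np.concatenate(relabelled)
    return w, b

bags, labels = make_toy_bags()
w, b = mil_train(bags, labels)
# A bag is predicted positive if its best-scoring clip is above the decision boundary.
pred = np.array([int((bag @ w + b).max() > 0) for bag in bags])
print("bag-level accuracy on toy data:", (pred == labels).mean())
```

In a real subtitle-mining setting, the synthetic bags would be replaced by clip descriptors extracted around each subtitle or script mention of an action word; the alternation then lets the classifier discover which clip in the window actually shows the action.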

Project Outcomes

Number of journal articles (10)
Number of monographs (0)
Number of research awards (0)
Number of conference papers (0)
Number of patents (0)
Geometric Mining: Scaling Geometric Hashing to Large Datasets
  • DOI:
    10.1109/iccvw.2015.135
  • Publication Date:
    2015
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Gilbert A
  • Corresponding Author:
    Gilbert A
Sign Language Recognition using Sequential Pattern Trees
  • DOI:
    10.1109/cvpr.2012.6247928
  • Publication Date:
    2012
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Eng-Jon Ong
  • Corresponding Author:
    Eng-Jon Ong
Scene Flow Estimation using Intelligent Cost Functions
  • DOI:
    10.5244/c.28.108
  • Publication Date:
    2014-09
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Simon Hadfield;R. Bowden
  • Corresponding Author:
    Simon Hadfield;R. Bowden
Image and video mining through online learning
  • DOI:
    10.1016/j.cviu.2017.02.001
  • Publication Date:
    2016-09
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Andrew Gilbert;R. Bowden
  • Corresponding Author:
    Andrew Gilbert;R. Bowden
Scene particles: unregularized particle-based scene flow estimation.

Other Publications by Richard Bowden

Learning multi-kernel distance functions using relative comparisons
  • DOI:
    10.1016/j.patcog.2005.05.011
  • Publication Date:
    2005-12-01
  • Journal:
  • Impact Factor:
  • Authors:
    Eng-Jon Ong;Richard Bowden
  • Corresponding Author:
    Richard Bowden
An Experimental Study of Gridded and Virtual Cathode Inertial Electrostatic Confinement Fusion Systems
  • DOI:
  • Publication Date:
    2019
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Richard Bowden
  • Corresponding Author:
    Richard Bowden
Langmuir probe measurements of the secondary electron population across the cathodic pre-sheath of a DC argon discharge
  • DOI:
    10.1063/5.0130291
  • Publication Date:
    2023
  • Journal:
  • Impact Factor:
    2.2
  • Authors:
    Nicholas Ranson;Richard Bowden;J. Khachan;N. Claire
  • Corresponding Author:
    N. Claire
Electric potential in a magnetically confined virtual cathode fusion device
  • DOI:
    10.1063/5.0040792
  • Publication Date:
    2021
  • Journal:
  • Impact Factor:
    2.2
  • Authors:
    Richard Bowden;J. Khachan
  • Corresponding Author:
    J. Khachan
Supplementary Material – Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV
  • DOI:
  • Publication Date:
  • Journal:
  • Impact Factor:
    0
  • Authors:
    Jaime Spencer;Chris Russell;Simon Hadfield;Richard Bowden
  • Corresponding Author:
    Richard Bowden

Other Grants by Richard Bowden

ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach
  • Grant Number:
    EP/S016317/1
  • Fiscal Year:
    2019
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
ExTOL: End to End Translation of British Sign Language
  • Grant Number:
    EP/R03298X/1
  • Fiscal Year:
    2018
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
LILiR2 - Language Independent Lip Reading
  • Grant Number:
    EP/E027946/1
  • Fiscal Year:
    2007
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
LTER Cross-site: Collaborative Research: DIRT: A Cross-continental, Experimental Study of Forest Soil Organic Matter and Nitrogen Dynamics
  • Grant Number:
    0087010
  • Fiscal Year:
    2000
  • Funding Amount:
    $624,100
  • Project Type:
    Standard Grant
Improvement of Soil and Ecosystem Analysis Laboratory
  • Grant Number:
    9151189
  • Fiscal Year:
    1991
  • Funding Amount:
    $624,100
  • Project Type:
    Standard Grant

Similar Overseas Grants

Using population genomics to recognise and prevent the emergence and establishment of Klebsiella pneumoniae high-risk lineages
  • Grant Number:
    2665127
  • Fiscal Year:
    2021
  • Funding Amount:
    $624,100
  • Project Type:
    Studentship
Using a combined machine learning and bio-automation approach to understand, recognise and control the bacterial stress landscape
  • Grant Number:
    2281125
  • Fiscal Year:
    2019
  • Funding Amount:
    $624,100
  • Project Type:
    Studentship
A novel vision system with unique algorithms to recognise, count & size apples on trees to greatly improve crop forecasting and management to maximise yield and optimise market price - Applecount -
  • Grant Number:
    710522
  • Fiscal Year:
    2014
  • Funding Amount:
    $624,100
  • Project Type:
    GRD Proof of Concept
How can we help parents recognise unhealthy body weight in their children?
  • Grant Number:
    MR/J00054X/1
  • Fiscal Year:
    2012
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
Learning to Recognise Dynamic Visual Content from Broadcast Footage
  • Grant Number:
    EP/I012001/1
  • Fiscal Year:
    2011
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
Learning to Recognise Dynamic Visual Content from Broadcast Footage
  • Grant Number:
    EP/I01229X/1
  • Fiscal Year:
    2011
  • Funding Amount:
    $624,100
  • Project Type:
    Research Grant
Understanding the Molecular Basis of Epididymal Maturation: How Does the Epididymis Modify Spermatozoa, Allowing them to Recognise the Egg?
  • Grant Number:
    nhmrc : 1010753
  • Fiscal Year:
    2011
  • Funding Amount:
    $624,100
  • Project Type:
    Project Grants
Using fossil insects and plants to recognise past human impacts on Pacific island biodiversity
  • Grant Number:
    DP0878694
  • Fiscal Year:
    2008
  • Funding Amount:
    $624,100
  • Project Type:
    Discovery Projects
How we recognise the orientation of objects: a combined neuropsychological / eye movement study
  • Grant Number:
    DP0211342
  • Fiscal Year:
    2002
  • Funding Amount:
    $624,100
  • Project Type:
    Discovery Projects
How we recognise the orientation of objects: a combined neuropsychological / eye movement study
  • Grant Number:
    ARC : DP0211342
  • Fiscal Year:
    2002
  • Funding Amount:
    $624,100
  • Project Type:
    Discovery Projects