Lodie, Web Scale Information Extraction via Linked Open Data

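Basic information

Grant number:
EP/J019488/1
Principal investigator:
Fabio Ciravegna
Amount:
$688,700
Host institution:
University of Sheffield
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2012
Funding country:
United Kingdom
Start and end dates:
2012 to (no data)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=EP%2FJ019488%2F1
Keywords:
Lodie Web Scale Information Extraction

Project abstract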

The World Wide Web provides access to tens of billions of pages. These pages contain information that is largely unstructured and intended only for human readers; however, we rely on computers "reading" these pages to find the information we need. The proposed research intends to develop technologies to radically improve the billions of searches performed every day by fulfilling Tim Berners-Lee's initial vision of a Web where webpage content is readable by both humans and machines. This vision, disregarded during the initial development of the Web, has now come back in the form of the Web of Data, or Linked Open Data (LOD), where billions of pieces of information are linked together and made available for automated processing. There is, however, a lack of interconnection between the information in webpages and that in LOD. A number of initiatives, like RDFa (supported by W3C) or Microformats (used by schema.org and supported by major search engines), are trying to enable machines to make sense of the information contained in human-readable pages by providing the ability to annotate webpage content with links into LOD.

While the current state of the art in Web Information Extraction (IE) relies on domain-specific training data or generic extraction patterns, the proposed research leverages LOD to develop IE methodologies and technologies that provide pervasive, user-driven, Web-scale information extraction, where the target of the IE is defined by the user's information needs and aimed at the billions of available Web documents covering an unlimited number of domains.

In this research we aim to develop models and algorithms to create a continuum between LOD and the human-readable Web. The approach will utilise the wealth of facts available from LOD and the limited number of pages annotated with RDFa/Microformats to learn to connect unannotated webpage content to the LOD cloud. This will provide the reciprocal advantages of (i) enabling the search of Web pages via the unambiguous LOD instances and concepts, and (ii) extending LOD with the wealth of information available from webpage content.

The key challenge is the development of efficient, Web-scale, semi-supervised, iterative learning methods able to use the initial "seed" data and annotations by generating models which exploit: (i) local and global information regularities (e.g. structured information in tables, as well as page- and site-wide regularities); (ii) the redundancy (or repetition) of information; (iii) any ontological restrictions available in LOD. As the learning methods iterate from known interconnections to infer new connections, they must cope with the massive amount of noise generated by the number and variety of documents, domains and facts available.

Beyond publication of the research and its findings, the IE methods developed will be tested on the task of extracting information relevant to schema.org (a task currently promoted by large search engine companies such as Google and Bing) as well as in international public evaluations. As part of such evaluations, the project will generate at least one publicly available, Web-scale IE task (inclusive of corpora, linked resources, etc.) to enable comparison of research results by other researchers.

The project aims to impact the fields of Natural Language Processing, Machine Learning, Information Retrieval, and Web and Semantic Technologies by exploring the extraction of information in Web-scale, user-driven tasks. Success in the project will enable new ways of both creating and using LOD and will provide a paradigm shift in the way information can be retrieved from the Web: away from a reliance on keywords and towards the search and exploration of the concepts and meaning (semantics) embedded in those words.
