III: Medium: Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"

Basic Information

Award number:
1514053
Principal investigator:
Andrew McCallum
Amount:
$1 million
Host institution:
University of Massachusetts Amherst
Host institution country:
United States
Project type:
Continuing Grant
Fiscal year:
2015
Funding country:
United States
Period:
2015-09-01 to 2020-08-31
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1514053&HistoricalAwards=false
Keywords:
III Medium Constructing Knowledge Bases

Project Abstract

Automated knowledge base (KB) construction from natural language is of fundamental importance to (a) scientists (for example, there has been long-standing interest in building KBs of genes and proteins), (b) social scientists (for example, building social networks from textual data), and (c) national defense (where network analysis of criminals and terrorists has proven useful). The core of a knowledge base is its objects ("entities", such as proteins, people, organizations and locations) and the connections between these objects ("relations", such as one protein increasing production of another, or a person working for an organization). This project aims to greatly increase the accuracy with which entity-relations can be extracted from text, as well as the fidelity with which many subtle distinctions among types of relations can be represented. The project's technical approach -- which we call "universal schema" -- is a markedly novel departure from traditional methods, based on representing all of the input relation expressions as positions in a common multi-dimensional space, with nearby relations having similar meanings. Broader impacts will include collaboration with industry on applications of economic importance, collaboration with academic non-computer-scientists on a multidisciplinary application, creating and publicly releasing new data sets for benchmark evaluation by ourselves and others (enabling scientific progress through improved performance comparisons), and creating and publicly releasing an open-source implementation of our methods (enabling further scientific research, easy large-scale use, rapid commercialization and third-party enhancements). Education impacts include creating and teaching a new course on knowledge base construction for the sciences, organizing a research workshop on embeddings, extraction and knowledge representation, and training multiple undergraduate and graduate students.
Most previous research in relation extraction falls into one of two categories. In the first, one must define a fixed, pre-defined schema of relation types (such as lives-in, employed-by and a handful of others), which limits expressivity and hides language ambiguities. Training machine learning models here either relies on labeled training data (which is scarce and expensive), or uses lightly-supervised self-training procedures (which are often brittle and wander farther from the truth with additional iterations). In the second category, one extracts into an "open" schema based on language strings themselves (lacking the ability to generalize among them), or attempts to gain generalization with unsupervised clustering of these strings (suffering from clusters that fail to capture reliable synonyms, or that fail to find the desired semantics at all). This project proposes research in relation extraction with "universal schema", where we learn a generalizing model of the union of all input schemas, including multiple available pre-structured KBs as well as all the observed natural language surface forms. The approach thus embraces the diversity and ambiguity of original language surface forms (not trying to force relations into pre-defined boxes), yet also successfully generalizes by learning non-symmetric implicature among explicit and implicit relations using new extensions to the probabilistic matrix factorization and vector embedding methods that were so successful in the Netflix Prize competition. Universal schema provides for a nearly limitless diversity of relation types (due to surface forms), and supports convenient semi-supervised learning through integration with existing structured data (i.e., the relation types of existing databases). In preliminary experiments, the approach has already surpassed the previous state-of-the-art relation extraction methods by a wide margin on a benchmark task.
Newly proposed research includes new training processes, new representations that include multiple senses for the same surface form as well as embeddings with variances, new methods of incorporating constraints, joint inference between entity- and relation-types, new models of non-binary and higher-order relations, and scalability through parallel distribution. The project web site (http://www.iesl.cs.umass.edu/projects/NSF_USchema.html) will include information on the project and provide access to data sets, source code and documentation, teaching and workshop materials, and publications. In addition, datasets will be disseminated via the UCI Machine Learning Repository (or another similar archive location for machine learning data) to facilitate sharing with other researchers and ensure long-term availability, and GitHub will be used to facilitate release, sharing, and archiving of code.
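
The abstract describes universal schema as a factorization over the union of all relation types, where textual surface forms and KB relations share one embedding space. The toy sketch below illustrates that idea only and is not the project's released implementation. Every entity pair, pattern string, function name and hyperparameter in it is a hypothetical chosen for illustration. It also treats unobserved cells as negatives under a plain logistic loss, which is a simplification and not necessarily the training objective used in the project.

```python
import math
import random

# Toy observations (all names hypothetical). Each entity pair maps to the
# textual surface patterns and KB relations it has been seen with.
OBSERVED = {
    ("alice", "acme"): {"X works for Y", "kb:employed_by"},
    ("bob", "globex"): {"X is an engineer at Y", "kb:employed_by"},
    ("grace", "initech"): {"X works for Y", "X is an engineer at Y",
                           "kb:employed_by"},
    ("carol", "hooli"): {"X is an engineer at Y"},
    ("dave", "paris"): {"X lives in Y", "kb:lives_in"},
    ("erin", "rome"): {"X moved to Y", "kb:lives_in"},
    ("heidi", "oslo"): {"X lives in Y", "X moved to Y", "kb:lives_in"},
    ("frank", "lima"): {"X moved to Y"},
}

# Cells we want the model to predict; they are excluded from training.
HELD_OUT = {
    (("carol", "hooli"), "kb:employed_by"),
    (("frank", "lima"), "kb:lives_in"),
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(observed, held_out, dim=4, epochs=500, lr=0.1, seed=0):
    """Fit one vector per entity pair (row) and per relation (column)."""
    rng = random.Random(seed)
    pairs = sorted(observed)
    relations = sorted({r for rels in observed.values() for r in rels})
    pair_vec = {p: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for p in pairs}
    rel_vec = {r: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for r in relations}
    # Every (pair, relation) cell except the held-out ones is a training cell.
    cells = [(p, r) for p in pairs for r in relations if (p, r) not in held_out]
    for _ in range(epochs):
        rng.shuffle(cells)
        for p, r in cells:
            target = 1.0 if r in observed[p] else 0.0
            u, v = pair_vec[p], rel_vec[r]
            err = target - sigmoid(sum(a * b for a, b in zip(u, v)))
            # Stochastic gradient step on the logistic loss for this cell.
            for k in range(dim):
                u_k = u[k]
                u[k] += lr * err * v[k]
                v[k] += lr * err * u_k
    return pair_vec, rel_vec


def score(pair_vec, rel_vec, pair, relation):
    """Predicted probability that `relation` holds for `pair`."""
    dot = sum(a * b for a, b in zip(pair_vec[pair], rel_vec[relation]))
    return sigmoid(dot)


pair_vec, rel_vec = train(OBSERVED, HELD_OUT)
for pair, relation in sorted(HELD_OUT):
    print(pair, relation, round(score(pair_vec, rel_vec, pair, relation), 3))
```

In this sketch the pair ("carol", "hooli") is never labeled with kb:employed_by, but it shares the pattern "X is an engineer at Y" with pairs that are, so the learned vectors should rank kb:employed_by above kb:lives_in for it. Surface patterns and KB relations are simply columns of the same matrix, which is what lets evidence flow between free text and structured data.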

Project Outcomes

Journal papers (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Andrew McCallum

An Interoperable Multimedia Catalog System for Electronic Commerce.

DOI:
Year published:
2000
Journal:
Impact factor:
0
Authors:
William W. Cohen;Andrew McCallum;D. Quass
Corresponding author:
D. Quass

Scaling Within Document Coreference to Long Texts

DOI:
Year published:
2021
Journal:
Findings
Impact factor:
0
Authors:
Raghuveer Thirukovalluru;Nicholas Monath;K. Shridhar;M. Zaheer;Mrinmaya Sachan;Andrew McCallum
Corresponding author:
Andrew McCallum

ezCoref: A Scalable Approach for Collecting Crowdsourced Annotations for Coreference Resolution

DOI:
Year published:
2022
Journal:
Impact factor:
0
Authors:
A. Crowdsourced;David Bamman;Olivia Lewke;Rachel Bawden;Rico Sennrich;Alexandra Birch;Ari Bornstein;Arie Cattan;Ido Dagan;Hong Chen;Zhenhua Fan;Hao Lu;Alan Yuille;Eduard Hovy;Mitch Marcus;M. Palmer;Lance;Rodney Huddleston. 2002;Frédéric Landragin;T. Poibeau;Bernard Vic;Belinda Z. Li;Gabriel Stanovsky;Robert L Logan;Andrew McCallum;Sameer Singh
Corresponding author:
Sameer Singh

PaRaDe: Passage Ranking using Demonstrations with Large Language Models

DOI:
10.48550/arxiv.2310.14408
Year published:
2023
Journal:
ArXiv
Impact factor:
0
Authors:
Andrew Drozdov;Honglei Zhuang;Zhuyun Dai;Zhen Qin;Razieh Rahimi;Xuanhui Wang;Dana Alon;Mohit Iyyer;Andrew McCallum;Donald Metzler;Kai Hui
Corresponding author:
Kai Hui