III: Medium: Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"
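Basic information
- Award number: 1514053
- Principal investigator:
- Amount: $1 million
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2015
- Funding country: United States
- Project period: 2015-09-01 to 2020-08-31
- Project status: Completed
- Source:
- Keywords:
Project abstract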
Automated knowledge base (KB) construction from natural language is of fundamental importance to (a) scientists (for example, there has been long-standing interest in building KBs of genes and proteins), (b) social scientists (for example, building social networks from textual data), and (c) national defense (where network analysis of criminals and terrorists has proven useful). The core of a knowledge base is its objects ("entities", such as proteins, people, organizations and locations) and the connections between these objects ("relations", such as one protein increasing production of another, or a person working for an organization). This project aims to greatly increase the accuracy with which entity-relations can be extracted from text, as well as the fidelity with which many subtle distinctions among types of relations can be represented. The project's technical approach -- which we call "universal schema" -- is a markedly novel departure from traditional methods, based on representing all of the input relation expressions as positions in a common multi-dimensional space, with nearby relations having similar meanings. Broader impacts will include collaboration with industry on applications of economic importance, collaboration with academic non-computer-scientists on a multidisciplinary application, creating and publicly releasing new data sets for benchmark evaluation by ourselves and others (enabling scientific progress through improved performance comparisons), and creating and publicly releasing an open-source implementation of our methods (enabling further scientific research, easy large-scale use, rapid commercialization and third-party enhancements). Education impacts include creating and teaching a new course on knowledge base construction for the sciences, organizing a research workshop on embeddings, extraction and knowledge representation, and training multiple undergraduate and graduate students.
Most previous research in relation extraction falls into one of two categories. In the first, one must define a fixed schema of relation types in advance (such as lives-in, employed-by and a handful of others), which limits expressivity and hides language ambiguities. Training machine learning models here either relies on labeled training data (which is scarce and expensive), or uses lightly-supervised self-training procedures (which are often brittle and wander farther from the truth with additional iterations). In the second category, one extracts into an "open" schema based on language strings themselves (lacking the ability to generalize among them), or attempts to gain generalization with unsupervised clustering of these strings (suffering from clusters that fail to capture reliable synonyms, or even to find the desired semantics at all). This project proposes research on relation extraction with "universal schema", where we learn a generalizing model of the union of all input schemas, including multiple available pre-structured KBs as well as all the observed natural language surface forms. The approach thus embraces the diversity and ambiguity of original language surface forms (not trying to force relations into pre-defined boxes), yet also successfully generalizes by learning non-symmetric implicature among explicit and implicit relations using new extensions to the probabilistic matrix factorization and vector embedding methods that were so successful in the Netflix Prize competition. Universal schema provides for a nearly limitless diversity of relation types (due to surface forms), and supports convenient semi-supervised learning through integration with existing structured data (i.e., the relation types of existing databases). In preliminary experiments, the approach already surpassed by a wide margin the previous state-of-the-art relation extraction methods on a benchmark task.
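The matrix-factorization idea described above can be illustrated with a minimal sketch. It is not the project's implementation: the entity pairs, relation names, hyperparameters and the plain logistic loss with sampled negatives are all assumptions chosen for illustration. Rows are entity pairs, columns are the union of surface forms and KB relation types, and each observed cell pulls the corresponding pair and relation vectors together in a shared space.

```python
import numpy as np

# Hypothetical toy data: rows are entity pairs, columns are relations.
PAIRS = ["(Smith, Acme)", "(Lee, Initech)", "(Jones, Globex)",
         "(Paris, France)", "(Boston, USA)"]
RELATIONS = ["X works for Y",       # surface form
             "X is employed by Y",  # surface form
             "kb:employed-by",      # relation type from a structured KB
             "X is located in Y",   # surface form
             "kb:located-in"]       # relation type from a structured KB
# Observed (pair, relation) cells. Jones/Globex is seen only with "works for".
OBSERVED = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0),
            (3, 3), (3, 4), (4, 3), (4, 4)]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train(observed, n_pairs, n_relations, dim=8, epochs=200,
          lr=0.1, reg=0.01, seed=0):
    """Fit pair and relation embeddings by logistic matrix factorization."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_pairs, dim))
    R = rng.normal(scale=0.1, size=(n_relations, dim))
    seen = set(observed)
    # For each relation, the pairs never observed with it (negative candidates).
    unseen = {r: [p for p in range(n_pairs) if (p, r) not in seen]
              for r in range(n_relations)}
    for _ in range(epochs):
        for p, r in observed:
            # One positive update, plus one sampled unobserved cell as a negative.
            updates = [(p, 1.0)]
            if unseen[r]:
                updates.append((int(rng.choice(unseen[r])), 0.0))
            for row, label in updates:
                err = sigmoid(P[row] @ R[r]) - label
                p_old = P[row].copy()
                P[row] -= lr * (err * R[r] + reg * P[row])
                R[r] -= lr * (err * p_old + reg * R[r])
    return P, R


def score(P, R, pair, relation):
    """Estimated probability that the relation holds for the entity pair."""
    return float(sigmoid(P[pair] @ R[relation]))


P, R = train(OBSERVED, len(PAIRS), len(RELATIONS))
for rel in (2, 4):
    print(f"{PAIRS[2]} / {RELATIONS[rel]}: {score(P, R, 2, rel):.3f}")
```

The final loop compares two cells that were never observed for the pair seen only with "X works for Y": the KB relation `kb:employed-by` and the unrelated `kb:located-in`. This toy treats every unobserved cell as a potential negative and does not model the non-symmetric implicature, multiple senses or higher-order relations that the abstract proposes.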
New proposed research includes new training processes, new representations that include multiple senses for the same surface form as well as embeddings with variances, new methods of incorporating constraints, joint inference between entity- and relation-types, new models of non-binary and higher-order relations, and scalability through parallel distribution. The project web site (http://www.iesl.cs.umass.edu/projects/NSF_USchema.html) will include information on the project and provide access to data sets, source code and documentation, teaching and workshop materials, and publications. In addition, datasets will be disseminated via the UCI Machine Learning Repository (or another similar archive location for machine learning data) to facilitate sharing with other researchers and ensure long-term availability, and GitHub will be used to facilitate release, sharing, and archiving of code.
Project outcomes
Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Andrew McCallum
An Interoperable Multimedia Catalog System for Electronic Commerce.
- DOI:
- Published: 2000
- Journal:
- Impact factor: 0
- Authors: William W. Cohen; Andrew McCallum; D. Quass
- Corresponding author: D. Quass
Scaling Within Document Coreference to Long Texts
- DOI:
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Raghuveer Thirukovalluru; Nicholas Monath; K. Shridhar; M. Zaheer; Mrinmaya Sachan; Andrew McCallum
- Corresponding author: Andrew McCallum
ezCoref: A Scalable Approach for Collecting Crowdsourced Annotations for Coreference Resolution
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: A. Crowdsourced; David Bamman; Olivia Lewke; Rachel Bawden; Rico Sennrich; Alexandra Birch; Ari Bornstein; Arie Cattan; Ido Dagan; Hong Chen; Zhenhua Fan; Hao Lu; Alan Yuille; Eduard Hovy; Mitch Marcus; M. Palmer; Lance; Rodney Huddleston. 2002; Frédéric Landragin; T. Poibeau; Bernard Vic; Belinda Z. Li; Gabriel Stanovsky; Robert L Logan; Andrew McCallum; Sameer Singh
- Corresponding author: Sameer Singh
PaRaDe: Passage Ranking using Demonstrations with Large Language Models
- DOI: 10.48550/arxiv.2310.14408
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Andrew Drozdov; Honglei Zhuang; Zhuyun Dai; Zhen Qin; Razieh Rahimi; Xuanhui Wang; Dana Alon; Mohit Iyyer; Andrew McCallum; Donald Metzler; Kai Hui
- Corresponding author: Kai Hui
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Qi Cheng; Michael Boratko; Pranay Kumar Yelugam; T. O’Gorman; Nalini Singh; Andrew McCallum; X. Li
- Corresponding author: X. Li
Other grants held by Andrew McCallum
Collaborative Research: SOS-DCI / HNDS-R: Advancing Semantic Network Analysis to Better Understand How Evaluative Exchanges Shape Scientific Arguments
- Award number: 2244805
- Fiscal year: 2023
- Amount: $1 million
- Project type: Standard Grant
RI: Medium: Probabilistic Box Embeddings
- Award number: 2106391
- Fiscal year: 2021
- Amount: $1 million
- Project type: Standard Grant
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials
- Award number: 1922090
- Fiscal year: 2019
- Amount: $1 million
- Project type: Standard Grant
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials
- Award number: 1534431
- Fiscal year: 2015
- Amount: $1 million
- Project type: Standard Grant
The Fourth Northeast Student Colloquium on Artificial Intelligence
- Award number: 1036017
- Fiscal year: 2010
- Amount: $1 million
- Project type: Standard Grant
CI-ADDO-EN: Flexible Machine Learning for Natural Language in the MALLET Toolkit
- Award number: 0958392
- Fiscal year: 2010
- Amount: $1 million
- Project type: Continuing Grant
RI-Medium: Collaborative Research: Dynamically-Structured Conditional Random Fields for Complex, Natural Domains
- Award number: 0803847
- Fiscal year: 2008
- Amount: $1 million
- Project type: Continuing Grant
CRI: Collaborative Research: Improving Experimental Computer Science with a Searchable Web Portal for Data Sets
- Award number: 0551597
- Fiscal year: 2006
- Amount: $1 million
- Project type: Continuing Grant
ITR: Collaborative Research: (ACS+NHS)-(dmc+soc): Machine Learning for Sequences and Structured Data: Tools for Non-Experts
- Award number: 0427594
- Fiscal year: 2004
- Amount: $1 million
- Project type: Standard Grant
Similar overseas grants
RII Track-4:@NASA: Bluer and Hotter: From Ultraviolet to X-ray Diagnostics of the Circumgalactic Medium
- Award number: 2327438
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: Topological Defects and Dynamic Motion of Symmetry-breaking Tadpole Particles in Liquid Crystal Medium
- Award number: 2344489
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: AF: Medium: The Communication Cost of Distributed Computation
- Award number: 2402836
- Fiscal year: 2024
- Amount: $1 million
- Project type: Continuing Grant
Collaborative Research: AF: Medium: Foundations of Oblivious Reconfigurable Networks
- Award number: 2402851
- Fiscal year: 2024
- Amount: $1 million
- Project type: Continuing Grant
Collaborative Research: CIF: Medium: Snapshot Computational Imaging with Metaoptics
- Award number: 2403122
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
- Award number: 2321102
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Transforming the Molecular Science Research Workforce through Integration of Programming in University Curricula
- Award number: 2321045
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
- Award number: 2321103
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant
Collaborative Research: CPS: Medium: Automating Complex Therapeutic Loops with Conflicts in Medical Cyber-Physical Systems
- Award number: 2322534
- Fiscal year: 2024
- Amount: $1 million
- Project type: Standard Grant