III: Small: Accessible and Interpretable Machine Reading Methods for Extracting Structured Information from Text

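Basic information

Award number:
2006583
Principal investigator:
Mihai Surdeanu
Amount:
$499,900
Institution:
University of Arizona
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2020
Funding country:
United States
Period:
2020-07-15 to 2024-06-30
Status:
Completed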

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2006583&HistoricalAwards=false
Keywords:
III Small Accessible Interpretable Machine

Project abstract

Computers, the Internet, and cheap storage promote the acquisition and collection of vast quantities of data. There is a seemingly infinite supply of text documents which contain critical scientific, socio-political, and business insights – far more than can be read by a human. Within the natural language processing (NLP) domain, the field of information extraction (IE) targets exactly this problem, but it requires its practitioners to have expertise either in linguistics, machine learning, or both. Consequently, the majority of the advancements in the field of IE are difficult to access by domain experts such as epidemiologists, biologists, and economists. This project will empower these domain experts to develop and deploy IE systems targeting their own particular needs without requiring expertise in NLP, linguistics, or machine learning, which, in turn, will dramatically impact the process, pace, and productivity of conducting critical scientific research and collaboration, as experts could have far more ready access to the knowledge most essential to them and their research (both in their domain and adjacent domains). The products of this work will be shared across the scientific community through a series of outreach efforts such as video courses, publications, and a workshop at a high-visibility conference. To broaden participation, outreach activities (including deepening collaborations with institutional colleagues and local community outreach) will be done with an emphasis on groups who are historically underrepresented in academia.
The planned work will be accomplished through a human-technology partnership, where domain experts specify their information need at the level they find intuitive (e.g., phosphorylation acts on proteins). The system will then extend techniques from the adjacent field of program synthesis to convert these high-level, abstract specifications into low-level grammars (i.e., sets of hierarchical information extraction rules) which can be executed in order to extract the desired information from text. Crucially, the specification requires no linguistic knowledge, making it accessible to a broader population. The need for domain-specific entities (e.g., names of proteins) will be addressed through an entity discovery procedure that incorporates techniques for detecting multi-word entity candidates and inferring their semantic types (e.g., PROTEIN). To ensure that the product of the system is readily interpretable and easily extensible, a series of user studies will be conducted to discover the key characteristics of rules and grammars that affect their interpretability and maintainability. Through this combined effort, several datasets and software products will be produced and made available to the wider community. This includes (but is not limited to) (a) a dataset of event specifications and the corresponding automatically synthesized rules for several domains, (b) a dataset of human judgements of grammar interpretability, and (c) models which can serve as automatic proxies for the more expensive human evaluation of interpretability. All data will be anonymized and released under the Open Data Commons Public Domain Dedication & License, which allows users to freely share, modify, and use this data, in the hope that this effort will be exploited further. To ensure as wide an audience as possible, the software and techniques developed in this work, including the rule synthesis framework, a pipeline for entity discovery, and any generated user interfaces, will be released as open-source software products (under an Apache 2.0 open source license).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
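
As a rough illustration of what the abstract means by a low-level rule that "can be executed in order to extract the desired information from text", the sketch below runs one hand-written rule for the abstract's example (phosphorylation acts on proteins). It is not the project's software: the `Rule` class, the `extract` function, the `PROTEINS` lexicon, and the trigger pattern are all hypothetical names chosen for this example, and the rule is written by hand rather than synthesized.

```python
import re
from dataclasses import dataclass

# Hypothetical lexicon standing in for the output of an entity discovery step.
PROTEINS = {"MEK", "ERK", "AKT"}


@dataclass
class Rule:
    """A toy surface rule of the form: <PROTEIN> <trigger> <PROTEIN>."""
    label: str
    trigger: str  # regular expression matched against a single token


def extract(rule, sentence):
    """Return (label, agent, theme) tuples matched by the rule in the sentence."""
    tokens = sentence.rstrip(".").split()
    events = []
    # Look at every token that has both a left and a right neighbor.
    for i in range(1, len(tokens) - 1):
        if (re.fullmatch(rule.trigger, tokens[i])
                and tokens[i - 1] in PROTEINS
                and tokens[i + 1] in PROTEINS):
            events.append((rule.label, tokens[i - 1], tokens[i + 1]))
    return events


# Hand-written rule for the illustrative specification "phosphorylation acts on proteins".
rule = Rule(label="Phosphorylation", trigger=r"phosphorylat(es|ed)")
print(extract(rule, "MEK phosphorylates ERK."))
```

A rule in this form stays readable to a domain expert, which is the property the planned user studies on interpretability are concerned with; the synthesis step described in the abstract would aim to produce such rules from the expert's specification instead of requiring them to be written manually.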

Project outcomes

Journal articles (9)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision

DOI:
Publication date:
2022
Journal:
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Impact factor:
0
Authors:
Noriega-Atala, Enrique; Vacareanu, Robert; Hahn-Powell, Gus; Valenzuela-Escárcega, Marco A.
Corresponding author:
Valenzuela-Escárcega, Marco A.

Bootstrapping Neural Relation and Explanation Classifiers

DOI:
Publication date:
2023
Journal:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL
Impact factor:
0
Authors:
Zheng, Tang; Surdeanu, Mihai
Corresponding author:
Surdeanu, Mihai

From Examples to Rules: Neural Guided Rule Synthesis for Information Extraction

DOI:
Publication date:
2022-01
Journal:
ArXiv
Impact factor:
0
Authors:
Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu
Corresponding author:
Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu

Syntax-driven Data Augmentation for Named Entity Recognition

DOI:
Publication date:
2022
Journal:
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Impact factor:
0
Authors:
Sutiono, Arie; Hahn-Powell, Gus
Corresponding author:
Hahn-Powell, Gus

Do Transformer Networks Improve the Discovery of Inference Rules from Text?

DOI:
Publication date:
2022
Journal:
LREC proceedings
Impact factor:
0
Authors:
Rahimi, Mahdi; Surdeanu, Mihai
Corresponding author:
Surdeanu, Mihai

Other publications by Mihai Surdeanu

Information Extraction from Legal Wills: How Well Does GPT-4 Do?

DOI:
Publication date:
2023
Journal:
Conference on Empirical Methods in Natural Language Processing
Impact factor:
0
Authors:
A. Kwak; Cheonkam Jeong; Gaetano Forte; Derek E. Bambauer; Clayton T. Morrison; Mihai Surdeanu
Corresponding author:
Mihai Surdeanu

On Learning Bipolar Gradual Argumentation Semantics with Neural Networks

DOI:
10.5220/0012448300003636
Publication date:
2024
Journal:
Impact factor:
0
Authors:
Caren Al Anaissy; Sandeep Suntwal; Mihai Surdeanu; Srdjan Vesic
Corresponding author:
Srdjan Vesic

Retrieval Augmented Generation of Subjective Explanations for Socioeconomic Scenarios

DOI:
Publication date:
2024
Journal:
NLPCSS
Impact factor:
0
Authors:
Razvan; Maria Alexeeva; K. Alcock; Nargiza Ludgate; Cheonkam Jeong; Zara Fatima Abdurahaman; Prateek Puri; Brian Kirchhoff; Santadarshan Sadhu; Mihai Surdeanu
Corresponding author:
Mihai Surdeanu