III: Small: Accessible and Interpretable Machine Reading Methods for Extracting Structured Information from Text
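Basic information
- Award number: 2006583
- Principal investigator:
- Amount: $499,900
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2020
- Funding country: United States
- Project period: 2020-07-15 to 2024-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract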
Computers, the Internet, and cheap storage promote the acquisition and collection of vast quantities of data. There is a seemingly infinite supply of text documents that contain critical scientific, socio-political, and business insights – far more than can be read by a human. Within the natural language processing (NLP) domain, the field of information extraction (IE) targets exactly this problem, but it requires its practitioners to have expertise in linguistics, machine learning, or both. Consequently, the majority of the advancements in the field of IE are difficult for domain experts such as epidemiologists, biologists, and economists to access. This project will empower these domain experts to develop and deploy IE systems targeting their own particular needs without requiring expertise in NLP, linguistics, or machine learning, which, in turn, will dramatically impact the process, pace, and productivity of conducting critical scientific research and collaboration, as experts could have far more ready access to the knowledge most essential to them and their research (both in their domain and adjacent domains). The products of this work will be shared across the scientific community through a series of outreach efforts such as video courses, publications, and a workshop at a high-visibility conference. To broaden participation, outreach activities (including deepening collaborations with institutional colleagues and local community outreach) will be done with an emphasis on groups who are historically underrepresented in academia.
The planned work will be accomplished through a human-technology partnership, where domain experts specify their information need at the level they find intuitive (e.g., phosphorylation acts on proteins). The system will then extend techniques from the adjacent field of program synthesis to convert these high-level, abstract specifications into low-level grammars (i.e., sets of hierarchical information extraction rules) which can be executed in order to extract the desired information from text. Crucially, the specification requires no linguistic knowledge, making it accessible to a broader population. The need for domain-specific entities (e.g., names of proteins) will be addressed through an entity discovery procedure that incorporates techniques for detecting multi-word entity candidates and inferring their semantic types (e.g., PROTEIN). To ensure that the product of the system is readily interpretable and easily extensible, a series of user studies will be conducted to discover the key characteristics of rules and grammars that affect their interpretability and maintainability. Through this combined effort, several datasets and software products will be produced and made available to the wider community. These include (but are not limited to) (a) a dataset of event specifications and the corresponding automatically synthesized rules for several domains, (b) a dataset of human judgements of grammar interpretability, and (c) models which can serve as automatic proxies for the more expensive human evaluation of interpretability. All data will be anonymized and released under the Open Data Commons Public Domain Dedication & License, which allows users to freely share, modify, and use this data, in the hope that this effort will be exploited further. To ensure as wide an audience as possible, the software and techniques developed in this work, including the rule synthesis framework, a pipeline for entity discovery, and any generated user interfaces, will be released as open-source software products (under an Apache 2.0 open source license).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
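The abstract's idea of turning examples of an information need into executable extraction rules can be illustrated with a toy sketch. This is a brute-force enumeration written for illustration only, not the project's synthesis method. The token format, the example sentences, and the function names `candidates`, `matches`, and `synthesize` are all assumptions introduced here.

```python
from itertools import product

# A token is (word, entity_type); "O" marks a non-entity token.
# All sentences below are hypothetical toy examples, not project data.
POSITIVE = [
    [("MEK", "PROTEIN"), ("phosphorylates", "O"), ("ERK", "PROTEIN")],
    [("AKT", "PROTEIN"), ("phosphorylates", "O"), ("GSK3", "PROTEIN")],
]
NEGATIVE = [
    [("MEK", "PROTEIN"), ("binds", "O"), ("ERK", "PROTEIN")],
]


def candidates(example):
    """Enumerate every rule that generalizes one example sentence.

    Each token position becomes a word constraint, an entity-type
    constraint, or a wildcard.
    """
    options = [
        [("word", word), ("type", etype), ("any", None)]
        for word, etype in example
    ]
    return product(*options)


def matches(rule, sentence):
    """Return True if the rule matches the whole sentence, token by token."""
    if len(rule) != len(sentence):
        return False
    for (kind, value), (word, etype) in zip(rule, sentence):
        if kind == "word" and word != value:
            return False
        if kind == "type" and etype != value:
            return False
    return True


def synthesize(positive, negative):
    """Return the consistent rule with the fewest wildcards, or None."""
    consistent = [
        rule
        for rule in candidates(positive[0])
        if all(matches(rule, s) for s in positive)
        and not any(matches(rule, s) for s in negative)
    ]
    if not consistent:
        return None
    return min(consistent, key=lambda rule: sum(kind == "any" for kind, _ in rule))


rule = synthesize(POSITIVE, NEGATIVE)
print(rule)
# (('type', 'PROTEIN'), ('word', 'phosphorylates'), ('type', 'PROTEIN'))
```

The resulting rule reads as "a PROTEIN, the word phosphorylates, a PROTEIN", which a domain expert can inspect without linguistic training. Exhaustive enumeration only works for very short patterns; a realistic system would need a guided search over a much richer rule language.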
Project outcomes
Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Noriega-Atala, Enrique; Vacareanu, Robert; Hahn-Powell, Gus; Valenzuela-Escárcega, Marco A.
- Corresponding author: Valenzuela-Escárcega, Marco A.
Bootstrapping Neural Relation and Explanation Classifiers
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Zheng, Tang; Surdeanu, Mihai
- Corresponding author: Surdeanu, Mihai
From Examples to Rules: Neural Guided Rule Synthesis for Information Extraction
- DOI:
- Published: 2022-01
- Journal:
- Impact factor: 0
- Authors: Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu
- Corresponding author: Robert Vacareanu; M. A. Valenzuela-Escarcega; George C. G. Barbosa; Rebecca Sharp; M. Surdeanu
Syntax-driven Data Augmentation for Named Entity Recognition
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Sutiono, Arie; Hahn-Powell, Gus
- Corresponding author: Hahn-Powell, Gus
Do Transformer Networks Improve the Discovery of Inference Rules from Text?
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Rahimi, Mahdi; Surdeanu, Mihai
- Corresponding author: Surdeanu, Mihai
Other publications by Mihai Surdeanu
Information Extraction from Legal Wills: How Well Does GPT-4 Do?
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: A. Kwak; Cheonkam Jeong; Gaetano Forte; Derek E. Bambauer; Clayton T. Morrison; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
On Learning Bipolar Gradual Argumentation Semantics with Neural Networks
- DOI: 10.5220/0012448300003636
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Caren Al Anaissy; Sandeep Suntwal; Mihai Surdeanu; Srdjan Vesic
- Corresponding author: Srdjan Vesic
Retrieval Augmented Generation of Subjective Explanations for Socioeconomic Scenarios
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Razvan; Maria Alexeeva; K. Alcock; Nargiza Ludgate; Cheonkam Jeong; Zara Fatima Abdurahaman; Prateek Puri; Brian Kirchhoff; Santadarshan Sadhu; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Razvan; Vikas Yadav; Rishabh Maheshwary; Paul; Sathwik Tejaswi Madhusudhan; Mihai Surdeanu
- Corresponding author: Mihai Surdeanu
Similar NSFC (National Natural Science Foundation of China) grants
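Forensic application of circadian-rhythm small RNAs in estimating the time of bloodstain formation
- Award number:
- Approval year: 2024
- Funding amount: CNY 0
- Award type: Provincial/municipal project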
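Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Approval year: 2022
- Funding amount: CNY 100,000
- Award type: Provincial/municipal project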
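Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
- Award number: 32000033
- Approval year: 2020
- Funding amount: CNY 240,000
- Award type: Young Scientists Fund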
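Mechanisms by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Approval year: 2019
- Funding amount: CNY 580,000
- Award type: General Program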
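Mechanisms by which Streptococcus mutans small RNAs link LuxS quorum sensing and biofilm formation
- Award number: 81900988
- Approval year: 2019
- Funding amount: CNY 210,000
- Award type: Young Scientists Fund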
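Function and mechanism of key gut-bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Approval year: 2018
- Funding amount: CNY 560,000
- Award type: General Program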
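Molecular mechanism of pigeon milk secretion analyzed by small RNA sequencing
- Award number: 31802058
- Approval year: 2018
- Funding amount: CNY 260,000
- Award type: Young Scientists Fund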
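Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
- Award number: 31772128
- Approval year: 2017
- Funding amount: CNY 600,000
- Award type: General Program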
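Immune regulatory mechanism of acupuncture treatment for Hashimoto's thyroiditis based on small RNA-seq
- Award number: 81704176
- Approval year: 2017
- Funding amount: CNY 200,000
- Award type: Young Scientists Fund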
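Regulation of small RNA biosynthesis by rice OsSGS3 and OsHEN1 and its modulation of disease resistance
- Award number: 91640114
- Approval year: 2016
- Funding amount: CNY 850,000
- Award type: Major Research Plan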
Similar overseas grants
Powering Small Craft with a Novel Ammonia Engine
- Award number: 10099896
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Collaborative R&D
"Small performances": investigating the typographic punches of John Baskerville (1707-75) through heritage science and practice-based research
- Award number: AH/X011747/1
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Research Grant
Fragment to small molecule hit discovery targeting Mycobacterium tuberculosis FtsZ
- Award number: MR/Z503757/1
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Research Grant
Bacteriophage control of host cell DNA transactions by small ORF proteins
- Award number: BB/Y004426/1
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Research Grant
Windows for the Small-Sized Telescope (SST) Cameras of the Cherenkov Telescope Array (CTA)
- Award number: ST/Z000017/1
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Research Grant
CSR: Small: Leveraging Physical Side-Channels for Good
- Award number: 2312089
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Standard Grant
CSR: Small: Multi-FPGA System for Real-time Fraud Detection with Large-scale Dynamic Graphs
- Award number: 2317251
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Standard Grant
AF: Small: Problems in Algorithmic Game Theory for Online Markets
- Award number: 2332922
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Standard Grant
Collaborative Research: FET: Small: Algorithmic Self-Assembly with Crisscross Slats
- Award number: 2329908
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Standard Grant
NeTS: Small: ML-Driven Online Traffic Analysis at Multi-Terabit Line Rates
- Award number: 2331111
- Fiscal year: 2024
- Funding amount: $499,900
- Award type: Standard Grant