CAREER: Large-Scale Learning for Information Extraction

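Basic information

Award number:
1845670
Principal investigator:
Alan Ritter
Amount:
$500,000
Institution:
Ohio State University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2019
Funding country:
United States
Period:
2019-09-01 to 2020-10-31
Project status:
Completed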

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1845670&HistoricalAwards=false
Keywords:
CAREER Large Scale Learning Information

Project abstract

Much of human knowledge is encoded in text. This project aims to substantially advance the capability of machines to read large document collections and reason about the knowledge contained within them using minimal human effort. This will help people to overcome information overload and make better decisions by analyzing vital information that is locked away in unstructured text. Recent years have seen tremendous progress on tasks such as speech recognition and machine translation, by applying deep learning methods on massive, high-quality datasets; however, most available datasets for information extraction are either small or very noisy. The project will address these challenges by developing new methods that can learn more effectively from big, but noisy datasets that are constructed using distant supervision from an existing knowledge base (KB). To demonstrate the new methods' effectiveness, they will be used to support several novel applications. These include the detection of cyber-threats reported online and the analysis of experts' opinions about their severity. Recent studies have found that 75% of software vulnerabilities are first reported online, giving attackers time to exploit the vulnerability. Systems that can automatically read computer security blogs and analyze new threats could help security practitioners to track and prioritize them more effectively. The project includes a plan for integrating research and education. Outreach efforts aim to help attract a more diverse group of students to study computer science. These include hands-on workshops to expose freshmen to exciting natural language processing and artificial intelligence applications. The project will also help to engage advanced undergraduate students in research through new course materials on cutting-edge information extraction techniques.
The research will address the machine reading data bottleneck by inventing new methods that can learn effectively from large, noisy datasets using distant supervision. These methods will address the challenge of label noise inherent in distant supervision by performing inference over latent variables during learning, filling in missing information, and resolving ambiguities. The approach combines the benefits of structured learning and neural networks; the structured learning component of the model can override noisy labels in cases where it is sufficiently confident -- this is balanced against a model of missing data in the KB. This will catalyze the rapid development of extractors for many new tasks and domains. To demonstrate this, extensive experiments will compare against state-of-the-art methods using standard benchmark datasets for information extraction, including the Freebase/NYT corpus, TAC KBP datasets, and TACRED. Furthermore, the research will push the boundaries of minimal supervision for Information Extraction by exploring new applications that demonstrate the generality of the approach, including entity, relation and event extraction, time normalization and learning to extract a real-time feed of cyber-threat intelligence using distant supervision from the National Vulnerability Database (NVD). These applications are supported by a comprehensive evaluation plan that includes the development of new corpora and metrics. The project will produce a number of new datasets in addition to a toolkit for minimally supervised information extraction, that will be shared as open source software. This research effort will support the rapid development of information systems for a broad range of new tasks and domains using minimal human effort.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Alan Ritter

Stanceosaurus 2.0 - Classifying Stance Towards Russian and Spanish Misinformation

DOI:
Publication year:
2024
Journal:
WNUT
Impact factor:
0
Authors:
Anton Lavrouk;Ian Ligon;Tarek Naous;Jonathan Zheng;Alan Ritter;Wei Xu
Corresponding author:
Wei Xu

Extracting COVID-19 Events from Twitter

DOI:
Publication year:
2020
Journal:
arXiv.org
Impact factor:
0
Authors:
Shi Zong;Ashutosh Baheti;Wei Xu;Alan Ritter
Corresponding author:
Alan Ritter

Why do they stay? : an analysis of factors influencing retention of international school teachers : a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Massey University, Albany, New Zealand

DOI:
Publication year:
2016
Journal:
Impact factor:
0
Authors:
Alan Ritter
Corresponding author:
Alan Ritter

“i have a feeling trump will win..................”: Forecasting Winners and Losers from User Predictions on Twitter

DOI:
Publication year:
2017
Journal:
Conference on Empirical Methods in Natural Language Processing
Impact factor:
0
Authors:
Sandesh Swamy;Alan Ritter;M. Marneffe
Corresponding author:
M. Marneffe

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

DOI:
10.48550/arxiv.2311.01011
Publication year:
2023
Journal:
ArXiv
Impact factor:
0
Authors:
S. Toyer;Olivia Watkins;Ethan Mendes;Justin Svegliato;Luke Bailey;Tiffany Wang;Isaac Ong;Karim Elmaaroufi;Pieter Abbeel;Trevor Darrell;Alan Ritter;Stuart Russell
Corresponding author:
Stuart Russell