Elements: Towards a Robust Cyberinfrastructure for NLP-based Search and Discoverability over Scientific Literature
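Basic information
- Award number: 2104025
- Principal investigator:
- Amount: $399,600
- Institution:
- Institution country: United States
- Award type: Standard Grant
- Fiscal year: 2021
- Funding country: United States
- Period: 2021-05-01 to 2025-04-30
- Project status: Not yet concluded
- Source:
- Keywords: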
Project abstract
This project creates an open platform for accessing and mining information from scientific texts that provides access to an array of software, computing resources, and publication data. Current search technologies typically find many relevant documents, but do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. Natural Language Processing (NLP) strategies are a recognized means to approach this problem, and this project develops the cyberinfrastructure to support sophisticated search and retrieval from scientific publications, use and augmentation of facilities for advanced and well-established natural language processing and machine learning tools, and extraction and aggregation of data from scientific publications. The project leverages two NSF-funded projects: the Language Applications (LAPPS) Grid, which has already proven to be an effective platform for development of NLP applications; and University of Wisconsin’s xDD (formerly, GeoDeepDive), a scalable, dependable infrastructure capable of rapidly growing a digital library of scientific publications, currently including over 13 million documents from multiple distributed commercial and open-access providers. The effort significantly enhances the value of these existing NSF-funded infrastructures by providing access to services for mining scientific publications and lowering the barriers to entry resulting from licensing, redistribution, and intellectual property issues. Scientists may perform large-scale text retrieval and mining using the University of Wisconsin’s high performance computing (HPC) infrastructure through a web-based interface. Iterative domain adaptation capabilities allow scientists to easily adapt existing services to specialized areas without configuring or installing additional components. 
The potential impact of the cyberinfrastructure is applicable to any community that relies on computational tools for mining large textual datasets, including researchers in sociology, psychology, economics, education, linguistics, digital media, and the humanities.
This project extends the LAPPS Grid to provide access to UW-xDD’s collection of scientific publications and UW’s High Performance Computing facilities, as well as means to rapidly adapt existing, well-established natural language processing and machine learning software tools to new domains and evaluate results. The LAPPS Grid provides a large collection of NLP tools from a wide variety of sources exposed as web services, together with multiple commonly used resources and a front-end document retrieval engine currently configured to access PubMed/PubMedCentral as well as nightly updates of the CORD-19 dataset. The LAPPS Grid is open source, and can be run from the web, on a user’s laptop or desktop, in the cloud, or as a self-contained Docker image when it is necessary to protect sensitive or licensed data, when there is no network connection available, or for deployment on remote HPC facilities. All tools and resources can be used interoperably, eliminating the effort required to convert input and output formats to use a set of tools or resources together. xDD is one of the world’s largest single repositories of scientific publications: it spans all domains of knowledge, incorporates new documents automatically, and updates API endpoints every hour. xDD has accumulated over 13 million publications from multiple commercial and open-access publishers.
The xDD infrastructure is an integral part of the developing UW-COSMOS pipeline, which consists of a suite of services supporting document processing, including ingestion and parsing of PDFs; extraction of individual document objects such as text sections, figures, tables, and captions; and recall, which creates searchable Anserini and ElasticSearch indexes on the contexts and objects to enable retrieval of information. Specific project activities include implementing efficient retrieval and analysis of xDD’s vast holdings of scientific publications; extending the NLP capabilities of the LAPPS Grid for scientific publication mining and domain adaptation; developing full interoperability between the Grid and xDD/COSMOS; scaling LAPPS Grid services to handle the very large textual datasets available from UW-xDD; and surveying visualization techniques and integrating them into the Grid.
This award by the Office of Advanced Cyberinfrastructure is jointly supported by the NSF Division of Information and Intelligent Systems within the Directorate for Computer and Information Science and Engineering, and the NSF Public Access program.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
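The "recall" step described in the abstract, which indexes extracted document objects so they can be retrieved, can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the `DocObject` type, the `build_index` and `search` functions, and the sample data are hypothetical and are not part of COSMOS, xDD, Anserini, or ElasticSearch, which use far more capable indexing and ranking.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DocObject:
    """Hypothetical extracted object: a text section, figure, table, or caption."""
    doc_id: str
    kind: str
    text: str


def build_index(objects):
    """Map each lowercase token to the set of (doc_id, kind) pairs containing it."""
    index = defaultdict(set)
    for obj in objects:
        for token in obj.text.lower().split():
            index[token].add((obj.doc_id, obj.kind))
    return index


def search(index, query):
    """Return the (doc_id, kind) pairs that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Made-up sample objects, standing in for the output of PDF parsing and extraction.
objects = [
    DocObject("doc1", "caption", "Stratigraphic column of the basin"),
    DocObject("doc1", "section", "We sampled the basin in 2019"),
    DocObject("doc2", "table", "Isotope ratios by sample"),
]
index = build_index(objects)
results = search(index, "the basin")
```

Here `results` holds the caption and the text section of `doc1`, since both contain every query token. Indexing objects rather than whole documents is what lets a query land on a specific table or caption, which is the point the abstract makes about indexing "contexts and objects".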
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Evaluating Retrieval for Multi-domain Scientific Publications
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Nancy Ide;Keith Suderman;Jingxuan Tu;M. Verhagen;Shanan Peters;Ian Ross;John Lawson;Andrew Borg-Andre
- Corresponding authors: Nancy Ide;Keith Suderman;Jingxuan Tu;M. Verhagen;Shanan Peters;Ian Ross;John Lawson;Andrew Borg-Andre
Other publications by James Pustejovsky
Introduction to Special Issue on Advances in Question Answering
- DOI: 10.1007/s10579-005-7883-6
- Published: 2006-02-28
- Journal:
- Impact factor: 1.800
- Authors: James Pustejovsky;Janyce Wiebe
- Corresponding author: Janyce Wiebe
Situated UMR for Multimodal Interactions
- DOI:
- Published:
- Journal:
- Impact factor: 0
- Authors: Kenneth Lai;R. Brutti;Lucia Donatelli;James Pustejovsky
- Corresponding author: James Pustejovsky
Scalar Anaphora: Annotating Degrees of Coreference in Text
- DOI: 10.18653/v1/2023.crac-main.4
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Bingyang Ye;Jingxuan Tu;James Pustejovsky
- Corresponding author: James Pustejovsky
Integrated Annotation of Event Structure, Object States, and Entity Coreference
- DOI: 10.18653/v1/2023.crac-main.9
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Kyeongmin Rim;James Pustejovsky
- Corresponding author: James Pustejovsky
Encoding Gesture in Multimodal Dialogue: Creating a Corpus of Multimodal AMR
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Kenneth Lai;R. Brutti;Lucia Donatelli;James Pustejovsky
- Corresponding author: James Pustejovsky
Other grants held by James Pustejovsky
EAGER: Integrating Dense Paraphrased-Enriched Representations with Large Language Models
- Award number: 2326985
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant
Travel Support for North American Summer School for Logic, Language, and Information (NASSLLI)
- Award number: 2002141
- Fiscal year: 2020
- Amount: $399,600
- Award type: Standard Grant
Collaborative Research: NSF2026: EAGER: A Playground and Proposal for Growing an AGI
- Award number: 2033932
- Fiscal year: 2020
- Amount: $399,600
- Award type: Standard Grant
EAGER: Collaborative Research: Mining Scientific Literature with the LAPPS Grid
- Award number: 1811402
- Fiscal year: 2018
- Amount: $399,600
- Award type: Standard Grant
Workshop: The International Linguistics Olympiad
- Award number: 1632453
- Fiscal year: 2016
- Amount: $399,600
- Award type: Standard Grant
Workshop: The International Linguistics Olympiad in Blagoevgrad, Bulgaria: July 20-24, 2015.
- Award number: 1547270
- Fiscal year: 2015
- Amount: $399,600
- Award type: Standard Grant
Workshop: The International Linguistics Olympiad
- Award number: 1442079
- Fiscal year: 2014
- Amount: $399,600
- Award type: Standard Grant
Outstanding Student Research at GL2013
- Award number: 1348830
- Fiscal year: 2013
- Amount: $399,600
- Award type: Standard Grant
SI2-SSI: The Language Application Grid: A Framework for Rapid Adaptation and Reuse
- Award number: 1147912
- Fiscal year: 2012
- Amount: $399,600
- Award type: Standard Grant
RI: Small: Interpreting Linguistic Spatiotemporal Relations in Static and Dynamic Contexts
- Award number: 1017765
- Fiscal year: 2010
- Amount: $399,600
- Award type: Standard Grant
Similar overseas grants
ERI: Towards Robust and Secure Intelligent 3D Sensing Systems
- Award number: 2347426
- Fiscal year: 2024
- Amount: $399,600
- Award type: Standard Grant
Towards Motion-Robust and Efficient Functional MRI Using Implicit Function Learning
- Award number: EP/Y002016/1
- Fiscal year: 2024
- Amount: $399,600
- Award type: Research Grant
Towards Robust Hydrogen Electrode for High-Rate Alkaline Electrolysis
- Award number: DP230102504
- Fiscal year: 2023
- Amount: $399,600
- Award type: Discovery Projects
Collaborative Research: SaTC: CORE: Small: Towards Robust, Scalable, and Resilient Radio Fingerprinting
- Award number: 2225161
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant
CRII: CNS: Towards Robust and Efficient Dynamic Spectrum Sharing with Knowledge Transfer
- Award number: 2245918
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant
Towards Robust Measurements of H0 with Dark Standard Candles
- Award number: 2307026
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant
Robust ESG data for biodiversity: towards a spatially-sensitive approach to Sustainable Finance
- Award number: NE/X016471/1
- Fiscal year: 2023
- Amount: $399,600
- Award type: Research Grant
Collaborative Research: CISE-MSI: DP: RI: Towards Scalable, Resilient and Robust Foraging with Heterogeneous Robot Swarms
- Award number: 2318682
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant
Robust ESG data for biodiversity: towards a spatially-sensitive approach to Sustainable Finance
- Award number: NE/X01634X/1
- Fiscal year: 2023
- Amount: $399,600
- Award type: Research Grant
RUI: Towards Robust Sparsity in Nonuniform Sampling Multidimensional NMR
- Award number: 2305086
- Fiscal year: 2023
- Amount: $399,600
- Award type: Standard Grant