EAGER: SSMCDAT2023: Natural Language Processing and Large Language Models for Automated Extraction of Materials Chemistry Data from Scientific Literature

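Basic Information

  • Award number:
    2334411
  • Principal investigator:
  • Amount:
    $200,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Project period:
    2023-09-01 to 2025-08-31
  • Project status:
    Not yet concluded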

Project Abstract

NONTECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This project addresses the problem of materials data in scientific journal papers being hard to access and use in modern computer-assisted research environments because of the use of portable document format (PDF) files. Previous attempts to encourage better formats have not worked well. To tackle this problem, the team will use advanced artificial intelligence and natural language processing, which enables computers to “understand” text, to automatically extract materials data from scientific research papers. A crucial element of the team's approach involves leveraging valuable commercial resources, such as the Pauling File, which provides well-curated examples of materials data. The team will use this training data to enhance the performance of the large language models employed in the work. Large language models can “understand” text and generate text in a seemingly human way. Access to organized materials data has the potential to transform solid-state materials chemistry and enable faster progress in the field. The project also seeks to engage with materials scientists, authors, and editors from diverse subdisciplines of materials science to integrate this technology seamlessly into academic publishing workflows. This project also supports training graduate and undergraduate students, and creating outreach materials such as podcast episodes and YouTube courses to promote a wider understanding of artificial intelligence and materials science.

TECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This activity addresses the challenge of limited machine-readable materials data in academic literature, mainly due to the prevalence of PDF formats. Prior attempts to encourage machine-readable formats have been unsuccessful. The result has been the emergence of inaccurate and labor-intensive information extraction tools. The team aims to capitalize on very recent advances in natural language processing and large language models, and combine them with the Pauling File's hand-labeled data. This approach eliminates the need for manual labeling, empowering materials chemists to write papers as they always have, while using artificial intelligence to extract and organize materials data into machine-readable formats accurately and automatically. The approach includes steps for machine-learned versus rules-based token size reduction, comparison of open-source versus commercial large language models, expert analysis of errors and incompletions, and expansion to the extraction of materials property data in addition to synthesis data. Success in this endeavor would be potentially transformative to solid-state materials chemistry by leveraging progress that has been made in materials informatics. The activity aims to transform the materials data landscape, enabling widespread materials informatics progress by automating data extraction from research articles. The project's broader impacts extend to other academic domains, with potential applications in different scientific fields. It also promotes bilingual outreach and education, including unique social media content delivered through YouTube and podcast formats. Finally, the activity will bring authors, editors, data practitioners, and publishers together to assess data extraction performance.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Taylor Sparks

REU Site: Research Experience in Utah for Sustainable Materials Engineering (ReUSE)
  • Award number:
    1950589
  • Fiscal year:
    2020
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Collaborative Research: SSMCDAT2020: Solid-State and Materials Chemistry Data Science Hackathon
  • Award number:
    1938734
  • Fiscal year:
    2019
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
CAREER: SusChEM: Data Mining to Reduce the Risk in Discovering New Sustainable Thermoelectric Materials
  • Award number:
    1651668
  • Fiscal year:
    2017
  • Funding amount:
    $200,000
  • Project type:
    Continuing Grant
Collaborative Research: Guided Discovery of Sustainable Superhard Materials via Bond Optimization
  • Award number:
    1562226
  • Fiscal year:
    2016
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant

Similar Overseas Grants

EAGER: SSMCDAT2023: Revealing Local Symmetry Breaking in Intermetallics: Combining Statistical Mechanics and Machine Learning in PDF Analysis
  • Award number:
    2334261
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
EAGER: SSMCDAT2023: Database generation to identify trends in inter- and intra-polyhedral connectivity and energy storage behavior
  • Award number:
    2334240
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Collaborative Research: EAGER: SSMCDAT2023: Data-driven Predictive Understanding of Oxidation Resistance in High-Entropy Alloy Nanoparticles
  • Award number:
    2334386
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
EAGER: SSMCDAT2023: Deep learning Gibbs free energy functions to guide solid-state material synthesis
  • Award number:
    2334275
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Collaborative Research: EAGER: SSMCDAT2023: Data-driven Predictive Understanding of Oxidation Resistance in High-Entropy Alloy Nanoparticles
  • Award number:
    2334385
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant