EAGER: SSMCDAT2023: Natural Language Processing and Large Language Models for Automated Extraction of Materials Chemistry Data from Scientific Literature

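Basic information

  • Award number:
    2334411
  • Principal investigator:
  • Amount:
    $200,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Project period:
    2023-09-01 to 2025-08-31
  • Project status:
    Not yet concluded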

Project abstract

NONTECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This project addresses the problem of materials data in scientific journal papers being hard to access and use in modern computer-assisted research environments because of the use of portable document format (PDF) files. Previous attempts to encourage better formats have not worked well. To tackle this problem, the team will use advanced artificial intelligence and natural language processing, which enables computers to “understand” text, to automatically extract materials data from scientific research papers. A crucial element of the team's approach involves leveraging valuable commercial resources, such as the Pauling File, which provides well-curated examples of materials data. The team will utilize this training data to enhance the performance of the large language models employed in the work. Large language models can “understand” text and generate text in a seemingly human way. Access to organized materials data has the potential to transform solid-state materials chemistry and enable faster progress in the field. The project also seeks to engage with materials scientists, authors, and editors from diverse subdisciplines of materials science to integrate this technology seamlessly into academic publishing workflows. This project also supports training graduate and undergraduate students, and creating outreach materials such as podcast episodes and YouTube courses to promote a wider understanding of artificial intelligence and materials science.

TECHNICAL SUMMARY

This award is made on an EAGER proposal. It supports progress on a project advanced at the SSMCDAT 2023 Datathon held at Lehigh University. This activity addresses the challenge of limited machine-readable materials data in academic literature, mainly due to the prevalence of PDF formats. Prior attempts to encourage machine-readable formats have been unsuccessful. The result has been the emergence of inaccurate and labor-intensive information extraction tools. The team aims to capitalize on very recent advances in natural language processing and large language models, and combine them with the Pauling File's hand-labeled data. This approach eliminates the need for manual labeling, empowering materials chemists to write papers as they always have, while using artificial intelligence to extract and organize materials data into machine-readable formats accurately and automatically. The approach includes steps for machine-learned versus rules-based token size reduction, comparison of open-source versus commercial large language models, expert analysis of errors and incompletions, and expansion to materials property extraction data in addition to synthesis data. Success in this endeavor would be potentially transformative to solid-state materials chemistry by leveraging progress that has been made in materials informatics. The activity aims to transform the materials data landscape, enabling widespread materials informatics progress by automating data extraction from research articles. The project's broader impacts extend to other academic domains, with potential applications in different scientific fields. It also promotes bilingual outreach and education, including unique social media content delivered through YouTube and podcast formats. Finally, the activity will bring authors, editors, data practitioners, and publishers together to assess data extraction performance.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Taylor Sparks

REU Site: Research Experience in Utah for Sustainable Materials Engineering (ReUSE)
  • Award number:
    1950589
  • Fiscal year:
    2020
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
Collaborative Research: SSMCDAT2020: Solid-State and Materials Chemistry Data Science Hackathon
  • Award number:
    1938734
  • Fiscal year:
    2019
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
CAREER: SusChEM: Data Mining to Reduce the Risk in Discovering New Sustainable Thermoelectric Materials
  • Award number:
    1651668
  • Fiscal year:
    2017
  • Award amount:
    $200,000
  • Award type:
    Continuing Grant
Collaborative Research: Guided Discovery of Sustainable Superhard Materials via Bond Optimization
  • Award number:
    1562226
  • Fiscal year:
    2016
  • Award amount:
    $200,000
  • Award type:
    Standard Grant

Similar overseas grants

EAGER: SSMCDAT2023: Revealing Local Symmetry Breaking in Intermetallics: Combining Statistical Mechanics and Machine Learning in PDF Analysis
  • Award number:
    2334261
  • Fiscal year:
    2023
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
EAGER: SSMCDAT2023: Database generation to identify trends in inter- and intra-polyhedral connectivity and energy storage behavior
  • Award number:
    2334240
  • Fiscal year:
    2023
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
Collaborative Research: EAGER: SSMCDAT2023: Data-driven Predictive Understanding of Oxidation Resistance in High-Entropy Alloy Nanoparticles
  • Award number:
    2334386
  • Fiscal year:
    2023
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
EAGER: SSMCDAT2023: Deep learning Gibbs free energy functions to guide solid-state material synthesis
  • Award number:
    2334275
  • Fiscal year:
    2023
  • Award amount:
    $200,000
  • Award type:
    Standard Grant
Collaborative Research: EAGER: SSMCDAT2023: Data-driven Predictive Understanding of Oxidation Resistance in High-Entropy Alloy Nanoparticles
  • Award number:
    2334385
  • Fiscal year:
    2023
  • Award amount:
    $200,000
  • Award type:
    Standard Grant