Navigating Chemical Space with Natural Language Processing and Deep Learning

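Basic information

  • Grant number:
    EP/Y004167/1
  • Principal investigator:
  • Amount:
    $ 114,100
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2024
  • Funding country:
    United Kingdom
  • Duration:
    2024 to (no data)
  • Project status:
    Ongoing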

Project abstract

Natural language processing (NLP) lies at the intersection of linguistics and computer science and aims to process and analyse human language, typically provided as written text. NLP is now strongly focused on the use of machine learning for challenging tasks, and some revolutionary algorithms have been developed in the last few years. These algorithms now underpin a wide range of real-life applications, such as ChatGPT, virtual assistants and automatic text completion when we write emails. Innovative research ideas often come from integrating techniques and concepts across disciplines. For this discipline-hopping grant, we would like to explore how Transformer models, a ground-breaking deep learning algorithm developed by Google in 2017 that fuels the majority of current cutting-edge research in NLP, can be adapted to solve research challenges in chemistry. Chemical structures are usually three-dimensional. However, they are also often converted into sequences, called SMILES. SMILES has a simple vocabulary of chemical elements and bond symbols and a few grammatical rules governing how the chemical elements are positioned. Owing to this direct analogy to text sequences, SMILES makes it possible to use NLP algorithms to analyse chemical structures in much the same way as they are used to analyse text. For the proposed research, Dr Pang, a chemist, will work with Dr Vulic, an NLP and machine learning expert, in order to get up to speed with the latest developments in the field of NLP and to examine their further applicability in her domain of expertise. We will explore and utilise a concept that is now pervasive in machine learning and NLP, termed transfer learning, which 1) pretrains large general-purpose models, and 2) fine-tunes (i.e., specialises) those general models for specific tasks and applications where labelled data are expensive to create (as they require expert knowledge and complex annotation protocols) and thus inherently scarce.
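The analogy between SMILES strings and text can be made concrete with a small tokenizer. The sketch below is illustrative only and is not taken from the project: the regular expression covers a common subset of SMILES syntax (bracket atoms, two-letter halogens, organic-subset atoms, ring-closure digits, bonds and branches), and the function names `tokenize_smiles`, `build_vocab` and `encode` are hypothetical. It shows how a molecule might be turned into the integer token sequence that a Transformer model consumes.

```python
import re

# Assumed token pattern for a common subset of SMILES (not a full grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [Na+] or [C@@H]
    r"|Br|Cl"                # two-letter halogens, matched before B and C
    r"|[BCNOSPFI]"           # aliphatic organic-subset atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|%\d{2}|\d"            # ring-closure labels
    r"|[()=#\-+\\/:.~@*$]"   # branches, bonds and other punctuation
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(corpus):
    """Assign an integer id to every token seen in a list of SMILES strings."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in corpus:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Map a SMILES string to token ids, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize_smiles(smiles)]


if __name__ == "__main__":
    vocab = build_vocab(["CCO", "c1ccccc1"])
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    print(encode("CCO", vocab))
```

Tokenizers used in practice often learn sub-word units from the data instead of relying on a fixed pattern; the fixed pattern is used here only because it keeps the vocabulary-and-grammar analogy easy to see.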
Specifically, we will pretrain Transformer models to learn a latent representation of the chemical space defined by tens of millions of SMILES strings. This learned latent representation can then be used to predict molecular properties for a given chemical structure during fine-tuning. The advantage of this type of approach is that the resulting machine learning models rely less on so-called labelled data (molecules with experimentally determined properties), which are time-consuming or even impossible to generate in chemistry given the associated cost and experimental challenges. We will aim to make the Transformer models more computationally efficient and accurate using two recent machine learning techniques, termed sentence encoding and contrastive learning. We hope that this new molecular representation can complement existing molecular representation methods and provide an alternative approach to relating molecular structures to their properties, a task that underpins much research and development in the chemical and pharmaceutical industries.
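The abstract does not say which contrastive objective will be used. As a hedged illustration of the general idea, the sketch below implements one widely used formulation, the InfoNCE loss, in plain Python. It assumes each molecule has two embeddings (an "anchor" and a "positive", for example from two different SMILES strings of the same molecule, which is an assumption made here for illustration) and treats the other molecules in the batch as negatives. The function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are embeddings of the same molecule;
    positives[j] with j != i act as negatives for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching positive as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned, swapped)
```

The loss is small when each anchor is closest to its own positive and large when it is closer to another molecule's embedding, which is the pressure that pulls matching representations together in the latent space.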

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Jiayun Pang

New insights into the multi-step reaction pathway of the reductive half-reaction catalysed by aromatic amine dehydrogenase: a QM/MM study.
  • DOI:
    10.1039/c003107k
  • Year published:
    2010
  • Journal:
  • Impact factor:
    4.9
  • Authors:
    Jiayun Pang;Nigel S. Scrutton;Sam P de Visser;M. Sutcliffe
  • Corresponding author:
    M. Sutcliffe
Integrating computational methods with experiment uncovers the role of dynamics in enzyme catalysed H-tunnelling reactions
  • DOI:
  • Year published:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    L. Johannissen;Sam Hay;Jiayun Pang;M. Sutcliffe;N. Scrutton
  • Corresponding author:
    N. Scrutton
Assignment of the vibrational spectra of enzyme-bound tryptophan tryptophyl quinones using a combined QM/MM approach.
  • DOI:
    10.1021/jp910161k
  • Year published:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jiayun Pang;N. Scrutton;S. D. de Visser;M. Sutcliffe
  • Corresponding author:
    M. Sutcliffe
Delivering Antisense Oligonucleotides across the Blood-Brain Barrier by Tumor Cell-Derived Small Apoptotic Bodies
  • DOI:
    10.1002/advs.202004929
  • Year published:
    2021
  • Journal:
  • Impact factor:
    15.1
  • Authors:
    Yulian Wang;Jiayun Pang;Qingyun Wang;Luocheng Yan;Lintao Wang;Zhen Xing;Chunming Wang;Junfeng Zhang;Lei Dong
  • Corresponding author:
    Lei Dong
Using natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters
  • DOI:
  • Year published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jiayun Pang;Alexander W. R. Pine;Abdulai Sulemana
  • Corresponding author:
    Abdulai Sulemana

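Similar NSFC (National Natural Science Foundation of China) grants

Chinese Journal of Chemical Engineering
  • Grant number:
    21224004
  • Year approved:
    2012
  • Amount:
    CNY 200,000
  • Project type:
    Special fund project
Chinese Journal of Chemical Engineering
  • Grant number:
    21024805
  • Year approved:
    2010
  • Amount:
    CNY 200,000
  • Project type:
    Special fund project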

Similar overseas grants

Collaborative Research: IIBR: Innovation: Bioinformatics: Linking Chemical and Biological Space: Deep Learning and Experimentation for Property-Controlled Molecule Generation
  • Grant number:
    2318829
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Continuing Grant
Space Chemistry: Exploring our chemical origins and habitality
  • Grant number:
    2872450
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Studentship
Collaborative Research: IIBR: Innovation: Bioinformatics: Linking Chemical and Biological Space: Deep Learning and Experimentation for Property-Controlled Molecule Generation
  • Grant number:
    2318830
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Continuing Grant
Collaborative Research: IIBR: Innovation: Bioinformatics: Linking Chemical and Biological Space: Deep Learning and Experimentation for Property-Controlled Molecule Generation
  • Grant number:
    2318831
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Continuing Grant
Correlating Digital and Experimental Chemical Space to Pharmaceutical Manufacturing Processes
  • Grant number:
    2898544
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Studentship
A pan-Canadian chemical library to explore uncharted chemistry space for drug discovery
  • Grant number:
    489198
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
    Operating Grants
A Chemical Genetic Approach to Exploring Novel Therapeutic Space for Colorectal Cancer
  • Grant number:
    10908073
  • Fiscal year:
    2023
  • Amount:
    $ 114,100
  • Project type:
Unlocking the Chemical Space of Cancer-Associated Perturbations
  • Grant number:
    10478520
  • Fiscal year:
    2022
  • Amount:
    $ 114,100
  • Project type:
Creation of materials and functions through presice chemical modifications of discrete molecular nano-space incorporating metal ions
  • Grant number:
    22H02090
  • Fiscal year:
    2022
  • Amount:
    $ 114,100
  • Project type:
    Grant-in-Aid for Scientific Research (B)
CAREER: Exploring Novel Chemical Space: Modular Synthesis of Biologically Relevant Strained Molecules
  • Grant number:
    2143925
  • Fiscal year:
    2022
  • Amount:
    $ 114,100
  • Project type:
    Continuing Grant