BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution

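Basic information

Grant number:
BB/Y001117/1
Principal investigator:
Christine Orengo
Amount:
$342,100
Host institution:
University College London
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2024
Funding country:
United Kingdom
Start and end dates:
2024 to (no data)
Project status:
Not concluded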

Source:
https://gtr.ukri.org/projects?ref=BB%2FY001117%2F1
Keywords:
BBSRC NSF BIO AI based

Project abstract

Proteins play a major role in most important processes in life, such as the digestion of nutrients, immune response, and cellular regulation. They are comprised of long polymers that fold into compact globular forms known as domains. Most proteins have at least two domains and some are composed of dozens. Domains tend to be associated with specific functions, although sometimes an important function will result from combining multiple domains. 3D structure data and models are particularly valuable for detecting the pockets and surface features linked to domain function. Determining the structure and orientations of the constituent domains is important for understanding the overall function of the protein and the dynamic conformational changes linked to that.

Until recently, structural data for proteins was very sparse, with <1% of all known proteins experimentally characterised. Whilst structures can be predicted with reasonable accuracy when the structure of a close relative is known, for a significant proportion of proteins such data did not exist. Even for important organisms like humans or wheat, <50% of proteins had structural data accurate enough to understand the structural impacts of changes in the genes coding the proteins.

This situation changed dramatically in 2021 when DeepMind's AlphaFold AI system succeeded in predicting protein structures of comparable quality to experimentally characterised proteins. In August 2022, DeepMind released >214 million protein structures for all known proteins. Whilst recent analyses showed that in some cases AlphaFold models are not accurate enough for detailed studies, largely because the data needed to make the prediction is still too sparse, the AlphaFold data still massively increases the amount of high-quality structural data available for understanding the mechanisms by which proteins function.

Identifying constituent domains in a protein is not trivial. This project will exploit powerful AI technologies to more accurately predict domain boundaries. Preliminary studies are already showing significant improvements. We will apply multiple domain detection algorithms independently developed by two world-renowned protein domain classification teams (ECOD, CATH), both of whom have long track records in successfully automating domain detection. Their methods employ complementary strategies that can be combined to give a consensus prediction where agreement in assignments reflects higher confidence levels.

Another major challenge will be coping with the scale of the data. Even allowing for a 50% loss due to poor model quality, the data represents a >200-fold increase in the data already classified in these evolutionary resources. An existing domain assignment and classification pipeline (3D-SCAFOLD) built to integrate experimental domain data from two resources (SCOP, CATH) will be re-engineered to incorporate ECOD (which is much more comprehensive than SCOP) and capture the vast predicted data from AlphaFold. This will require new and more efficient workflows that parallelise the processes. Furthermore, the pipeline will be more complex as additional steps will be necessary to determine the model quality and remove poor models. We will also adapt access to the webpages and APIs to allow users to request targeted subsets and perform more complex queries needed by the increase in the scale of the data.

In addition, we expect that many large, more complex multidomain proteins will be very challenging, leading to discrepancies between the results provided by the different resources. We will hold workshops for the teams to agree on consensus assignments.

To cope with the scale of the data, we will initially target proteins in pathogenic organisms, crops essential for food security, and protein families linked to human health and well-being, including enzyme families important for environmental remediation and the production of commercially valuable compounds.
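
The abstract mentions combining domain assignments from complementary methods into a consensus prediction, with agreement signalling higher confidence. The project does not say how that consensus is computed, so the following is only a minimal illustrative sketch under stated assumptions: each method returns continuous domains as inclusive (start, end) residue ranges, agreement is measured by Jaccard overlap, and the `min_overlap` threshold of 0.8 is arbitrary. The function names are hypothetical and are not part of ECOD, CATH, or 3D-SCAFOLD.

```python
from typing import List, Tuple

# Assumption: a domain is a single continuous residue range with inclusive ends.
# Real domains can be discontinuous; this sketch ignores that case.
Domain = Tuple[int, int]


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Jaccard overlap of two inclusive residue ranges (0.0 if disjoint)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def consensus_domains(
    method_a: List[Domain],
    method_b: List[Domain],
    min_overlap: float = 0.8,  # hypothetical agreement threshold
) -> List[Domain]:
    """Keep only domains on which both methods agree closely.

    For each domain from method A, find the best-overlapping domain from
    method B; if the overlap reaches the threshold, report the shared region.
    """
    consensus = []
    for a in method_a:
        best = max(method_b, key=lambda b: overlap_fraction(a, b), default=None)
        if best is not None and overlap_fraction(a, best) >= min_overlap:
            consensus.append((max(a[0], best[0]), min(a[1], best[1])))
    return consensus


# Made-up boundaries for one protein, for illustration only.
assignments_a = [(1, 100), (101, 250)]
assignments_b = [(3, 98), (120, 300)]
agreed = consensus_domains(assignments_a, assignments_b)
```

In this toy input the two methods agree closely on the first domain but disagree on the second, so only the first would be reported as a consensus assignment; the disputed region is the kind of case the abstract suggests resolving through expert discussion.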

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Christine Orengo

Understanding the structural and functional diversity of ATP-PPases using protein domains and functional families in the CATH database

DOI:
10.1016/j.str.2024.12.016
Publication date:
2025-03-06
Journal:
STRUCTURE
Impact factor:
4.300
Authors:
Jialin Yin; Vaishali P. Waman; Neeladri Sen; Mohd Firdaus-Raih; Su Datt Lam; Christine Orengo
Corresponding author:
Christine Orengo