UKRI/BBSRC-NSF/BIO: Unifying Pfam protein sequence and ECOD structural classifications with structure models

Project abstract

Proteins are complex organic molecules essential for all functions of life. The study of a protein begins with learning about its evolutionary relatives. Because proteins tend to retain their function as they evolve, a well-studied relative can inform researchers about the properties of a less understood one and suggest hypotheses to test. A protein’s relatives are cataloged in public, freely available classification databases. Although amino-acid sequences are known for most proteins, spatial structures have been determined experimentally for only a small fraction. Protein classification can be based either on protein sequences (including those whose 3D structure is not yet known) or on protein spatial structures (augmented with sequences). Structure-based classifications are more accurate but include fewer proteins, necessarily missing those whose structures are unknown. Recently developed, revolutionary structure prediction methods that can produce accurate 3D models for any protein sequence bridge these sequence-only and structure/sequence classifications. Now is the time to bring the two classification types together through a synergistic collaboration. The teams in the United States (ECOD: Evolutionary Classification of protein Domains database, mostly structure-based) and the United Kingdom (Pfam: Protein families database, mostly sequence-based) will work together to make their two databases consistent with each other and more accurate, for the benefit of scientists and the broader community. The results of these protein classifications are readily incorporated into many other resources, such as Wikipedia pages, and are therefore widely used by audiences in science and education.

The ECOD and Pfam teams will collaboratively classify more than 1 million protein structure models generated by AlphaFold and RoseTTAFold into protein families of close evolutionary relatives. Existing families will be expanded with additional proteins. New families will be defined by sequence profile similarity aided by structure analysis, in a manner consistent with the Pfam classification standards. The project requires upgrading the software infrastructure to process millions of models and to synchronize the two classifications in terms of domain identifiers and family names. Synchronization of the ECOD and Pfam classifications will be achieved in four ways:

1) By defining new families in Pfam for domains currently present only in ECOD;
2) By rectifying the ECOD classification such that all domains are classified into a Pfam family (defining new ones where necessary);
3) By splitting Pfam domain families containing multiple ECOD domains into multiple families containing single domains;
4) By making consistent collaborative decisions about domain boundaries and family classification using proteins with 3D models in both databases.

These consistent domain definitions and classifications will facilitate functional inference and the detection of evolutionary insights across the scientific community and the public at large. Lastly, the internet architecture will be upgraded to serve these domain data to the broader scientific community through web portals. The results of this project are incorporated into both ECOD (http://prodata.swmed.edu/ecod) and Pfam (http://pfam.xfam.org).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (3)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
DPAM: A domain parser for AlphaFold models
  • DOI: 10.1002/pro.4548
  • Publication date: 2023
  • Journal:
  • Impact factor: 8
  • Authors: Zhang, Jing; Schaeffer, R. Dustin; Durham, Jesse; Cong, Qian; Grishin, Nick V.
  • Corresponding author: Grishin, Nick V.

Other publications by Nick Grishin

Insights into virulence: structure classification of the Vibrio parahaemolyticus RIMD mobilome
  • DOI: 10.1128/msystems.00796-23
  • Publication date: 2023-10-26
  • Journal:
  • Impact factor: 4.600
  • Authors: Lisa N. Kinch; R. Dustin Schaeffer; Jing Zhang; Qian Cong; Kim Orth; Nick Grishin
  • Corresponding author: Nick Grishin

Similar overseas grants

BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution
  • Grant number: BB/Y000455/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
BBSRC-NSF/BIO: An AI-based domain classification platform for 200 million 3D-models of proteins to reveal protein evolution
  • Grant number: BB/Y001117/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
22-BBSRC/NSF-BIO Building synthetic regulatory units to understand the complexity of mammalian gene expression
  • Grant number: BB/Y008898/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
20-BBSRC/NSF-BIO Regulatory control of innate immune response in marine invertebrates
  • Grant number: BB/W017865/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
22-BBSRC/NSF-BIO - Interpretable & Noise-robust Machine Learning for Neurophysiology
  • Grant number: BB/Y008758/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
22-BBSRC/NSF-BIO: Community-dependent CRISPR-cas evolution and robust community function
  • Grant number: BB/Y008774/1
  • Fiscal year: 2024
  • Funding amount: $1.1463 million
  • Project type: Research Grant
UKRI/BBSRC-NSF/BIO: Interpretable and Noise-Robust Machine Learning for Neurophysiology
  • Grant number: 2321840
  • Fiscal year: 2023
  • Funding amount: $1.1463 million
  • Project type: Continuing Grant
UKRI/BBSRC-NSF/BIO: Hidden costs of infection: mechanisms by which parasites disrupt host-microbe symbioses and alter development
  • Grant number: 2322173
  • Fiscal year: 2023
  • Funding amount: $1.1463 million
  • Project type: Continuing Grant
21-BBSRC/NSF-BIO: Developing large serine integrases as tools for constructing and manipulating synthetic replicons.
  • Grant number: BB/X012085/1
  • Fiscal year: 2023
  • Funding amount: $1.1463 million
  • Project type: Research Grant
UKRI/BBSRC-NSF/BIO Determining the Roles of Fusarium Effector Proteases in Plant Pathogenesis
  • Grant number: BB/X012131/1
  • Fiscal year: 2023
  • Funding amount: $1.1463 million
  • Project type: Research Grant