Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam
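Basic information
- Grant number: BB/S020381/1
- Principal investigator:
- Amount: $1.0395 million
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2019
- Funding country: United Kingdom
- Duration: 2019 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract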
Proteins are biological macromolecules that perform a diverse array of crucial functions, from enzymes (e.g. the entities responsible for fermentation) to transporters (e.g. hemoglobin in the blood) to mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesized as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes. These days, our ability to generate new protein sequences based on modern high-throughput DNA sequencing (HTS) techniques far outstrips our ability to functionally characterise them. Thus, most sequences are computationally annotated: similarities between new sequences and the few experimentally characterised examples are identified and used to infer function (i.e. annotate). More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single-celled eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).
InterPro is a world-leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources. We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will integrate a new algorithm called TreeGrafter into InterProScan (our software package that performs automatic annotations of protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. We will evaluate the performance of an upgraded version of the HMMER software, which is widely used to build protein families, including those in Pfam, to improve future scalability.
Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and, by extension, InterPro.
Project outcomes
Journal articles (6)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
EMBL's European Bioinformatics Institute (EMBL-EBI) in 2022.
- DOI: 10.1093/nar/gkac1098
- Publication date: 2023-01-06
- Journal:
- Impact factor: 14.9
- Authors: Thakur, Matthew; Bateman, Alex; Brooksbank, Cath; Freeberg, Mallory; Harrison, Melissa; Hartley, Matthew; Keane, Thomas; Kleywegt, Gerard; Leach, Andrew; Levchenko, Mariia; Morgan, Sarah; McDonagh, Ellen M.; Orchard, Sandra; Papatheodorou, Irene; Velankar, Sameer; Vizcaino, Juan Antonio; Witham, Rick; Zdrazil, Barbara; McEntyre, Johanna
- Corresponding author: McEntyre, Johanna
The InterPro protein families and domains database: 20 years on.
- DOI: 10.1093/nar/gkaa977
- Publication date: 2021-01-08
- Journal:
- Impact factor: 14.9
- Authors: Blum M; Chang HY; Chuguransky S; Grego T; Kandasaamy S; Mitchell A; Nuka G; Paysan-Lafosse T; Qureshi M; Raj S; Richardson L; Salazar GA; Williams L; Bork P; Bridge A; Gough J; Haft DH; Letunic I; Marchler-Bauer A; Mi H; Natale DA; Necci M; Orengo CA; Pandurangan AP; Rivoire C; Sigrist CJA; Sillitoe I; Thanki N; Thomas PD; Tosatto SCE; Wu CH; Bateman A; Finn RD
- Corresponding author: Finn RD
Pfam: The protein families database in 2021.
- DOI: 10.1093/nar/gkaa913
- Publication date: 2021-01-08
- Journal:
- Impact factor: 14.9
- Authors: Mistry J; Chuguransky S; Williams L; Qureshi M; Salazar GA; Sonnhammer ELL; Tosatto SCE; Paladin L; Raj S; Richardson LJ; Finn RD; Bateman A
- Corresponding author: Bateman A
The European Bioinformatics Institute (EMBL-EBI) in 2021.
- DOI: 10.1093/nar/gkab1127
- Publication date: 2022-01-07
- Journal:
- Impact factor: 14.9
- Authors: Cantelli G; Bateman A; Brooksbank C; Petrov AI; Malik-Sheriff RS; Ide-Smith M; Hermjakob H; Flicek P; Apweiler R; Birney E; McEntyre J
- Corresponding author: McEntyre J
Reciprocal best structure hits: using AlphaFold models to discover distant homologues.
- DOI: 10.1093/bioadv/vbac072
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors:
- Corresponding author:
Other publications by Alex Bateman
Bioinformatics Applications Note Databases and Ontologies Codex: Exploration of Semantic Changes between Ontology Versions
- DOI:
- Publication date:
- Journal:
- Impact factor: 0
- Authors: Michael Hartung; Anika Groß; E. Rahm; Alex Bateman
- Corresponding author: Alex Bateman
Bioinformatics Advance Access published May 31, 2007
- DOI: 10.1007/s10015-009-0735-5
- Publication date: 2007
- Journal:
- Impact factor: 0.9
- Authors: Alex Bateman
- Corresponding author: Alex Bateman
Other grants held by Alex Bateman
Improving accuracy, coverage, and sustainability of functional protein annotation in InterPro, Pfam and FunFam using Deep Learning methods
- Grant number: BB/X018660/1
- Fiscal year: 2024
- Funding amount: $1.0395 million
- Project category: Research Grant
UKRI/BBSRC-NSF/BIO: Unifying Pfam protein sequence and ECOD structural classifications with structure models
- Grant number: BB/X012492/1
- Fiscal year: 2023
- Funding amount: $1.0395 million
- Project category: Research Grant
Rfam: The community resource for RNA families
- Grant number: BB/S020462/1
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category: Research Grant
RNAcentral, the RNA sequence database
- Grant number: BB/N019199/1
- Fiscal year: 2017
- Funding amount: $1.0395 million
- Project category: Research Grant
Rfam: Towards a sustainable resource for understanding the genomic functional ncRNA repertoire
- Grant number: BB/M011690/1
- Fiscal year: 2015
- Funding amount: $1.0395 million
- Project category: Research Grant
Keeping pace with protein sequence annotation; consolidating and enhancing Pfam and InterPro's methodologies for functional prediction
- Grant number: BB/L024136/1
- Fiscal year: 2014
- Funding amount: $1.0395 million
- Project category: Research Grant
The RNAcentral database of non-coding RNAs
- Grant number: BB/J019232/1
- Fiscal year: 2012
- Funding amount: $1.0395 million
- Project category: Research Grant
Embracing new technologies to streamline improve and sustain InterPro and its contributing databases
- Grant number: BB/F010435/1
- Fiscal year: 2008
- Funding amount: $1.0395 million
- Project category: Research Grant
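Similar grants from the National Natural Science Foundation of China
Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
- Grant number:
- Year approved: 2024
- Funding amount:
- Project category: Collaborative Innovation Research Team
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
- Grant number:
- Year approved: 2024
- Funding amount:
- Project category: Research Fund for International Young Scientists
Development of a Linear Stochastic Model for Wind Field Reconstruction from Limited Measurement Data
- Grant number:
- Year approved: 2020
- Funding amount: CNY 400,000
- Project category:
Estimation of high-dimensional volatility matrices based on high-frequency information, with applications
- Grant number: 71901118
- Year approved: 2019
- Funding amount: CNY 180,000
- Project category: Young Scientists Fund
Efficient estimation and applications of semiparametric spatial autoregressive panel models
- Grant number: 71961011
- Year approved: 2019
- Funding amount: CNY 160,000
- Project category: Regional Science Fund
Statistical inference, forecasting, and applications of volatility for high-frequency data
- Grant number: 71971118
- Year approved: 2019
- Funding amount: CNY 500,000
- Project category: General Program
Projective nonlinear non-negative tensor factorization based on individual analysis for pattern analysis of high-dimensional unstructured data
- Grant number: 61502059
- Year approved: 2015
- Funding amount: CNY 190,000
- Project category: Young Scientists Fund
Key technologies for semantic interoperability of Web services based on Linked Open Data
- Grant number: 61373035
- Year approved: 2013
- Funding amount: CNY 770,000
- Project category: General Program
New methods for volume data representation and rendering
- Grant number: 61170206
- Year approved: 2011
- Funding amount: CNY 550,000
- Project category: General Program
A new class of regime-switching models and their application in financial modeling
- Grant number: 11061041
- Year approved: 2010
- Funding amount: CNY 240,000
- Project category: Regional Science Fund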
Similar overseas grants
Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
- Grant number: 10211377
- Fiscal year: 2021
- Funding amount: $1.0395 million
- Project category:
Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
- Grant number: 10378686
- Fiscal year: 2021
- Funding amount: $1.0395 million
- Project category:
Understanding and exploiting novel therapeutic vulnerabilities of RIT1-driven lung cancer
- Grant number: 10641671
- Fiscal year: 2021
- Funding amount: $1.0395 million
- Project category:
Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam
- Grant number: BB/S020039/1
- Fiscal year: 2020
- Funding amount: $1.0395 million
- Project category: Research Grant
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10737854
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category:
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10381296
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category:
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10064023
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category:
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10528617
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category:
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10533732
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category:
Exploiting Ecology and Evolution to Prevent Therapy Resistance in EGFR-Driven Lung Cancer
- Grant number: 10524202
- Fiscal year: 2019
- Funding amount: $1.0395 million
- Project category: