
Exploiting data driven computational approaches for understanding protein structure and function in InterPro and Pfam

Basic information

Grant reference:
BB/S020381/1
Principal investigator:
Alex Bateman
Amount:
$1.0395 million
Host institution:
European Bioinformatics Institute
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2019
Funding country:
United Kingdom
Start and end dates:
2019 to (no data)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=BB%2FS020381%2F1
Keywords:
Exploiting data driven computational approaches

Project abstract

Proteins are biological macromolecules that perform a diverse array of crucial functions, acting as enzymes (e.g. the entities responsible for fermentation), transporters (e.g. hemoglobin in the blood) and mechanical structures (e.g. actin and myosin in muscle). Proteins are synthesised as linear polymers of building blocks called amino acids. They usually fold into complex three-dimensional (3D) structures, and typically interact with other proteins and molecules to perform their function. Knowledge of protein sequences can facilitate insights into hitherto undiscovered enzymes with potential applications in the biotechnology sector, or novel drugs of interest to the pharmaceutical industry. Detailed understanding of the functional architecture of proteins, including the arrangement of amino acids in a 3D structure, enables scientists to diagnose diseases as well as design more effective enzymes. These days, our ability to generate new protein sequences based on modern high-throughput DNA sequencing (HTS) techniques far outstrips our ability to functionally characterise them. Thus, most sequences are computationally annotated: similarities between new sequences and the few experimentally characterised examples are identified and used to infer function. More recently, HTS has been applied directly to environmental samples to discover previously uncultured bacteria and single-cell eukaryotes, and to enable the reconstruction of large and complex genomes, such as those of plants. Such approaches are correcting many of the historical biases in the protein sequence databases. However, for humankind to understand and utilise these data, sequences need to be functionally annotated, which is best accomplished using the information gleaned from sets of related sequences (known as protein families).
InterPro is a world-leading protein family resource that merges information from 13 different specialist databases to present the user with comprehensive functional analysis of sequences. One of its member databases, Pfam, is a collection of protein domain families containing functional annotations. Both InterPro and Pfam are well-established primary resources in the field of protein research. In this application, we propose crucial developments to both of these resources in order to augment their utility, functionality and scalability, as well as uniquely position them to tackle imminent advances in the field. We will leverage pre-established links with other protein databases and concurrently build additional pipelines to develop and exchange the latest information between these existing and new resources. We will improve coverage of protein sequences originating from environmental sources by building families for novel sets (or clusters) of related proteins. Considering the fundamental association between protein structure and function, we will develop a pipeline that will not only import structural models for Pfam entries and present them via the website, but will also ensure that the models remain up to date. To increase coverage and functional annotations in both resources, we will integrate new resources to provide sub-domain classifications, and improve annotations through combined literature searches and enhanced curation tools. To refine annotations, we will integrate a new algorithm called TreeGrafter into InterProScan (our software package that performs automatic annotations of protein sequences), and integrate controlled vocabularies for protein attributes from databases like PANTHER with those already in InterPro. We will evaluate the performance of an upgraded version of the HMMER software that is widely used to build protein families, including Pfam, to improve future scalability.
Finally, we will focus on eight genomes of agricultural importance, including chicken, salmon, and wheat, by systematically annotating 2000 associated entries in Pfam and, by extension, InterPro.

Project outputs

Journal articles (6)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

EMBL's European Bioinformatics Institute (EMBL-EBI) in 2022.

DOI:
10.1093/nar/gkac1098
Publication date:
2023-01-06
Journal:
Nucleic Acids Research
Impact factor:
14.9
Authors:
Thakur, Matthew;Bateman, Alex;Brooksbank, Cath;Freeberg, Mallory;Harrison, Melissa;Hartley, Matthew;Keane, Thomas;Kleywegt, Gerard;Leach, Andrew;Levchenko, Mariia;Morgan, Sarah;McDonagh, Ellen M.;Orchard, Sandra;Papatheodorou, Irene;Velankar, Sameer;Vizcaino, Juan Antonio;Witham, Rick;Zdrazil, Barbara;McEntyre, Johanna
Corresponding author:
McEntyre, Johanna

The InterPro protein families and domains database: 20 years on.

DOI:
10.1093/nar/gkaa977
Publication date:
2021-01-08
Journal:
Nucleic Acids Research
Impact factor:
14.9
Authors:
Blum M;Chang HY;Chuguransky S;Grego T;Kandasaamy S;Mitchell A;Nuka G;Paysan-Lafosse T;Qureshi M;Raj S;Richardson L;Salazar GA;Williams L;Bork P;Bridge A;Gough J;Haft DH;Letunic I;Marchler-Bauer A;Mi H;Natale DA;Necci M;Orengo CA;Pandurangan AP;Rivoire C;Sigrist CJA;Sillitoe I;Thanki N;Thomas PD;Tosatto SCE;Wu CH;Bateman A;Finn RD
Corresponding author:
Finn RD

Pfam: The protein families database in 2021.

DOI:
10.1093/nar/gkaa913
Publication date:
2021-01-08
Journal:
Nucleic Acids Research
Impact factor:
14.9
Authors:
Mistry J;Chuguransky S;Williams L;Qureshi M;Salazar GA;Sonnhammer ELL;Tosatto SCE;Paladin L;Raj S;Richardson LJ;Finn RD;Bateman A
Corresponding author:
Bateman A

The European Bioinformatics Institute (EMBL-EBI) in 2021.

DOI:
10.1093/nar/gkab1127
Publication date:
2022-01-07
Journal:
Nucleic Acids Research
Impact factor:
14.9
Authors:
Cantelli G;Bateman A;Brooksbank C;Petrov AI;Malik-Sheriff RS;Ide-Smith M;Hermjakob H;Flicek P;Apweiler R;Birney E;McEntyre J
Corresponding author:
McEntyre J

Reciprocal best structure hits: using AlphaFold models to discover distant homologues.