Comparative Analysis Of Completely Sequenced Genomes
Basic information
- Grant number: 8558101
- Principal investigator:
- Amount: $3.3406 million
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year:
- Funding country: United States
- Start and end dates:
- Project status: Not yet concluded
- Source:
- Keywords: Affect; Agreement; Alien; Alternative Splicing; Animals; Antiviral Agents; Archaea; Bacteria; Base Pairing; Code; DNA Viruses; Databases; Eukaryota; Evolution; Family; Functional RNA; Genes; Genetic Recombination; Genome; Genomics; Guide RNA; Homologous Gene; Horizontal Gene Transfer; Human; Immune response; Immune system; Immunity; Individual; Infectious Agent; Introns; Invertebrates; Life; Mammals; Maps; Markov Chains; Mediating; Methods; Mus; Organism; Orthologous Gene; Pattern; Phylogenetic Analysis; Phylogeny; Plants; Plasmids; Process; Proteins; RNA; Race; Recording of previous events; Reporting; Research; Role; Set protein; Source; System; Trees; Vertebrates; Viral; Virus; Work; arm; base; comparative; cost; density; genome sequencing; insight; mammalian genome; markov model; mathematical model; paralogous gene; protein expression; reconstruction; trend
Project abstract
The rapidly growing database of completely sequenced genomes of bacteria, archaea, eukaryotes and viruses (several thousand genomes already available and many more in progress) creates both new opportunities and new challenges for genome research. Over the last year, we performed several studies that took advantage of the genomic information to establish fundamental principles of genome evolution and function. In particular, we investigated the evolution of the numerous long non-coding RNAs (lncRNAs) encoded in mammalian genomes. The functions of the lncRNAs remain largely unknown, but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insight into the mode of evolution and the functional range of the lncRNAs, they can be compared with the much better characterized protein-coding genes. The evolutionary rate of protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis, according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base pairing); however, the contributions of evolutionary rate and folding to the expression level are independent.
Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
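The kind of rate-versus-expression comparison described above can be illustrated with a rank correlation. The following is a minimal sketch, assuming a Spearman coefficient is an acceptable measure; the abstract does not state which statistic was used, and the function names and toy values are hypothetical, not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy values (assumption): one entry per gene.
expression = [120.0, 45.0, 8.0, 300.0, 15.0, 60.0]
evolutionary_rate = [0.10, 0.22, 0.35, 0.05, 0.40, 0.18]

rho = spearman(expression, evolutionary_rate)
print(f"Spearman rho = {rho:.3f}")  # negative: higher expression, slower evolution
```

A rank-based coefficient is used here because expression levels typically span orders of magnitude; a real analysis would also need a significance test and matched control gene sets, as the abstract describes.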
A separate project was dedicated to the phylogenomics of prokaryotic defense systems. The recently discovered CRISPR-Cas adaptive immune system is present in almost all archaea and many bacteria. It consists of cassettes of CRISPR repeats that incorporate spacers homologous to fragments of viral or plasmid genomes that are employed as guide RNAs in the immune response, along with numerous CRISPR-associated (cas) genes that encode proteins possessing diverse, only partially characterized activities required for the action of the system. Here, we investigate the evolution of the cas genes and show that they evolve under purifying selection that is typically much weaker than the median strength of purifying selection affecting genes in the respective genomes. The exceptions are the cas1 and cas2 genes, which typically evolve at levels of purifying selection close to the genomic median. Thus, although these genes are implicated in the acquisition of spacers from alien genomes, they do not appear to be directly involved in an arms race between bacterial and archaeal hosts and infectious agents. These genes might possess functions distinct from and additional to their role in the CRISPR-Cas-mediated immune response. Taken together with evidence of the frequent horizontal transfer of cas genes reported previously and with the widespread microscale recombination within these genes detected in this work, these findings reveal the highly dynamic evolution of cas genes. This conclusion is in line with the involvement of CRISPR-Cas in antiviral immunity, which is likely to entail a coevolutionary arms race with rapidly evolving viruses. However, we failed to detect evidence of strong positive selection in any of the cas genes.
We extensively studied the evolution of the Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) that constitute an apparently monophyletic group consisting of at least 6 families of viruses infecting a broad variety of eukaryotic hosts. A comprehensive genome comparison and maximum-likelihood reconstruction of the NCLDV evolution revealed a set of approximately 50 conserved, core genes that could be mapped to the genome of the common ancestor of this class of eukaryotic viruses.
We performed a detailed phylogenetic analysis of these core NCLDV genes and applied the constrained tree approach to show that the majority of the core genes are unlikely to be monophyletic. Several of the core genes have been independently acquired from different sources by different NCLDV lineages whereas for the majority of these genes displacement by homologs from cellular organisms in one or more groups of the NCLDV was demonstrated. Thus, a detailed study of the evolution of the genomic core of the NCLDV reveals substantial complexity and diversity of evolutionary scenarios that was largely unsuspected previously. The phylogenetic coherence between the core genes is sufficient to validate the hypothesis on the evolution of all NCLDV from a common ancestral virus although the set of ancestral genes might be smaller than previously inferred from patterns of gene presence-absence.
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities differ widely between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6-7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. Excellent agreement between the MCMC and ML inferences is demonstrated, whereas Dollo parsimony introduces a noticeable bias in the estimates, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
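Of the three reconstruction methods named above, Dollo parsimony is the simplest to illustrate: a character (here, an intron at one position) is assumed to be gained exactly once and may be lost any number of times. The following is a minimal sketch of that rule for a single intron position; the tree, the node names and the `dollo_presence` helper are hypothetical, and this is not the pipeline used in the project.

```python
def dollo_presence(children, root, present_leaves):
    """Return the nodes where the character is inferred present under
    Dollo parsimony (single gain, any number of losses)."""
    has_present = {}  # node -> True if its subtree contains a carrying leaf

    def mark(node):
        kids = children.get(node, [])
        if not kids:
            has_present[node] = node in present_leaves
        else:
            flags = [mark(k) for k in kids]  # visit every child
            has_present[node] = any(flags)
        return has_present[node]

    if not mark(root):
        return set()

    # The single gain is placed at the last common ancestor of all carriers:
    # walk down from the root while only one child lineage carries it.
    origin = root
    while True:
        carriers = [k for k in children.get(origin, []) if has_present[k]]
        if len(carriers) != 1:
            break
        origin = carriers[0]

    # Every node between the gain and a carrying leaf retains the character.
    present, stack = set(), [origin]
    while stack:
        node = stack.pop()
        present.add(node)
        stack.extend(k for k in children.get(node, []) if has_present[k])
    return present


def count_losses(children, present):
    """Count branches leading from a carrying node to a non-carrying child."""
    return sum(1 for p in present for k in children.get(p, []) if k not in present)


# Hypothetical toy tree (assumption): internal nodes n1..n3, five leaves.
children = {
    "root": ["n1", "n2"],
    "n1": ["human", "mouse"],
    "n2": ["fly", "n3"],
    "n3": ["yeast", "plant"],
}
present = dollo_presence(children, "root", {"human", "plant"})
print(sorted(present), count_losses(children, present))
```

Repeating this over all intron positions and dividing the count of inferred introns at a node by the coding length would give an ancestral density estimate under the Dollo rule, which the abstract reports as typically lower than the MCMC and ML estimates.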
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Eugene V Koonin
Identification of dephospho-CoA kinase in Thermococcus kodakarensis and the complete CoA biosynthesis pathway
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Takahiro Shimosaka; Kira S Makarova; Eugene V Koonin; Haruyuki Atomi
- Corresponding author: Haruyuki Atomi
超好熱性アーキアThermococcus kodakarensisにおける新規dephospho-CoA kinaseの同定および解析
Identification and analysis of a novel dephospho-CoA kinase in the hyperthermophilic archaeon Thermococcus kodakarensis
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Takahiro Shimosaka; Kira S Makarova; Eugene V Koonin; Haruyuki Atomi
- Corresponding author: Haruyuki Atomi
超好熱性アーキアThermococcus kodakarensisにおけるアーキア特異的な新規 dephospho-CoA kinaseの同定および解析
Identification and analysis of a novel archaea-specific dephospho-CoA kinase in the hyperthermophilic archaeon Thermococcus kodakarensis
- DOI:
- Year published: 2019
- Journal:
- Impact factor: 0
- Authors: Takahiro Shimosaka; Kira S Makarova; Eugene V Koonin; Haruyuki Atomi
- Corresponding author: Haruyuki Atomi
Other grants by Eugene V Koonin
Finding Protein Sequence Motifs--Methods and Application
- Grant number: 6988455
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Finding Protein Sequence Motifs--methods And Application
- Grant number: 6681337
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Comparative Analysis Of Completely Sequenced Genomes
- Grant number: 7969213
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Finding Protein Sequence Motifs--methods And Applications
- Grant number: 8943217
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Comparative Analysis Of Completely Sequenced Genomes
- Grant number: 9160910
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Finding Protein Sequence Motifs--methods And Applications
- Grant number: 9555730
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Finding Protein Sequence Motifs--methods And Applications
- Grant number: 7594460
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Finding Protein Sequence Motifs--methods And Applications
- Grant number: 7735068
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
COMPARATIVE ANALYSIS OF COMPLETELY SEQUENCED GENOMES
- Grant number: 6111075
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
Comparative Analysis Of Completely Sequenced Genomes
- Grant number: 6988458
- Fiscal year:
- Funding amount: $3.3406 million
- Project category:
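Similar National Natural Science Foundation of China (NSFC) grants
Research on end-to-end secure transmission models and secure routing protocols for satellite internet
- Grant number: 62302389
- Year approved: 2023
- Funding amount: CNY 300,000
- Project category: Young Scientists Fund
Recursive state estimation for 2-D networked systems under relay communication protocols
- Grant number: 62373103
- Year approved: 2023
- Funding amount: CNY 500,000
- Project category: General Program
High-security-level theoretical analysis of new practical quantum cryptography protocols
- Grant number: 12374473
- Year approved: 2023
- Funding amount: CNY 520,000
- Project category: General Program
Downlink communication compression algorithms and protocols for federated learning in a cloud-edge-device architecture
- Grant number: 62372487
- Year approved: 2023
- Funding amount: CNY 500,000
- Project category: General Program
Measurement-device-independent-type quantum key distribution protocols for practical applications
- Grant number: 62371244
- Year approved: 2023
- Funding amount: CNY 530,000
- Project category: General Program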
Similar overseas grants
Parent-adolescent informant discrepancies: Predicting suicide risk and treatment outcomes
- Grant number: 10751263
- Fiscal year: 2024
- Funding amount: $3.3406 million
- Project category:
The Proactive and Reactive Neuromechanics of Instability in Aging and Dementia with Lewy Bodies
- Grant number: 10749539
- Fiscal year: 2024
- Funding amount: $3.3406 million
- Project category:
A study for cross borders Indonesian nurses and care workers: Case of Japan-Indonesia Economic Partnership Agreement
- Grant number: 22KJ0334
- Fiscal year: 2023
- Funding amount: $3.3406 million
- Project category: Grant-in-Aid for JSPS Fellows