A computational approach to identify non-linear sequence similarity between lncRNAs

Project abstract

The goal of this proposal is to develop a computational approach to identify non-linear sequence similarity between long noncoding RNAs (lncRNAs). A substantial portion of the RNAs produced by eukaryotic genomes can be classified as lncRNAs, which have little or no potential to encode protein. LncRNAs are essential across kingdoms of life owing to their critical roles in gene regulation. However, progress in the field has been stifled by a lack of computational tools to identify meaningful similarity between lncRNAs. Unlike protein-coding genes, lncRNAs are not constrained by codon usage, evolve rapidly, and achieve function by employing structures or proteins in ways that are not well understood. Thus, lncRNAs with similar functions often lack any semblance of linear sequence similarity, yet for want of other options, linear alignment remains the predominant approach for sequence comparison in the lncRNA field. As a result, studies of one lncRNA rarely inform the understanding of others, and among the thousands of unstudied lncRNAs, it is nearly impossible to computationally identify those that encode meaningful functions. However, prior research has demonstrated, as a proof of principle, that non-linear forms of sequence comparison can provide substantially more information about the biological properties of lncRNAs than linear alignment can, including a modest ability to infer molecular function. In this project, researchers will develop and validate software that will enable any biologist, regardless of computational expertise, to perform quantitative, non-linear sequence comparisons from essentially any computing resource, including a personal laptop.
In concert with the development and validation of the new software, the project will provide high-quality mentored research experiences and sustained career guidance for undergraduate students from underrepresented or underprivileged backgrounds, thereby encouraging their entry into science and promoting equity, diversity, and overall excellence in computational biology. The investigative team recently developed an approach called SEEKR (sequence evaluation through k-mer representation), which compares sequences by their relative abundance of substrings called k-mers. SEEKR provided some of the first evidence that lncRNAs with analogous functions can harbor similarities that are invisible to conventional forms of linear sequence alignment. Despite this success, SEEKR remains limited in its utility. Most notably, it is unable to identify regional similarity between lncRNAs, and it has no means to consider local nucleotide context in similarity evaluations, each of which is an essential component of lncRNA functionality. Moreover, SEEKR is qualitative and gives end users no ability to assess the significance of its similarity scores, a critical feature of all broadly used sequence comparison tools. Thus, while SEEKR was an important proof of principle, it falls well short of the reliable and broadly applicable tool that the field needs to identify meaningful non-linear similarity between lncRNAs. To address these shortcomings and provide biologists with better tools to identify relationships between sequence and function in lncRNAs, this research will apply a statistical approach called the hidden Markov model (HMM) to develop a Python-based software package, hmmSEEKR, that will give biologists the ability to identify regional and whole-transcript similarities in k-mer content between any set of lncRNAs.
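The idea of comparing sequences by the relative abundance of their k-mers can be illustrated with a minimal sketch. This is not the SEEKR implementation: the function names are hypothetical, the profile uses plain relative frequencies with no normalization against a reference set, and the choice of Pearson correlation as the similarity score is an assumption made for illustration.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Relative abundance of every possible k-mer (ACGT alphabet) in seq."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def kmer_similarity(seq_a, seq_b, k=3):
    """Assumed similarity score: correlation between the two k-mer profiles."""
    return pearson(kmer_profile(seq_a, k), kmer_profile(seq_b, k))
```

Because the score depends only on k-mer content, two sequences can score as similar even when no linear alignment between them exists. A whole-transcript score of this kind says nothing about where in a transcript the similarity lies, which is the limitation the abstract raises.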
An early version of hmmSEEKR enabled the identification of known protein-binding domains and functionally characterized lncRNAs from within the mammalian transcriptome, feats that, to the knowledge of the investigative team, have not previously been achieved, including with SEEKR. hmmSEEKR will be rigorously validated and tuned by identifying commonalities in protein-binding profiles between a set of known lncRNA functional domains and other lncRNA-like domains from within the transcriptome. Findings will be published in open access journals, including recommendations for default parameters, and a vetted version of hmmSEEKR will be deposited on GitHub and the Python Package Index. A usage manual and links to results will also be posted on https://www.med.unc.edu/pharm/calabreselab/seekr/. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
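How a hidden Markov model can locate regional similarity in k-mer content can be sketched as follows. The abstract does not describe the structure or parameters of hmmSEEKR, so everything here is an assumption: a two-state model (query-like `+` versus background `-`), k-mer frequency tables supplied by the caller, a single hypothetical `stay` probability for remaining in the current state, and Viterbi decoding to label each k-mer position.

```python
from math import log


def viterbi_regions(seq, query_freq, null_freq, k=2, stay=0.99):
    """Label each k-mer of seq as query-like ('+') or background ('-').

    query_freq and null_freq map k-mers to emission probabilities for the
    two states; both tables are assumptions provided by the caller.
    """
    states = ("+", "-")
    emit = {"+": query_freq, "-": null_freq}
    floor = 1e-6  # small probability for k-mers missing from a table
    trans = {s: {t: log(stay if s == t else 1 - stay) for t in states}
             for s in states}

    def emission(state, word):
        return log(emit[state].get(word, floor))

    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not words:
        return ""

    # Log-probability of the best path ending in each state.
    score = {s: log(0.5) + emission(s, words[0]) for s in states}
    back = []  # one back-pointer table per step after the first
    for word in words[1:]:
        prev, score, ptr = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + trans[s][t])
            score[t] = prev[best] + trans[best][t] + emission(t, word)
            ptr[t] = best
        back.append(ptr)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

Runs of `+` in the returned string mark candidate regions whose k-mer content resembles the query. In this sketch a higher `stay` value discourages short, spurious regions; assessing the statistical significance of a region would require a separate step that is not shown.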

Other publications by Joseph Calabrese

705. Using iPS Derived Neurons and GWAS Together to Identify Genes for Lithium Response
  • DOI:
    10.1016/j.biopsych.2017.02.772
  • Publication date:
    2017-05-15
  • Authors:
    John Kelsoe; Mike McCarthy; Caroline Nievergelt; Paul Shilling; John Nurnberger; Elliot Gershon; William Coryell; Melvin McInnis; Wade Berrettini; Ketil Odegaard; Joseph Calabrese; Peter Zandi; Martin Alda; Mark Frye; David Craig; Jerome Mertens; Kristen Brennand; Jun Yao; Fred Gage
  • Corresponding author:
    Fred Gage

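Similar NSFC grants

Topological properties of quantitative domains
  • Grant number:
    11771310
  • Year approved:
    2017
  • Funding amount:
    480,000 CNY
  • Project type:
    General Program
Research on related problems based on the Riemann-Hilbert method
  • Grant number:
    11026205
  • Year approved:
    2010
  • Funding amount:
    30,000 CNY
  • Project type:
    Tianyuan Fund for Mathematics
Experimental study, guided by the EnSite array, of the mechanisms of chronic atrial fibrillation unresponsive to the stepwise approach and of ablation line design
  • Grant number:
    81070152
  • Year approved:
    2010
  • Funding amount:
    100,000 CNY
  • Project type:
    General Program
Quantitative analysis of the short-range interfacial interaction mechanisms of membrane fouling by soluble microbial products in MBRs
  • Grant number:
    50908133
  • Year approved:
    2009
  • Funding amount:
    200,000 CNY
  • Project type:
    Young Scientists Fund
Study of a physical model of cleavage fracture in a new low-carbon martensitic high-strength steel at different low temperatures
  • Grant number:
    50671047
  • Year approved:
    2006
  • Funding amount:
    300,000 CNY
  • Project type:
    General Program
Research on optimizing artificial plant communities in sandy areas based on niche theory and methods
  • Grant number:
    30470298
  • Year approved:
    2004
  • Funding amount:
    150,000 CNY
  • Project type:
    General Program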

Similar overseas grants

A full spectrum rational approach to identify antiarrhythmic agents targeting IKs Channels
  • Grant number:
    10734513
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
Novel approach to identify RNA-bound small molecules in vivo
  • Grant number:
    10646626
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
Multimodal omics approach to identify health to cardiometabolic disease transitions
  • Grant number:
    10753664
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
Dissecting the tumor cell-immune TME axis to identify therapeutically actionable vulnerabilities that potentiate immunotherapy in GBM
  • Grant number:
    10743534
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
A novel proteomics approach to identify alcohol-induced changes in synapse-specific presynaptic protein interactions.
  • Grant number:
    10651991
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
Limited interaction cohort to identify determinants of viral suppression in MSM and transfeminine individuals living with HIV: A multilevel approach
  • Grant number:
    10685845
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
A mechanistic and dyadic approach to identify how interpersonal conscientiousness supports cognitive health and lowers risk of dementia
  • Grant number:
    10739837
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
A Prostate Cancer Dependency Map to Identify Tumor Subtype-Specific Vulnerabilities
  • Grant number:
    10578640
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
A machine learning approach to identify carbon dioxide-binding proteins for sustainability and health
  • Grant number:
    2838427
  • Fiscal year:
    2023
  • Funding amount:
    $662,600
  • Project type:
    Studentship
Engaging Hospitalized Patients and Family Caregivers to Identify and Prevent Delirium Superimposed on Dementia: An Intervention Mapping Approach.
  • Grant number:
    10642332
  • Fiscal year:
    2023
  • Funding amount:
    $662,600