Developing novel deep-learning based methods for deciphering non-coding gene regulatory code


Basic Information

Project Abstract

SUMMARY: This project will contribute a novel pre-trained model for DNA based on Bidirectional Encoder Representations from Transformers, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and to facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM’s genetic databases (for example, dbSNP, dbGaP and ClinVar), which serve both scientists and public health by helping identify the genetic components of disease. While the genetic code explaining how DNA is translated into proteins is universal, the regulatory code that determines when and how genes are expressed varies across cell types and organisms. From a language-modeling perspective, non-coding DNA is highly complex because of polysemy and distant semantic relationships. Deep learning methods have recently been used to unravel the gene regulatory code, but they have failed to model such language features in the genome globally and robustly, especially in data-scarce scenarios. To address this challenge, we propose DNABERT, which models DNA as a language by adapting the idea of Bidirectional Encoder Representations from Transformers (BERT). Based on recent observations in natural language processing research, we hypothesize that pre-trained transformer-based neural network models offer a promising, and not yet fully explored, deep learning approach for a variety of sequence prediction tasks in the analysis of non-coding DNA. Our preliminary results showed that DNABERT, applied to the human genome, achieved state-of-the-art performance on promoter and splice-site prediction tasks after simple fine-tuning on small task-specific datasets (Ji, Y. et al. 2020). The goal of the proposed research is to develop DNABERT for a variety of sequence prediction tasks and to benchmark it against existing state-of-the-art deep-learning based methods.
The specific aims are to (1) develop novel deep-learning methods by adapting BERT; (2) apply the proposed deep-learning methods to non-coding DNA sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by applying DNABERT prediction models. A major contribution of the proposed research is the development of the pre-trained DNABERT model and prediction algorithms, which offer powerful new methods for the analysis and prediction of DNA sequences. Because pre-training DNABERT is resource-intensive, we will provide the source code and pre-trained model on GitHub for future academic research. We will also develop an integrated web server that provides (1) a deployed DNABERT model, (2) a database storing the identified sequence features and predictions, and (3) tutorials to help users apply DNABERT to their specific research problems. We anticipate that DNABERT will bring new advances and insights to the bioinformatics community by bringing an advanced language-modeling perspective to gene regulation analyses.
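As a rough illustration of what "modeling DNA as a language" can mean in practice, the sketch below splits a DNA sequence into overlapping k-mer "words" and maps them to integer IDs, the kind of input a BERT-style model consumes. The abstract does not specify a tokenization scheme; k-mer tokenization, the value of k, the special tokens, and the function names kmer_tokenize, build_vocab and encode are assumptions made for this example, not details taken from the project.

```python
from itertools import product

# Special tokens in the style of BERT vocabularies (assumed for illustration).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k: int = 6) -> dict[str, int]:
    """Map special tokens and all 4**k possible k-mers over ACGT to IDs."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + kmers)}


def encode(sequence: str, vocab: dict[str, int], k: int = 6) -> list[int]:
    """Wrap the k-mer tokens in [CLS] ... [SEP] and convert them to IDs.

    K-mers containing characters outside ACGT (for example N) map to [UNK].
    """
    tokens = ["[CLS]"] + kmer_tokenize(sequence, k) + ["[SEP]"]
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


if __name__ == "__main__":
    vocab = build_vocab(k=3)
    print(kmer_tokenize("ACGTAC", k=3))
    print(encode("ACGTAC", vocab, k=3))
```

Overlapping k-mers give each token some local sequence context, at the cost of a vocabulary that grows as 4**k; other tokenization choices are equally possible.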

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by RAMANA V DAVULURI

Developing novel deep-learning based methods for deciphering non-coding gene regulatory code
  • Grant number:
    10615784
  • Fiscal year:
    2021
  • Funding amount:
    $330,700
  • Project type:
Informatics Platform for Mammalian Gene Regulation at Isoform-level
  • Grant number:
    10273985
  • Fiscal year:
    2020
  • Funding amount:
    $330,700
  • Project type:
Informatics Platform for Mammalian Gene Regulation at Isoform-level
  • Grant number:
    9922347
  • Fiscal year:
    2013
  • Funding amount:
    $330,700
  • Project type:
Informatics Platform for Mammalian Gene Regulation at Isoform-level
  • Grant number:
    8843951
  • Fiscal year:
    2013
  • Funding amount:
    $330,700
  • Project type:
Informatics platform for mammalian gene regulation at isoform-level
  • Grant number:
    8658144
  • Fiscal year:
    2013
  • Funding amount:
    $330,700
  • Project type:
Bioinformatics Facility
  • Grant number:
    7945001
  • Fiscal year:
    2009
  • Funding amount:
    $330,700
  • Project type:
Genomewide discovery & analysis of alternative promoters
  • Grant number:
    7678211
  • Fiscal year:
    2006
  • Funding amount:
    $330,700
  • Project type:
Genomewide discovery & analysis of alternative promoters
  • Grant number:
    7226994
  • Fiscal year:
    2006
  • Funding amount:
    $330,700
  • Project type:
Genomewide discovery & analysis of alternative promoters
  • Grant number:
    7033451
  • Fiscal year:
    2006
  • Funding amount:
    $330,700
  • Project type:
Genomewide discovery & analysis of alternative promoters
  • Grant number:
    7371108
  • Fiscal year:
    2006
  • Funding amount:
    $330,700
  • Project type:

Similar grants from the National Natural Science Foundation of China (NSFC)

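Research on DEA-Benchmarking methods and dynamic games for enterprise performance evaluation
  • Grant number:
    70571028
  • Approval year:
    2005
  • Funding amount:
    CNY 165,000
  • Project type:
    General Program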

Similar overseas grants

An innovative EDI data, insights & peer benchmarking platform enabling global business leaders to build data-led EDI strategies, plans and budgets.
  • Grant number:
    10100319
  • Fiscal year:
    2024
  • Funding amount:
    $330,700
  • Project type:
    Collaborative R&D
BioSynth Trust: Developing understanding and confidence in flow cytometry benchmarking synthetic datasets to improve clinical and cell therapy diagnos
  • Grant number:
    2796588
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Studentship
Collaborative Research: SHF: Medium: A Comprehensive Modeling Framework for Cross-Layer Benchmarking of In-Memory Computing Fabrics: From Devices to Applications
  • Grant number:
    2347024
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Standard Grant
Elements: CausalBench: A Cyberinfrastructure for Causal-Learning Benchmarking for Efficacy, Reproducibility, and Scientific Collaboration
  • Grant number:
    2311716
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Standard Grant
Benchmarking collisional rates and hot electron transport in high-intensity laser-matter interaction
  • Grant number:
    2892813
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Studentship
FET: Medium: Quantum Algorithms, Complexity, Testing and Benchmarking
  • Grant number:
    2311733
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Continuing Grant
Collaborative Research: BeeHive: A Cross-Problem Benchmarking Framework for Network Biology
  • Grant number:
    2233969
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Continuing Grant
Establishing and benchmarking advanced methods to comprehensively characterize somatic genome variation in single human cells
  • Grant number:
    10662975
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
QUARREFOUR - Benchmarking Multi-core Quantum Computing Systems
  • Grant number:
    10074653
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Collaborative R&D
Benchmarking Quantum Advantage
  • Grant number:
    EP/Y004418/1
  • Fiscal year:
    2023
  • Funding amount:
    $330,700
  • Project type:
    Research Grant