Developing novel deep-learning based methods for deciphering non-coding gene regulatory code

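Basic information

Grant number: 10615784
Principal investigator: RAMANA V DAVULURI
Amount: $330.8K
Host institution: STATE UNIVERSITY NEW YORK STONY BROOK
Host institution country: United States
Project category:
Fiscal year: 2021
Funding country: United States
Project period: 2021-08-01 to 2025-04-30
Project status: Not yet concluded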

Source: https://reporter.nih.gov/project-details/10615784
Keywords:
Address; Benchmarking; Binding; Bioinformatics; Biological Assay; Bipolar Disorder; CRISPR/Cas technology; ChIP-seq; ClinVar; Code; Communities; Complex; Computer Vision Systems; Consumption; DNA; DNA Sequence; DNA Sequence Analysis; Data; Data Set; Databases; Development; Disease; Distant; Family member; Future; Gene Expression Regulation; Genes; Genetic; Genetic Code; Genetic Databases; Genome; Goals; Human; Human Cell Line; Human Genome; Label; Language; Luciferases; Malignant Neoplasms; Methods; Modeling; Mus; Natural Language Processing; Neural Network Simulation; Organism; Parkinson Disease; Performance; Protein Isoforms; Proteins; Public Health; RNA Splicing; Regulator Genes; Regulatory Element; Reporter; Research; Research Personnel; Resources; Schizophrenia; Scientist; Semantics; Site; Source Code; Specificity; Techniques; Time; Tissues; Training; Translating; United States National Library of Medicine; Untranslated RNA; Variant; Visualization; autism spectrum disorder; candidate validation; cell type; database of Genotypes and Phenotypes; dbSNP; deep learning; deep neural network; genetic variant; genome editing; human DNA; insight; learning strategy; neuropsychiatric disorder; novel; prediction algorithm; predictive modeling; promoter; public health relevance; tool; transcriptome; transfer learning; web server

Project abstract

SUMMARY: This project will contribute novel pre-trained DNA Bidirectional Encoder Representations from Transformers, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and to facilitate the integration of gene regulatory information from rapidly accumulating sequence data with NLM’s genetic databases (for example, dbSNP, dbGaP and ClinVar), which serve both scientists and public health by helping identify the genetic components of disease. While the genetic code explaining how DNA is translated into proteins is universal, the regulatory code that determines when and how genes are expressed varies across cell types and organisms. From a language-modeling perspective, non-coding DNA is highly complex because of polysemy and distant semantic relationships. Deep learning methods have recently been used to unravel the gene regulatory code, but they have failed to model such language features in the genome globally and robustly, especially in data-scarce scenarios. To address this challenge, we propose DNABERT, which models DNA as a language by adapting the idea of Bidirectional Encoder Representations from Transformers (BERT). Based on recent observations in natural language processing research, we hypothesize that pre-trained transformer-based neural network models offer a promising, yet not fully explored, deep learning approach for a variety of sequence prediction tasks in the analysis of non-coding DNA. Our preliminary results showed that DNABERT on the human genome achieved state-of-the-art performance on promoter and splice-site prediction tasks after simple fine-tuning on small task-specific datasets (Ji, Y. et al. 2020). The goal of our proposed research is to develop DNABERT for a variety of sequence prediction tasks and to benchmark it against existing state-of-the-art deep-learning-based methods.
The specific aims are to (1) develop novel deep-learning methods by adapting BERT; (2) apply the proposed deep-learning methods specifically to non-coding DNA sequence analyses and predictions; and (3) predict and validate functional non-coding genetic variants by applying DNABERT prediction models. A major contribution of the proposed research is the development of a pre-trained DNABERT model and prediction algorithms, which provide powerful new methods for the analysis and prediction of DNA sequences. Because pre-training DNABERT is resource-intensive, we will provide the source code and pre-trained model on GitHub for future academic research. We will also develop an integrated web server that provides (1) a deployed DNABERT model, (2) a database to store the identified sequence features and predictions, and (3) tutorials to help users apply DNABERT to their specific research problems. We anticipate that DNABERT can bring new advancements and insights to the bioinformatics community by bringing an advanced language-modeling perspective to gene regulation analyses.

Summary

Project outcomes

Journal articles (5)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

DNABERT-S: LEARNING SPECIES-AWARE DNA EMBEDDING WITH GENOME FOUNDATION MODELS

DOI: 10.48550/arxiv.2402.08777
Publication date: 2024-02
Journal: ArXiv
Impact factor: 0
Authors: Zhihan Zhou; Weimin Wu; Harrison Ho; Jiayi Wang; Lizhen Shi; R. Davuluri; Zhong Wang; Han Liu
Corresponding authors: Zhihan Zhou; Weimin Wu; Harrison Ho; Jiayi Wang; Lizhen Shi; R. Davuluri; Zhong Wang; Han Liu

Deep multi-omics integration by learning correlation-maximizing representation identifies prognostically stratified cancer subtypes.