
Measuring functional similarity between transcriptional enhancers using deep learning


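Basic information

Grant number:
10302539
Principal investigator:
Hani Z. Girgis
Amount:
$367,900
Host institution:
TEXAS A&M UNIVERSITY-KINGSVILLE
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-09-01 to 2024-08-31
Project status:
Completed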

Source:
https://reporter.nih.gov/project-details/10302539
Keywords:
Address; Artificial Intelligence; Binding Sites; Clinical; Code; Computing Methodologies; Consumption; DNA; DNA Sequence; Data; Databases; Diptera; Disease; Disease susceptibility; Distant; Drosophila genus; Drosophila melanogaster; Elements; Enhancers; Gene Expression; Gene Expression Regulation; Genes; Genetic Enhancer Element; Genetic Transcription; Genome; Genomics; Goals; Hymenoptera; Insecta; Learning; Lepidoptera; Link; Location; Machine Learning; Malignant Neoplasms; Measures; Methods; Modeling; Molecular; Molecular Biology; Mutate; Mutation; Order Coleoptera; Output; Performance; Physiological Processes; Reporter Genes; Research; Risk; Specificity; Structure; Techniques; Testing; Time; Tissues; Training; Transcriptional Regulation; Transgenic Organisms; Variant; artificial neural network; cell type; computerized tools; deep learning; deep learning algorithm; design; experimental study; genetic element; machine learning algorithm; novel; promoter; sequence learning; tool; transcription factor

Project summary

Understanding transcriptional regulation remains a major task in molecular biology. Enhancers are genetic elements that regulate when and where genes are expressed, as well as their expression levels. These elements are hard to discover because their locations and orientations are not constrained with respect to their target genes. Several diseases, and susceptibility to certain diseases, are linked to mutations and variants in enhancers. Multiple experimental and computational methods have been developed for locating enhancers. Computational methods are better suited to handling the large number of genomes now being sequenced because they are faster, cheaper, and less labor-intensive than experimental methods. Despite the many available computational tools, we lack a sophisticated tool that can measure the similarity in enhancer activity between a pair of sequences. We propose here to use Deep Artificial Neural Networks (DANNs) to develop such a tool. The long-term objective of this project is to decipher the code governing gene regulation, with the following specific aims: (i) design a computational tool for measuring enhancer-enhancer similarity, (ii) validate up to 96 putative enhancers experimentally, (iii) understand enhancer grammar, and (iv) annotate enhancers in more than 50 insect genomes. To achieve these aims, a novel application of DANNs is proposed. Current tools use DANNs to answer a yes-no question: does a sequence have activity similar to that of the tissue-specific enhancers comprising a particular training set of known enhancers? These approaches require training a separate network for each tissue, leading to inconsistent performance across tissues. Instead, we use a DANN to answer a related but different question: does this sequence have enhancer activity similar to that of a single known tissue-specific enhancer?
This deep network should perform consistently across cell types because it is trained on pairs of sequences (not on individual sequences, as the available tools are) representing all tissues for which there are known enhancers. The DANN is trained to distinguish sequence pairs with similar enhancer activities from those with dissimilar activities; the dissimilar pairs include (i) two enhancers active in two different tissues, (ii) one enhancer and a random genomic sequence, and (iii) two random genomic sequences. The tool outputs a score between 0 and 1 indicating how similar the enhancer activities of the two sequences are. Using a much simpler machine learning algorithm than DANNs, we demonstrate that pairs with similar enhancer activities can be separated with very high accuracy from pairs of random genomic sequences and from pairs of one enhancer and a random genomic sequence. The new tool has many important potential applications, including consistent annotation of enhancers across cell types and related species. It can annotate enhancers active in a cell type that has only a small number of known enhancers, and it can annotate enhancers in related genomes when a set of known enhancers has been demarcated in one of them. Discovering new transcription factor binding sites is another potential application. The proposed tool can also facilitate the study of enhancer "design principles" and of the effects of variants. Such applications will advance our field.
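
The pair-based training set described in the summary can be illustrated with a minimal sketch. This is not the project's implementation. The function name `build_training_pairs` and its parameters `enhancers_by_tissue` and `random_sequences` are hypothetical, and the sketch assumes that each known enhancer is labeled with exactly one tissue and that two enhancers from the same tissue count as a "similar" pair (label 1). The three dissimilar categories (label 0) follow the summary.

```python
import itertools
import random


def build_training_pairs(enhancers_by_tissue, random_sequences, seed=0):
    """Return a shuffled list of (seq_a, seq_b, label) tuples.

    label 1 = similar enhancer activity, label 0 = dissimilar.
    Assumption: each enhancer belongs to exactly one tissue, and
    same-tissue enhancer pairs are treated as similar.
    """
    pairs = []
    tissues = sorted(enhancers_by_tissue)

    # Similar: two distinct enhancers active in the same tissue.
    for tissue in tissues:
        for a, b in itertools.combinations(enhancers_by_tissue[tissue], 2):
            pairs.append((a, b, 1))

    # Dissimilar (i): two enhancers active in two different tissues.
    for t1, t2 in itertools.combinations(tissues, 2):
        for a, b in itertools.product(enhancers_by_tissue[t1],
                                      enhancers_by_tissue[t2]):
            pairs.append((a, b, 0))

    # Dissimilar (ii): one enhancer and one random genomic sequence.
    all_enhancers = [s for t in tissues for s in enhancers_by_tissue[t]]
    for a, b in itertools.product(all_enhancers, random_sequences):
        pairs.append((a, b, 0))

    # Dissimilar (iii): two random genomic sequences.
    for a, b in itertools.combinations(random_sequences, 2):
        pairs.append((a, b, 0))

    # Shuffle deterministically so the classes are interleaved.
    random.Random(seed).shuffle(pairs)
    return pairs
```

A network trained on such pairs with a sigmoid output would produce the score between 0 and 1 mentioned above. Enumerating every pair grows quadratically with the number of sequences, so a real data set would presumably need sampling and class balancing, which this sketch leaves out.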
