Reliable post hoc interpretations of deep learning in genomics

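Basic information

Grant number:
10638753
Principal investigator:
Peter K Koo
Amount:
approx. $384,000
Host institution:
COLD SPRING HARBOR LABORATORY
Host institution country:
United States
Project category:
Fiscal year:
2023
Funding country:
United States
Project period:
2023-08-01 to 2027-04-30
Project status:
Not yet completed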

Source:
https://reporter.nih.gov/project-details/10638753
Keywords:
ATAC-seq Acceleration Address Affect Benchmarking Binding Biological Biological Assay Biological Sciences Biology Chromatin Communities Complex Computer software Computing Methodologies DNA DNA Sequence Data Development Disease Ensure Future Genetic Transcription Genomics Goals Individual Knowledge Learning Machine Learning Maps Methods Modeling Noise Nucleotides Performance Population Positioning Attribute Recurrence Regulation Regulatory Element Resolution Series Signal Transduction Single Nucleotide Polymorphism Specific qualifier value Structure TensorFlow Time Training Transcriptional Regulation Translating Untranslated RNA base bioinformatics tool biological systems cell type computerized tools deep learning deep neural network direct application functional genomics genomic data geometric structure human disease improved innovation insight machine learning method new technology open source performance tests predictive modeling syntax technology research and development tool transcription factor user-friendly

Project summary

Understanding how transcription factors coordinate to bind non-coding DNA provides mechanistic insights into transcriptional regulation. Recent developments in deep neural networks (DNNs) have revolutionized our ability to study regulatory genomics. While DNNs have demonstrated improved predictions compared to previous methods based on traditional computational genomics, their low interpretability has earned them a reputation as a black box. To address this gap, post hoc model interpretability methods have emerged to interrogate important features that the network has learned. Of these, attribution maps have demonstrated promise, providing importance scores for each nucleotide in a given sequence; these have a natural interpretation as single-nucleotide variant effects. In principle, attribution maps should contain information to identify motifs that are important for cell-type specific regulatory functions and annotate their positions at base resolution. However, attribution maps are often noisy in practice; in addition to motifs, they contain spurious importance scores for arbitrary nucleotides for reasons that are not well established. Despite their promise, interpreting a DNN through attribution maps remains challenging.
Here we propose three complementary aims that serve to maximize the biological insights that we can achieve from attribution maps for genomic DNNs. In Aim 1, we will develop a model selection framework to identify the optimal DNN from a set of candidate DNNs that yields high generalization performance and interpretable attribution maps. In Aim 2, we will develop robust training strategies based on regularization and data augmentations tailored for genomics, with the broader aim of ensuring that DNNs yield high-quality attribution maps and high generalization. In Aim 3, we will develop and employ interpretable computational methods to directly analyze attribution maps to facilitate discovery of functional motifs and annotate their positions. Each aim will be implemented as open-source software in TensorFlow and PyTorch. As the number of deep learning applications in genomics is rising quickly, the biomedical community will greatly benefit from these user-friendly computational tools, which enable the deployment of robust training and interpretability analysis for any DNN trained on functional genomics assays. This, in turn, will drive new discoveries in cis-regulatory biology across the many biological systems to which deep learning has already been applied and the new applications that will continue to emerge in the future.
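
The summary describes attribution maps as per-nucleotide importance scores that can be read as single-nucleotide variant effects. A minimal sketch of that idea follows, using in silico mutagenesis on a toy scoring function. This is an illustration only, not the project's software: `toy_model`, its `TGAC` motif, and the example sequence are hypothetical stand-ins for a trained DNN and real data, and the project's own tools may compute attributions differently (for example, with gradients).

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, ALPHABET.index(base)] = 1.0
    return x

def toy_model(x, motif="TGAC"):
    """Hypothetical stand-in for a trained DNN: best motif match over all windows."""
    w = one_hot(motif)
    k = len(motif)
    return max(float(np.sum(x[i:i + k] * w)) for i in range(x.shape[0] - k + 1))

def ism_attribution(x, model):
    """Change in model output for every possible single-nucleotide substitution."""
    ref = model(x)
    effects = np.zeros_like(x)
    for pos in range(x.shape[0]):
        for base in range(4):
            mutant = x.copy()
            mutant[pos] = 0.0
            mutant[pos, base] = 1.0
            effects[pos, base] = model(mutant) - ref
    return effects

seq = "AATGACAA"  # hypothetical sequence containing the toy motif at positions 2-5
effects = ism_attribution(one_hot(seq), toy_model)

# Per-position importance: the largest drop in output caused by any substitution.
importance = -effects.min(axis=1)
print(importance)
```

In this noise-free toy, only the four motif positions receive non-zero importance. Attribution maps from real DNNs, as the summary notes, also carry spurious scores outside motifs.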
