Interpretable and extendable deep learning model for biological sequence analysis and prediction

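Basic Information

Grant number:
10409152
Principal investigator:
DONG XU
Amount:
$234,800
Host institution:
UNIVERSITY OF MISSOURI-COLUMBIA
Host institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Start and end dates:
2018-05-01 to 2023-04-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/10409152
Keywords: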
ATAC-seq Address Algorithmic Software Alzheimer's Disease Attention Benchmarking Bioinformatics Biological Biological Models Biology Biotechnology Cell Communication Cells Code Collaborations Communication Communities Complex Analysis Computer Analysis Computers Data Data Analyses Data Set Databases Development Dimensions Disease Environment Evaluation Formulation Genes Genomics Graph Head Heterogeneity Individual Knowledge Machine Learning Malignant Neoplasms Measures Medicine Methods Modeling Multiomic Data Nature Ohio Performance Play Problem Formulations Process Public Health Publishing Regulator Genes Reporting Research Research Personnel Resources Role Running Sequence Analysis Site Source Structure System Techniques Technology Testing Training Universities Validation Visualization Work analysis pipeline base cell type data complexity data format data integration deep learning deep learning algorithm experience flexibility improved innovation insight learning community learning strategy method development neural network novel online resource parent grant single cell sequencing single cell technology single-cell RNA sequencing tool tool development transcriptomics web interface web portal web site

Project Summary

Single-cell sequencing technologies provide great opportunities for studying biology and medicine, but computational analysis is often the bottleneck in revealing biological insights and defining the cellular heterogeneity underlying the data. Applications of machine learning (ML), especially deep learning, hold great promise for addressing these challenges. While ML studies from various labs, including the PI’s lab, have made significant progress along this line, the involvement of the ML community in single-cell data analysis remains limited because of the barriers of technological complexity and required biological knowledge. To attract more ML experts into this field, the PI proposes to make large-scale single-cell sequencing data ML-ready and to provide an ML-friendly development environment. Specific aims include:

(1) Collect, process, and manage diverse single-cell sequencing data to make them ML-ready. We will collect single-cell sequencing data from public sources and convert them into formats that are efficient for storage and handling. The data will be processed with multiple options, such as imputation, normalization, and dimension reduction, using a pipeline to be developed.

(2) Configure the data into benchmarks. We will use the collected data to build benchmarks, gather public benchmarks, and encourage the community to submit their own benchmarks. The data will be divided into training, validation, and test sets in multiple settings, including a minimum viable benchmark to support efficient method development and a comprehensive benchmark for full evaluations. We will develop utilities that evaluate results against a set of assessment measures and generate detailed reports. We will also select a set of public tools and run them on the benchmarks as baselines for others to compare against.

(3) Provide an integrated development environment (IDE) to support partial method development. We will build an IDE for single-cell sequencing analysis method development, with plug-and-play features at the code level and a web interface, so that ML researchers can contribute and test even minimal new ideas. A report will be provided containing evaluation metrics and usage of computer resources, comparisons with some public tools, and downstream visualization and interpretation.

The newly formatted data, the benchmarks, and the method development and assessment environment will be available on GitHub and on the in-house single-cell data analysis web portal DeepMAPS. The proposed research is a natural extension of the parent grant (R35-GM126985), which aims to develop deep-learning algorithms, tools, and web resources for the analysis and prediction of biological sequences, including (1) developing general unsupervised representations and making deep-learning models interpretable for understanding biological mechanisms and generating hypotheses; (2) applying deep-learning models to a wide range of bioinformatics problems; and (3) making the data, models, and tools freely accessible to the research community. Thanks to the flexibility of the R35 mechanism, the PI’s lab has extended these methods to single-cell data analyses, which has prepared the lab well for the proposed tasks.
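
As a rough illustration of the kind of processing and benchmark splitting described in aims (1) and (2), the sketch below normalizes a cell-by-gene count matrix, reduces its dimension, and divides the cells into training, validation, and test sets. It is not the project's pipeline: the function names, the target sum of 10,000, the 70/15/15 split, and the use of library-size normalization and SVD-based PCA are all assumptions chosen for illustration, and the count matrix is synthetic.

```python
import numpy as np


def normalize_log(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    totals[totals == 0] = 1.0  # leave empty cells as all zeros
    return np.log1p(counts / totals * target_sum)


def pca_reduce(x, n_components=2):
    """Project gene-centered data onto its top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def split_indices(n_cells, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle cell indices and cut them into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)
    n_train = int(round(train_frac * n_cells))
    n_val = int(round(val_frac * n_cells))
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],
    )


# Synthetic stand-in for a count matrix: 100 cells x 50 genes (hypothetical data).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 50)).astype(float)

normalized = normalize_log(counts)
embedding = pca_reduce(normalized, n_components=5)
train_idx, val_idx, test_idx = split_indices(len(embedding))
```

Fixing the random seed keeps the split reproducible, which matters when a benchmark is meant to be shared so that different methods are compared on the same cells.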
