
Interpretable and extendable deep learning model for biological sequence analysis and prediction

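Basic Information

Grant number:
10395451
Principal investigator:
DONG XU
Amount:
$456,400
Host institution:
UNIVERSITY OF MISSOURI-COLUMBIA
Host institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-05-01 to 2023-07-31
Project status:
Completed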

Source:
https://reporter.nih.gov/project-details/10395451
Keywords:
Algorithmic Software; Amino Acid Sequence; Area; Base Sequence; Big Data; Bioinformatics; Biological; Biological Models; Biology; Biomedical Research; Communities; Computational Biology; Computational algorithm; DNA; DNA Sequence; Data; Data Analyses; Development; Genotype; Goals; Healthcare; Information Systems; Knowledge; Label; Learning; Light; Machine Learning; Malignant Neoplasms; Medical; Medicine; Methods; Microbe; Modeling; Mutation; Mutation Analysis; Paper; Performance; Phenotype; Plants; Plug-in; Post-Translational Protein Processing; Property; Proteins; Public Health; Publishing; RNA; RNA Sequences; Research; Resource Informatics; Sequence Analysis; Series; Source; System; Technology; Work; computerized tools; deep learning; deep learning algorithm; deep learning model; design; drug development; improved; in silico; indexing; integrated circuit; learning strategy; machine learning method; mobile application; novel; online resource; open source; personalized diagnostics; personalized medicine; precision medicine; protein structure function; protein structure prediction; software systems; supervised learning; synthetic biology; tool; unsupervised learning

Project Abstract

Bioinformatics and computational biology have become the core of biomedical research. The work of the PI, Dr. Dong Xu, in this area focuses on the development of novel computational algorithms, software, and information systems, as well as on broad applications of these tools and other informatics resources to diverse biological and medical problems. He works on many research problems in protein structure prediction, post-translational modification prediction, high-throughput biological data analyses, in silico studies of plants, microbes, and cancers, biological information systems, and mobile app development for healthcare. He has published more than 300 papers, with about 12,000 citations and an H-index of 55.

In this project, the PI proposes to develop deep-learning algorithms, tools, and web resources for analyses and predictions of biological sequences, including DNA, RNA, and protein sequences. The availability of these sequence data provides emerging opportunities for precision medicine and other areas, while deep learning, a cutting-edge technology in machine learning, presents a powerful new method for analyses and predictions of biological sequences. With rapidly accumulating sequence data and the fast development of deep-learning methods, there is an urgent need to systematically investigate how best to apply deep learning in sequence analyses and predictions.

For this purpose, the PI will develop cutting-edge deep-learning methods with the following goals for the next five years:

(1) Develop a series of novel deep-learning methods and models that specifically target biological sequence analyses and predictions: (a) general unsupervised representations of DNA/RNA, protein, and SNP/mutation sequences that capture both local and global features for various applications; (b) methods that make deep-learning models interpretable, for understanding biological mechanisms and generating hypotheses; (c) "rule learning", which abstracts the underlying "rules" by combining unsupervised learning on large unlabeled data with supervised learning on small labeled data, so that the model can classify new unlabeled data.

(2) Apply the proposed deep-learning models to DNA/RNA sequence annotation, genotype-phenotype analyses, cancer mutation analyses, protein function/structure prediction, protein localization prediction, and protein post-translational modification prediction. The PI will exploit particular properties associated with each of these problems to improve the deep-learning models. He will develop a set of related prediction and analysis tools, which will improve on state-of-the-art performance and shed light on the related biological mechanisms.

(3) Make the data, models, and tools freely accessible to the research community. The system will be designed to be modular and open-source, and will be available through GitHub. Its components will work like integrated circuit modules, which are universal and ready to plug into different applications. The PI will develop a web resource for biological sequence representations, analyses, and predictions, as well as tutorials that help biologists with no computational background apply deep learning to their specific research problems.
