Deep learning for protein subcellular/sub-organelle localizations and localization motifs

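Basic Information

Grant number:
9768571
Principal investigator:
DONG XU
Amount:
About $205,300
Host institution:
UNIVERSITY OF MISSOURI-COLUMBIA
Host institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-09-01 to 2021-08-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9768571
Keywords: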
Address; Amino Acid Sequence; Architecture; Attention; Base Sequence; Bioinformatics; Cells; Classification; Computer software; Computing Methodologies; Data; Data Sources; Detection; Development; Disease; Engineering; Enzymes; Eukaryotic Cell; Experimental Designs; Generations; Hybrids; Imagery; Label; Machine Learning; Metabolic Diseases; Methodology; Methods; Mitochondria; Modeling; N-terminal; Neural Network Simulation; Nobel Prize; Organelles; Pattern; Peptide Signal Sequences; Peptides; Plants; Protein Family; Protein translocation; Proteins; Proteome; Public Health; Research; Research Personnel; Resolution; Series; Signal Transduction; Supervision; Techniques; Technology; Training; Ubiquitination; Work; base; bioinformatics tool; computerized tools; convolutional neural network; deep learning; design; improved; innovation; interest; learning network; learning strategy; novel; online resource; predictive modeling; protein function; recurrent neural network; success; therapy design; tool

Project Summary

Eukaryotic cells have diverse cellular components, including subcellular organelles and sub-organelle compartments. The accurate targeting of proteins to these components is crucial for establishing and maintaining cellular organization and function. Mis-localization of proteins is often associated with metabolic disorders and diseases. However, the vast majority of proteins lack subcellular/sub-organelle localization annotation. Compared with experimental methods, computational prediction of protein localization provides an efficient and effective way to support proteome annotation and experimental design. Current prediction tools for protein localization have significant room for improvement. In addition, no tool can predict localization at sub-organelle resolution or identify internal localization signals. Deep learning, a cutting-edge technology in machine learning, presents a new opportunity for this classical bioinformatics problem. The recent availability of high-throughput localization data also makes it feasible to train deep-learning models well. The PI’s lab has demonstrated some success on a special case, i.e., predicting mitochondrial localizations for plants using deep learning. In this project, the PI proposes to develop new methods and a standalone toolkit for accurate and scalable protein localization prediction at the subcellular and sub-organelle levels, as well as for characterization of localization motifs (including novel internal motifs). The general approach is to design a semi-supervised deep-learning method that uses both annotated protein sequences with known localization and unannotated protein sequences as training data. An unsupervised deep-learning approach will be used to learn a general representation of protein sequences that characterizes both their local and global features.
By visualizing and characterizing the deep-learning models, novel, interpretable protein sequence patterns will be predicted as putative targeting peptides and compared with known localization signals. We will also use the methods to be developed, together with the unsupervised models to be trained on all protein sequences, as a general framework for other sequence-based prediction problems that predict the label of a protein and the key residues contributing to that label. We will make the platform highly customizable and apply it to three applications: ubiquitination protein prediction, enzyme EC number prediction, and protein family/subfamily classification. The innovative contributions to protein sequence-based analyses and predictions include: (1) using raw amino acid sequences as training inputs without feature engineering; (2) using the large amount of unannotated data in unsupervised deep learning to characterize a general protein feature representation; (3) identifying potential targeting signals (especially internal motifs) by decoding the trained deep-learning models, augmented with sophisticated attention mechanisms; (4) detecting multiple-organelle targeting and sub-organelle localizations by a novel hierarchical multi-label architecture; and (5) combining features from different data sources by a multiplicative fused CNN model.
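As a rough illustration of contributions (1) and (4), the sketch below shows one common way to turn a raw amino acid sequence into a model input without feature engineering, and one way to build a multi-label target in which a sub-organelle label also activates its parent organelle. It is not the project's implementation: the function names, the padding scheme, and the small label hierarchy are all assumptions made for this example.

```python
# Illustrative sketch only. Names, padding scheme, and the label hierarchy
# below are assumptions for this example, not taken from the project.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a raw sequence as max_len rows of 20 values.

    Positions past the end of the sequence (padding) and non-standard
    residues are left as all-zero rows; longer sequences are truncated.
    """
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(sequence):
            idx = AA_INDEX.get(sequence[pos].upper())
            if idx is not None:
                row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical two-level hierarchy: organelle -> sub-organelle compartments.
HIERARCHY = {
    "mitochondrion": ["outer membrane", "inner membrane", "matrix"],
    "chloroplast": ["stroma", "thylakoid"],
    "nucleus": [],
}

# Flatten the hierarchy into one ordered label list, parents first.
LABELS = []
for organelle, compartments in HIERARCHY.items():
    LABELS.append(organelle)
    LABELS.extend(f"{organelle}/{c}" for c in compartments)


def multi_label_target(annotations):
    """Build a multi-hot target vector over LABELS.

    A sub-organelle annotation also switches on its parent organelle, and
    several organelles may be active at once (multi-organelle targeting).
    """
    active = set()
    for label in annotations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        active.add(label)
        active.add(label.split("/")[0])  # parent organelle
    return [1 if label in active else 0 for label in LABELS]
```

For instance, `multi_label_target(["mitochondrion/matrix", "nucleus"])` marks the mitochondrion, its matrix, and the nucleus as active while leaving the other entries at zero. A real model would consume such targets with one sigmoid output per label rather than a single softmax, since the labels are not mutually exclusive.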
