MARKOVIAN MODELS FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY

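Basic Information

Grant number: 8364375
Principal investigator: Vanathi Gopalakrishnan
Amount: $1,100
Host institution: CARNEGIE-MELLON UNIVERSITY
Host institution country: United States
Project category:
Fiscal year: 2011
Funding country: United States
Project period: 2011-09-15 to 2013-07-31
Project status: Completed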

Project Abstract

This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. Primary support for the subproject and the subproject's principal investigator may have been provided by other sources, including other NIH sources. The Total Cost listed for the subproject likely represents the estimated amount of Center infrastructure utilized by the subproject, not direct funding provided by the NCRR grant to the subproject or subproject staff.

Biomedical research has been revolutionized by technological advances that have led to a massive accumulation of data. These data now need to be mined to draw actionable insights into the underlying biological processes. Complex machine learning algorithms are being developed to analyze these large datasets automatically and to produce robust models that explain the observed data. Such models are then used to identify patterns in the data that help solve challenging decision problems such as the diagnosis and prognosis of disease. Our research involves one such class of algorithms, Hidden Markov Models, which are used extensively in sequential data mining problems in biology. Our particular focus is the development of novel algorithms for identifying and quantifying protein sequences in complex biological samples using data produced by mass spectrometers. Such analysis will lead to molecular characterization of target conditions such as disease states.

Our algorithms learn models from large training datasets and are computationally intensive. In addition, to learn a robust model that performs well across a variety of future test data, we propose large-scale experiments with different model topologies and features. These experiments require learning hundreds of different models, amounting to many days of computation. However, the entire experiment can be parallelized trivially, since all the models can be learned independently of one another; hence the need for machines that can run multiple jobs in parallel. Our in-house algorithms are implemented in Python and can take advantage of multiple processing units or cores. After speaking with consultants at PSC, we were advised that the Blacklight machines are most suitable for our needs.
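
The independence of the per-model jobs described above can be illustrated with a minimal Python sketch. It is not the project's code: the toy discrete HMM, the "topology" grid (number of states and a self-transition probability), and every function name below are hypothetical stand-ins chosen for illustration. Scoring a fixed model with the scaled forward algorithm takes the place of real training, and the standard-library `concurrent.futures.ProcessPoolExecutor` takes the place of whatever job scheduling the authors actually used.

```python
import itertools
import math
from concurrent.futures import ProcessPoolExecutor


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a symbol sequence under a discrete HMM.

    Uses the scaled forward algorithm: alpha is renormalized at every
    step and the logs of the scaling factors are summed. Assumes the
    sequence is non-empty and has non-zero probability under the model.
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def build_toy_hmm(n_states, self_prob):
    """Hypothetical 'topology': sticky states over a two-symbol alphabet."""
    start = [1.0 / n_states] * n_states
    if n_states == 1:
        trans = [[1.0]]
    else:
        off = (1.0 - self_prob) / (n_states - 1)
        trans = [
            [self_prob if r == s else off for s in range(n_states)]
            for r in range(n_states)
        ]
    # Even-numbered states favor symbol 0, odd-numbered states favor symbol 1.
    emit = [[0.8, 0.2] if s % 2 == 0 else [0.2, 0.8] for s in range(n_states)]
    return start, trans, emit


def evaluate_config(config):
    """One independent job: build the model for a config and score the data."""
    n_states, self_prob, obs = config
    start, trans, emit = build_toy_hmm(n_states, self_prob)
    return (n_states, self_prob, forward_log_likelihood(obs, start, trans, emit))


def make_grid(state_counts, self_probs, obs):
    """Cartesian product of the hypothetical topology settings."""
    return [(n, p, obs) for n, p in itertools.product(state_counts, self_probs)]


def run_grid(configs, workers=None):
    """Evaluate every config: serially if workers is None, else in a process pool."""
    if workers is None:
        return [evaluate_config(c) for c in configs]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # No job depends on another, so a plain map over the grid suffices.
        return list(pool.map(evaluate_config, configs))
```

Calling `run_grid(make_grid([1, 2, 3], [0.5, 0.9], obs), workers=4)` from under an `if __name__ == "__main__":` guard would spread the six jobs over four worker processes. Because the jobs share no state, results come back in grid order and match the serial run.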
