CAREER: BCSP: Methods for analyzing sequencing data from repetitive genomes
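Basic information
- Award number: 1349906
- Principal investigator:
- Amount: $535,900
- Host institution:
- Host institution country: United States
- Project category: Continuing Grant
- Fiscal year: 2014
- Funding country: United States
- Project period: 2014-05-15 to 2019-04-30
- Project status: Completed
- Source:
- Keywords: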
Project abstract
Our understanding of how biological systems work is increasingly fueled by data from DNA sequencers. Sequencing has improved dramatically over the past several years, but the datasets produced by sequencers are unwieldy and difficult to interpret. This is especially true when the genome being studied contains many repeated stretches of DNA, as is the case for most mammals and plants. The goal of this project is to develop improved computational and statistical methods for analyzing DNA sequencing data, providing faster, more accurate, and more interpretable results to scientists studying organisms with repetitive genomes. These methods will be implemented as open source software tools made freely available to the research community. A successful project will result in these tools being widely adopted in the biological research community. Repetitive sequences are implicated in cellular regulation processes and associated with human disease. The integrated education plan also seeks to improve software for analyzing sequencing data by teaching computer science students the complete set of skills needed to make usable genomics software in the era of big data genomics. The PI will develop an undergraduate minor in computational biology, a graduate class covering methods for analyzing large sequencing datasets, and a competitive class project. A successful effort will result in more trained computer scientists joining and contributing to the study of computational biology and genomics.
The genomes of plants, mammals and other higher eukaryotes contain many repeated DNA sequences. 80% of the maize genome, for example, is covered by repetitive stretches of DNA. At the same time, computational tools typically model DNA as a string. This has advantages; it allows these tools to borrow methods developed for other strings, such as books and web pages, and apply them to DNA. But for repetitive genomes, the string abstraction fails to capture the prevalence of repeated DNA sequences related to each other through evolution. The PI proposes a broad research agenda based on the idea that analyzing sequencing data derived from repetitive genomes requires special, repeat-aware computational methods. The first project explores accurate and efficient methods for aligning sequence reads to repeat families. The PI proposes methods that exploit similarities between alignment problems to yield solutions that are more accurate than current approaches. The second project explores novel methods for predicting the probability that an alignment reported by a read aligner is correct, i.e., that the aligner correctly identified the read's point of origin. Downstream analysis tools use this quantity to weigh their confidence in evidence derived from the alignment. But estimating this quantity accurately is difficult, and there are no widely applicable approaches available now. The PI proposes a tandem simulation approach, whereby a simulator mimicking properties of a real dataset can provide training examples that in turn allow us to accurately predict these probabilities for real data. These methods address major deficiencies in common, everyday genomics analyses, which are made slower and less accurate by repetitive DNA.
The PI will also conduct an integrated set of curriculum building and outreach efforts. These efforts aim to bring computational biology to the attention of more students earlier in their training and to provide graduate and upper undergraduate students with a strong computational biology curriculum. Specifically, the PI will develop and implement an undergraduate minor in computational biology at Johns Hopkins University. Second, the PI will design a new graduate-level course covering contemporary methods for analyzing very large collections of sequence data. Finally, the PI will develop a competitive project called the Big Sequence Data Pentathlon that tests students' ability to design scalable, usable genomics analysis tools on a parallel computer system.
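The abstract's point that repeats make a read's point of origin ambiguous can be illustrated with a toy sketch. It is not the PI's method: it assumes exact string matching and treats every matching position as equally likely, whereas real aligners tolerate mismatches and use richer models. The reference string, the read sequences, and the function names `find_hits` and `prob_correct` are all hypothetical, chosen only for illustration.

```python
def find_hits(reference: str, read: str) -> list[int]:
    """Return every 0-based offset where `read` occurs exactly in `reference`."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one position later so overlapping occurrences are also found.
        start = reference.find(read, start + 1)
    return hits


def prob_correct(reference: str, read: str) -> float:
    """Toy estimate (assumption): with n equally good hits, a reported hit
    is the true origin with probability 1/n; with no hit, return 0.0."""
    hits = find_hits(reference, read)
    return 1.0 / len(hits) if hits else 0.0


# Hypothetical toy reference: the repeat unit occurs three times,
# separated by short unique spacer sequences.
REPEAT = "ACGTAC"
REFERENCE = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG" + REPEAT + "AATGC"

if __name__ == "__main__":
    for read in ("GGATC", REPEAT):
        print(read, find_hits(REFERENCE, read), prob_correct(REFERENCE, read))
```

A read drawn from a unique spacer has a single hit, so the toy estimate is 1.0, while a read drawn from the repeat unit has three hits and an estimate of 1/3. This gap is the kind of uncertainty that a downstream tool would need to account for when weighing evidence from an alignment.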
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Benjamin Langmead
Tackling the widespread and critical impact of batch effects in high-throughput data
- DOI: 10.1038/nrg2825
- Published: 2010-09-14
- Journal:
- Impact factor: 52.000
- Authors: Jeffrey T. Leek; Robert B. Scharpf; Héctor Corrada Bravo; David Simcha; Benjamin Langmead; W. Evan Johnson; Donald Geman; Keith Baggerly; Rafael A. Irizarry
- Corresponding author: Rafael A. Irizarry
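Similar NSFC (National Natural Science Foundation of China) grants
Clinical characteristics, BCSP31 gene amplification and sequencing, and epidemiological survey of neurobrucellosis in the Inner Mongolia Autonomous Region
- Award number: 82160248
- Approval year: 2021
- Funding amount: CNY 340,000
- Project category: Regional Science Fund Project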
Similar overseas grants
BCSP: Collaborative Research: ABI Development: Exploring Taxon Concepts (ETC) through analysing fine-grained semantic markup of descriptive literature
- Award number: 1643002
- Fiscal year: 2015
- Funding amount: $535,900
- Project category: Standard Grant
BCSP: The Emergence of Inactivity: Adaptive Task Allocation in Complex Distributed Systems, or Why Are There so Many Lazy Ants?
- Award number: 1455983
- Fiscal year: 2015
- Funding amount: $535,900
- Project category: Continuing Grant
Collaborative Research: ABI Innovation: BCSP: Understanding the design and usage of distributed biological networks
- Award number: 1356260
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Standard Grant
Collaborative Research: ABI Innovation: BCSP: Understanding the design and usage of distributed biological networks
- Award number: 1356505
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Continuing Grant
BCSP: ABI Innovation: Collaborative Research: Predicting changes in protein activity from changes in sequence by identifying the underlying Biophysical Conditional Random Field
- Award number: 1262457
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Standard Grant
Collaborative Research: BCSP: BIOMAPS: The Hydrodynamics of Predator Sensing and Escape in Zebrafish
- Award number: 1353937
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Continuing Grant
Collaborative Research: BCSP: BIOMAPS: The Hydrodynamics of Predator Sensing and Escape in Zebrafish
- Award number: 1354842
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Continuing Grant
BCSP: ABI Innovation: Collaborative Research: Predicting changes in protein activity from changes in sequence by identifying the underlying Biophysical Conditional Random Field
- Award number: 1262469
- Fiscal year: 2014
- Funding amount: $535,900
- Project category: Standard Grant
RI: Small: BCSP: Robustness and Adaptation in Morphogenetic Collective Systems
- Award number: 1319152
- Fiscal year: 2013
- Funding amount: $535,900
- Project category: Standard Grant
RI: Medium: Collaborative Research: BCSP: Automated Parameter Tuning of Large-Scale Spiking Neural Networks
- Award number: 1302256
- Fiscal year: 2013
- Funding amount: $535,900
- Project category: Standard Grant