
Improving overlap-finding techniques for whole genome shotgun data


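Basic information

Award number:
0312360
Principal investigator:
James Yorke
Amount:
About $99,400
Institution:
University of Maryland, College Park
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2003
Funding country:
United States
Project period:
2003-07-15 to 2005-06-30
Project status:
Completed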

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0312360&HistoricalAwards=false
Keywords:
Improving overlap finding techniques whole

Project abstract

Yorke

A genome (the DNA in a cell) can be represented by a sequence of letters called "bases." A large genome can consist of billions of bases. Chemical techniques allow scientists to read only a few hundred bases at a time. The whole genome shotgun (WGS) assembly technique creates a draft of the sequence of a whole genome by selecting such short fragments at random from the genome, determining the sequence of the fragments, and then computationally re-assembling millions of these fragments. Two fragments are said to "overlap" if it is plausible, based on a comparison of their sequences, that they come from the same part of the genome. The goal of this project is to produce an extremely robust set of overlaps, using a combination of sophisticated error-correction techniques and "localizing" fragments, which validates an overlap by ensuring that both fragments come from the same vicinity of the genome.

Several issues complicate the determination of which pairs of fragments overlap. First, most genomes contain many "repeat regions," i.e., two or more almost identical copies of long stretches of sequence. Thus, two fragments that do not actually overlap may look like they do. Second, the random sampling technique results in many base errors: bases can be mis-read or missed entirely. These errors, combined with the fact that repeat regions usually differ slightly, make it very difficult to distinguish a spurious overlap from a true overlap in which one or both fragments contain read errors. Thus, if extreme care is not taken, it is easy to use a spurious overlap and thereby mistakenly connect distant parts of the genome. Preliminary results in collaboration with Celera Genomics, the Baylor College of Medicine, and The Institute for Genomic Research (TIGR) have demonstrated that the investigator's current techniques can already produce more sequence at higher quality. The goal is to improve these techniques and make them widely available.

The determination and interpretation of genetic information is one of the great challenges of the twenty-first century. The genome, i.e., all the DNA in a cell, is the molecular basis of diversity and the cornerstone of genetic information. Draft genomes have been obtained for human, mouse, and some insects, fish, plants, and bacteria. This is a start, but a full understanding of biological processes cannot be had by studying the genomes of only a handful of species. The federal government is spending about $100 million per year generating sequence data. In the first stage, millions of small pieces are sampled from the genome. In the second stage, called "assembly," these pieces are re-assembled on a computer like a giant jigsaw puzzle. The puzzle is complicated by two facts: first, many of the puzzle pieces have small errors that make them mis-fit against pieces that they SHOULD fit with; and second, many pieces that should NOT go together actually fit together quite well. This makes it extremely difficult to correctly assemble a genome.

There are two ways to decrease the ambiguities. First, one could generate more pieces. However, each new piece costs about $2, and one would need to generate millions of new pieces to have a significant effect on assembly quality. The investigators use a second route: they attempt to squeeze as much information out of the existing pieces as possible. The latter route is substantially cheaper, and there is still much room for improvement over existing techniques. The investigators are using sophisticated mathematics to help discern with extreme precision those pairs of pieces that do, and those that do not, fit together. Preliminary results of the investigators, in collaboration with several large sequencing centers, have demonstrated that using their techniques to "pre-process" the pieces can produce more of the genome, with fewer errors. This project aims at extending these ideas further and making them freely accessible to all investigators. The impact on the federal genome (biotechnology) projects is potentially great.
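
The notion of "overlap" used in the abstract can be illustrated with a minimal Python sketch. It is not the investigators' method, which the abstract does not specify. It is a naive suffix-prefix comparison that tolerates a fixed number of substituted bases. The function name `suffix_prefix_overlap`, the parameters `min_len` and `max_mismatches`, and the example sequences are all hypothetical, and skipped bases (insertions or deletions) are not handled.

```python
def suffix_prefix_overlap(a, b, min_len=8, max_mismatches=0):
    """Find the longest suffix of `a` that matches a prefix of `b`.

    A candidate is accepted when the two strings differ in at most
    `max_mismatches` positions (substitutions only; skipped bases are
    not modeled). Returns (overlap_length, mismatches) or None.
    """
    # Try the longest possible overlap first, then shrink.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return length, mismatches
    return None


# Hypothetical fragments: the last 10 bases of read_a equal the first 10 of read_b.
read_a = "ACGTACGGATTCAGGCTA"
read_b = "ATTCAGGCTATTGCCA"
# The same read with one mis-read base (G -> C at index 6).
read_b_err = "ATTCAGCCTATTGCCA"

print(suffix_prefix_overlap(read_a, read_b))                        # (10, 0)
print(suffix_prefix_overlap(read_a, read_b_err))                    # None
print(suffix_prefix_overlap(read_a, read_b_err, max_mismatches=1))  # (10, 1)
```

The second call shows why read errors matter: an exact comparison misses a true overlap after a single mis-read base. Raising the tolerance recovers it, but the same tolerance would also accept two fragments drawn from slightly different copies of a repeat region, which is the ambiguity the abstract describes.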
