Improving overlap-finding techniques for whole genome shotgun data
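Basic information
- Award number: 0312360
- Principal investigator:
- Amount: $99,400
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2003
- Funding country: United States
- Period: 2003-07-15 to 2005-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract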
Yorke

A genome (the DNA in a cell) can be represented by a sequence of letters called "bases." A large genome can consist of billions of bases. Chemical techniques allow scientists to read only a few hundred bases at a time. The whole genome shotgun (WGS) assembly technique creates a draft of the sequence of a whole genome by selecting such short fragments at random from the genome, determining the sequence of the fragments, and then computationally re-assembling millions of these fragments. Two fragments are said to "overlap" if it is plausible that they come from the same part of the genome, based on a comparison of their sequences. The goal of this project is to focus efforts on producing an extremely robust set of overlaps, using a combination of sophisticated error-correction techniques, as well as "localizing" fragments to validate overlaps by ensuring that both fragments come from the same vicinity of the genome.

Several issues complicate the determination of which pairs of fragments overlap. First, most genomes contain many "repeat regions," i.e., two or more almost identical copies of long stretches of sequence. Thus, two fragments that do not actually overlap may look like they do. Second, the random sampling technique results in many base errors: bases can be misread or missed entirely. These errors, combined with the fact that repeat regions usually differ slightly, make it very difficult to distinguish a spurious overlap from a true overlap in which one or both fragments contain read errors. Thus, if extreme care is not taken, it is easy to use a spurious overlap and thereby mistakenly connect distant parts of the genome. Preliminary results in collaboration with Celera Genomics, the Baylor College of Medicine, and The Institute for Genomic Research (TIGR) have demonstrated that the investigator's current techniques can already produce more sequence at higher quality. The goal is to improve these techniques and make them widely available.
The determination and interpretation of genetic information is one of the great challenges of the twenty-first century. The genome, i.e., all the DNA in a cell, is the molecular basis of diversity and the cornerstone of genetic information. Draft genomes have been obtained for human, mouse, and some insects, fish, plants, and bacteria. This is a start, but a full understanding of biological processes cannot be had by studying the genomes of only a handful of species. The federal government is spending about 100 million dollars per year generating sequence data. In the first stage, millions of small pieces are sampled from the genome. The second stage is called "assembly," when these pieces are re-assembled on a computer like a giant jigsaw puzzle. The puzzle is complicated by two facts: first, many of the puzzle pieces have small errors that make them mis-fit against pieces that they SHOULD fit with; and second, many pieces that should NOT go together actually fit together quite well. This makes it extremely difficult to correctly assemble a genome.

There are two ways to decrease the ambiguities. First, one could generate more pieces. However, each new piece costs about $2, and one would need to generate millions of new pieces to have a significant effect on assembly quality. The investigators use a second route: they attempt to squeeze as much information out of the existing pieces as possible. The latter route is substantially cheaper, and there is still much room for improvement here over existing techniques. The investigators are using sophisticated mathematics to help discern with extreme precision those pairs of pieces that do, and those that do not, fit together. Preliminary results of the investigators, in collaboration with several large sequencing centers, have demonstrated that using their techniques to "pre-process" the pieces can produce more of the genome, with fewer errors. This project aims at extending these ideas further and making them freely accessible to all investigators. The impact on the federal genome (biotechnology) projects is potentially great.
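To make the notion of an "overlap" concrete, here is a minimal Python sketch. It is an illustration only, not the investigators' method: the function name `best_overlap`, the parameters `min_len` and `max_mismatch_rate`, and the toy reads are all assumptions introduced here. It looks for the longest suffix of one fragment that matches a prefix of another while tolerating a small fraction of substituted bases. Missed bases (insertions and deletions) are deliberately not handled.

```python
def best_overlap(a, b, min_len=5, max_mismatch_rate=0.1):
    """Find the longest suffix of `a` that matches a prefix of `b`.

    Hypothetical helper for illustration. Only substitutions are
    tolerated; insertions and deletions are not modeled.
    Returns (overlap_length, mismatches), or None if no overlap of at
    least `min_len` bases stays within the mismatch tolerance.
    """
    # Try the longest candidate overlap first, then shorter ones.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch_rate * length:
            return length, mismatches
    return None


# Toy reads (invented for this example).
read_a = "ACGTTGCATGCA"
read_b = "GCATGCATTACG"      # starts with the last 7 bases of read_a
read_b_err = "GCAAGCATTACG"  # same read with one misread base

clean = best_overlap(read_a, read_b)
tolerant = best_overlap(read_a, read_b_err, max_mismatch_rate=0.2)
strict = best_overlap(read_a, read_b_err, max_mismatch_rate=0.0)
```

The trade-off described in the abstract shows up directly in `max_mismatch_rate`: a strict threshold rejects a true overlap that contains a read error, while a tolerant threshold would also accept two fragments drawn from different, slightly differing copies of a repeat. A test like this one cannot tell those cases apart on its own, which is why the project pairs error correction with localizing fragments.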
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by James Yorke
What is the graph of a dynamical system?
- DOI: 10.1007/s11071-025-11466-9
- Published: 2025-07-02
- Journal:
- Impact factor: 6.000
- Authors: Chirag Adwani; Roberto De Leo; James Yorke
- Corresponding author: James Yorke
Other grants by James Yorke
Mathematical Modeling of DNA Repeats and HIV Epidemics
- Award number: 0616585
- Fiscal year: 2006
- Funding amount: $99,400
- Project type: Continuing Grant
Chaos with Multiple Positive Lyapunov Exponents
- Award number: 9870183
- Fiscal year: 1998
- Funding amount: $99,400
- Project type: Continuing Grant
Mathematical Sciences: Chaos with Multiple Positive Lyapunov Exponents
- Award number: 9423843
- Fiscal year: 1995
- Funding amount: $99,400
- Project type: Continuing Grant
Attractor Reconstruction from Experimental Data
- Award number: 9116391
- Fiscal year: 1992
- Funding amount: $99,400
- Project type: Continuing Grant
Mathematical Sciences: Bifurcation and Global Continuation
- Award number: 8117967
- Fiscal year: 1982
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7818221
- Fiscal year: 1979
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7624432
- Fiscal year: 1976
- Funding amount: $99,400
- Project type: Continuing Grant
Qualitative Behavior For Generalized Dynamical Processes
- Award number: 7424310
- Fiscal year: 1974
- Funding amount: $99,400
- Project type: Continuing Grant
Similar overseas grants
Exploring the overlap between neurodevelopmental disorders and traits with adolescent hypomania
- Award number: 2886920
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Studentship
The cardiovascular consequences of sleep apnea plus COPD (Overlap syndrome)
- Award number: 10733384
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
Domestic Abuse Proceedings In Family Courts: Overlap And Pathways In Private And Public Family Justice
- Award number: ES/X011399/1
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Fellowship
Integrating Epidemiologic and Genomic Data to Elucidate the Genetic Overlap Between Congenital Anomalies and Pediatric Cancer
- Award number: 10749761
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
The Changing Structure of the International Court of Justice: Overlap of Dispute Settlement and International Control
- Award number: 23K01112
- Fiscal year: 2023
- Funding amount: $99,400
- Project type: Grant-in-Aid for Scientific Research (C)
The overlap of speech production and verbal working memory
- Award number: 10735031
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
Bilingual discourse comprehension: How is text integration affected by overlap in language?
- Award number: 10629501
- Fiscal year: 2023
- Funding amount: $99,400
- Project type:
An intellectual overlap of pure mathematics and engineering techniques targeted to develop self-reliant, efficient, and clean artificial intelligence processors
- Award number: 577214-2022
- Fiscal year: 2022
- Funding amount: $99,400
- Project type: Alliance Grants
Comparing Overlap Between Existing Bilingualism Questionnaires: A Content Analysis Research Proposal
- Award number: 573665-2022
- Fiscal year: 2022
- Funding amount: $99,400
- Project type: University Undergraduate Student Research Awards
Efficient carrier transport in organic semiconductors through molecular orbital overlap engineering
- Award number: 22H01933
- Fiscal year: 2022
- Funding amount: $99,400
- Project type: Grant-in-Aid for Scientific Research (B)