BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data
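Basic information
- Grant number: 8703739
- Principal investigator:
- Amount: $204,200
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2013
- Funding country: United States
- Project period: 2013-07-19 to 2015-05-31
- Project status: Completed
- Source: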
- Keywords: Algorithms; Alleles; Architecture; Bacteria; Benchmarking; Binding; Bioinformatics; Biological; Cells; Communities; Computer software; DNA; DNA Resequencing; DNA Sequence; Data; Data Set; Detection; Diploidy; Distal; Environment; Evaluation; Gene Expression Profile; Gene Frequency; Generations; Genome; Human; Individual; Left; Maps; Memory; Metagenomics; Minor; Modeling; Noise; Partner in relationship; Population; Process; RNA; RNA Sequences; Reading; Reference Document; Running; Sampling; Sequence Analysis; Shotgun Sequencing; Signal Transduction; Sorting - Cell Movement; Stream; Structure; Techniques; Testing; Theoretical model; Time; Update; Variant; base; cyber infrastructure; data reduction; data structure; digital; improved; insight; metagenome; microbial genome; novel; shared memory; tool; tumor
Project abstract
DESCRIPTION (provided by applicant): We will soon be able to exhaustively sequence the DNA and RNA of entire communities of bacteria, as well as every individual cell of a tumor. These two very different applications of sequencing share the need to rapidly and efficiently sort through large amounts of noisy sequence data (dozens to hundreds of terabases) to separate signal from noise and produce biological insight. However, current bioinformatics approaches for extracting information from these data cannot easily handle the vast amounts of data being acquired. The primary challenges in processing this sequence data are twofold: the relatively high error rate of 0.1-1% per base, and the volume of data we can now easily acquire with sequencers such as the Illumina HiSeq. For years, sequencing capacity has been doubling every 6 months, significantly faster than compute capacity. Since almost all extant bioinformatics analysis approaches require multiple passes across the primary data, and many analysis algorithms have not been parallelized, bioinformatics analysis capacity continues to lag ever further behind data generation capacity. In addition, many of the existing software packages cannot easily be retooled to take advantage of many-core or GPU algorithms, and hence will not benefit from expected advances in compute capacity and cyberinfrastructure.
We propose to develop and implement novel streaming approaches for lossy compression and error correction in shotgun sequencing data. Our algorithms are few-pass (< 2 passes), require no sample-specific information, and can be implemented in fixed or low memory; moreover, they are amenable to parallelization and can run efficiently in many-core environments. When implemented as a prefilter to existing analysis packages, our approaches will eliminate or correct the majority of errors in data sets, dramatically reducing the computational space and time requirements for downstream analysis using existing packages. Moreover, we will provide novel capability by extending error correction approaches to mRNAseq and metagenomic data sets.
Intellectual Merit: We will develop a range of algorithms for space- and time-efficient compression and error correction of short-read DNA and RNA sequence data. These strategies will substantially increase the scalability of many downstream analysis applications, ranging from community analysis of metagenomes to resequencing analysis of humans. We will provide analyses describing the tradeoffs between space and time efficiency and sensitivity, and deliver tested, documented reference implementations of our approaches that can be used by the community for practical evaluation and incorporation into analysis tools. Our approaches will significantly impact short-read sequence analysis by introducing efficient and effective streaming approaches to the two most common types of short-read analysis, mapping and assembly.
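The abstract does not name a specific algorithm, so the following is only a minimal sketch of what a single-pass, fixed-memory prefilter could look like. It assumes one plausible design: a median k-mer abundance filter in the style of digital normalization, backed by a Count-Min Sketch. The names `CountMinSketch`, `kmers`, and `prefilter`, and the parameters `k`, `cutoff`, `width`, and `depth`, are hypothetical choices for illustration and are not taken from the project.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter: may overcount, never undercounts."""

    def __init__(self, width=1 << 16, depth=4):
        # Memory use is width * depth counters, independent of input size.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash call yields `depth` independent 4-byte slot indices.
        digest = hashlib.blake2b(item.encode(), digest_size=4 * self.depth).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.width
            for i in range(self.depth)
        ]

    def add(self, item):
        for table, slot in zip(self.tables, self._slots(item)):
            table[slot] += 1

    def count(self, item):
        # The minimum across rows is the least-inflated estimate.
        return min(table[slot] for table, slot in zip(self.tables, self._slots(item)))


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def prefilter(reads, k=20, cutoff=20, sketch=None):
    """Yield reads whose median k-mer count so far is below `cutoff`.

    Assumption: reads are plain strings, and reverse complements are
    ignored for simplicity. Each read is seen exactly once (one pass).
    """
    sketch = sketch if sketch is not None else CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        if median(sketch.count(w) for w in words) < cutoff:
            for w in words:
                sketch.add(w)
            yield read  # read still adds new coverage, so keep it


if __name__ == "__main__":
    # Toy input: ten copies of one read plus one novel read.
    reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
    for kept in prefilter(reads, k=4, cutoff=3):
        print(kept)
```

In this sketch, redundant reads are discarded once their k-mers have been seen often enough, which is a lossy data reduction, while the counting structure stays the same size no matter how many reads stream through. Error correction, which the abstract also proposes, is not shown here.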
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by C. Titus BROWN
Other grants by C. Titus BROWN
Tools and Workflows for Mining Genomic Data on Many Clouds
- Grant number: 9559842
- Fiscal year: 2017
- Funding amount: $204,200
- Project category:
BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data
- Grant number: 8599821
- Fiscal year: 2013
- Funding amount: $204,200
- Project category:
Similar overseas grants
Linkage of HIV amino acid variants to protective host alleles at CHD1L and HLA class I loci in an African population
- Grant number: 502556
- Fiscal year: 2024
- Funding amount: $204,200
- Project category:
Olfactory Epithelium Responses to Human APOE Alleles
- Grant number: 10659303
- Fiscal year: 2023
- Funding amount: $204,200
- Project category:
Deeply analyzing MHC class I-restricted peptide presentation mechanistics across alleles, pathways, and disease coupled with TCR discovery/characterization
- Grant number: 10674405
- Fiscal year: 2023
- Funding amount: $204,200
- Project category:
An off-the-shelf tumor cell vaccine with HLA-matching alleles for the personalized treatment of advanced solid tumors
- Grant number: 10758772
- Fiscal year: 2023
- Funding amount: $204,200
- Project category:
Identifying genetic variants that modify the effect size of ApoE alleles on late-onset Alzheimer's disease risk
- Grant number: 10676499
- Fiscal year: 2023
- Funding amount: $204,200
- Project category:
New statistical approaches to mapping the functional impact of HLA alleles in multimodal complex disease datasets
- Grant number: 2748611
- Fiscal year: 2022
- Funding amount: $204,200
- Project category: Studentship
Recessive lethal alleles linked to seed abortion and their effect on fruit development in blueberries
- Grant number: 22K05630
- Fiscal year: 2022
- Funding amount: $204,200
- Project category: Grant-in-Aid for Scientific Research (C)
Genome and epigenome editing of induced pluripotent stem cells for investigating osteoarthritis risk alleles
- Grant number: 10532032
- Fiscal year: 2022
- Funding amount: $204,200
- Project category:
Investigating the Effect of APOE Alleles on Neuro-Immunity of Human Brain Borders in Normal Aging and Alzheimer's Disease Using Single-Cell Multi-Omics and In Vitro Organoids
- Grant number: 10525070
- Fiscal year: 2022
- Funding amount: $204,200
- Project category:
Leveraging the Evolutionary History to Improve Identification of Trait-Associated Alleles and Risk Stratification Models in Native Hawaiians
- Grant number: 10689017
- Fiscal year: 2022
- Funding amount: $204,200
- Project category: