Genomic Compression: From Information Theory to Parallel Algorithms
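Basic Information
- Grant number: 9259954
- Principal investigator:
- Amount: $303,500
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2015
- Funding country: United States
- Duration: 2015-06-01 to 2019-05-31
- Project status: Completed
- Source: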
- Keywords: Address; Algorithms; Archives; Area; Arithmetic; Big Data; Biological; Biomedical Research; Categories; Chromosomes; Code; Computer software; DNA sequencing; Data; Data Compression; Databases; Detection; Dimensions; Disease; Ensure; Evaluation; Future; Genome; Genomics; Goals; Government; Growth; Health Care Research; Imagery; Individual; Information Theory; Knowledge; Measurement; Medical Research; Methods; Mining; Modeling; Modernization; Nucleotides; Outcome; Outcomes Research; Performance; Positioning Attribute; Process; Property; Psychological Techniques; Research; Scheme; Side; Sorting - Cell Movement; Speed; Statistical Data Interpretation; Techniques; The Cancer Genome Atlas; Time; Trees; United States National Institutes of Health; Weight; base; cancer genome; clinical practice; computing resources; cost; crowdsourcing; data access; data format; design; disease-causing mutation; experience; functional genomics; genomic data; improved; indexing; novel; novel strategies; operation; parallel computer; personalized medicine; programs; public health relevance; signal processing; statistics; theories; whole genome
Project Abstract
DESCRIPTION (provided by applicant): One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research is largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to experience drastic growth in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that will allow for fast exchange, dissemination, random access, visualization and search of diversely formatted genomic information. The use of specialized compression methods for biological data will ensure unprecedented growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large-scale dissemination of experimental results. Specific aims of the proposal include developing parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some of the universal features of the developed compression techniques will make it possible to successfully apply them to other emerging genomic data formats. The long-term objectives of the proposed research program are two-fold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality reduction methods for genomic and functional genomic data, using information-theoretic techniques.
The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ and Wig track data compression. The developed algorithms are expected to include suitably combined, modified and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context-mixing and context-tree weighting with biological side-information. Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms in order to reduce the latency of the compression and decompression process. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and stringing methods. SAM, FASTQ and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compression of these and other genomic information formats will enable management, transfer and access to massive data crucial for the operation of governmental and NIH-sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.
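As a rough illustration of aim (b) in the description above, lossy compression of quality scores, the sketch below bins the Phred scores of one FASTQ quality line and compares the zeroth-order empirical entropy before and after binning. It is not the project's method: the function names, the Phred+33 offset, the bin width of 10, and the sample quality line are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def quantize(scores, bin_width=10):
    """Lossy step: replace each score by the midpoint of its fixed-width bin."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


# Hypothetical quality line, not taken from any real data set.
quality_line = "IIIIHHGGFFEEDDCCBBAA@@??>>==<<;;::"
scores = phred_scores(quality_line)
binned = quantize(scores)

h_raw = empirical_entropy(scores)
h_binned = empirical_entropy(binned)
print(f"raw:    {h_raw:.3f} bits/score")
print(f"binned: {h_binned:.3f} bits/score")
```

Because binning is a deterministic mapping, the entropy of the binned scores cannot exceed that of the raw scores, so an entropy coder such as arithmetic or Huffman coding would need no more bits per score after the lossy step. The price is the distortion introduced by replacing each score with its bin midpoint.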
Project Outcomes
Journal articles (27)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores.
- DOI: 10.1109/itw.2016.7606808
- Publication date: 2016-09
- Journal:
- Impact factor: 0
- Authors: Ochoa I; No A; Hernaez M; Weissman T
- Corresponding author: Weissman T
Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations.
- DOI: 10.1109/tnet.2017.2728638
- Publication date: 2017
- Journal:
- Impact factor: 0
- Authors: Dau, Hoang; Milenkovic, Olgica
- Corresponding author: Milenkovic, Olgica
Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes.
- DOI: 10.1109/tit.2016.2535172
- Publication date: 2016
- Journal:
- Impact factor: 2.5
- Authors: Steiner, Fabian; Dempfle, Steffen; Ingber, Amir; Weissman, Tsachy
- Corresponding author: Weissman, Tsachy
Aligned genomic data compression via improved modeling.
- DOI: 10.1142/s0219720014420025
- Publication date: 2014
- Journal:
- Impact factor: 1
- Authors: Ochoa, Idoia; Hernaez, Mikel; Weissman, Tsachy
- Corresponding author: Weissman, Tsachy
Chained Kullback-Leibler Divergences.
- DOI: 10.1109/isit.2016.7541365
- Publication date: 2016-07
- Journal:
- Impact factor: 0
- Authors: Pavlichin DS; Weissman T
- Corresponding author: Weissman T
Other grants by Olgica Milenkovic
Genomic Compression: From Information Theory to Parallel Algorithms
- Grant number: 9239305
- Fiscal year: 2015
- Funding amount: $303,500
- Project category:
Genomic Compression: From Information Theory to Parallel Algorithms
- Grant number: 8876278
- Fiscal year: 2015
- Funding amount: $303,500
- Project category:
Similar Overseas Grants
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
- Grant number: EP/Y029089/1
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Research Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
- Grant number: 2337776
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
- Grant number: 2338816
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
- Grant number: 2338846
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
- Grant number: 2348261
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
- Grant number: 2348346
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
- Grant number: 2348457
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
- Grant number: 2404989
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
- Grant number: 2339310
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
- Grant number: 2339669
- Fiscal year: 2024
- Funding amount: $303,500
- Project category: Continuing Grant