Data compression for biomedical data analysis

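Basic information

  • Grant number:
    RGPIN-2022-03074
  • Principal investigator:
  • Amount:
    $21,100
  • Host institution:
  • Host institution country:
    Canada
  • Program type:
    Discovery Grants Program - Individual
  • Fiscal year:
    2022
  • Funding country:
    Canada
  • Project period:
    2022-01-01 to 2023-12-31
  • Project status:
    Completed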

Project abstract

Increasingly massive biological data sets are being generated. These data range from high-throughput (meta)genomic sequencing to population-level studies of communities made possible only by the advent of coordinated electronic health records. Much of researchers' focus has been on analyzing and interpreting these data sets to generate biological insights or medical interventions. However, this focus on biological impact obscures the fundamental infrastructural challenges of handling and transmitting those data, for which the design of appropriate, targeted data compression techniques is essential. As computational resources and transmission bandwidth become incapable of handling the influx, faster algorithms and appropriate data compression become essential for large-scale analytics. Fortunately, unlike simply applying general-purpose data compression, designing targeted compression methods often leads to the discovery of other desirable, biologically relevant features.

The aims of this project are (1) to design new lossy compressive feature sets suitable for fast transmission and analysis of genomic sequencing data, (2) to develop succinct compressed summary sketches of medical data for privacy-preserving distributed analyses, and (3) to utilize insights from the previous two aims to build faster bioanalysis software.

GOALS and APPROACH

(1) Compression algorithms typically rely on identifying repetitive patterns in the source data to structure the compressed representation. In the context of sequencing data, biologists have often relied on random k-mer selection to find redundancies. We believe that rigorous analysis of k-mer selection methods and related alternatives can be used to exploit redundancy in both population mapping and metagenomic data sets.

(2) An alternative to identifying repetitive patterns is to extract only the patterns that downstream agents perceive. In this mode, many analyses do not need access to the raw data and can instead work with probabilistic summaries. This not only reduces transmission requirements between collaborating institutions, but can also provide and improve privacy guarantees, which is useful when dealing with patient health records. These probabilistic summaries can be further augmented with multi-party computation techniques from the cryptographic literature to give privacy and security guarantees to all parties involved in the analysis.

(3) Prior work shows that building smaller representations of data can often improve the runtime, and sometimes the accuracy, of downstream analysis algorithms. This is not so much a separate goal; rather, one aim of this proposal is to demonstrate the practical relevance of the compressed representations from Goals (1) and (2) to practitioners through the design and prototyping of usable software packages and libraries.
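
As an illustration only, the idea of replacing raw sequence data with a small probabilistic summary can be sketched with a bottom-k MinHash over k-mers. The abstract does not name a specific sketching method, so the choice of MinHash, the BLAKE2b hash, and the function names `kmers`, `hash64`, `bottom_sketch`, and `estimate_jaccard` are all assumptions made for this example, not the project's actual approach.

```python
import hashlib


def kmers(seq, k):
    """Yield every length-k substring (k-mer) of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b with an 8-byte digest)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=8, size=64):
    """Keep only the `size` smallest distinct k-mer hashes as the summary."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def estimate_jaccard(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches.

    Takes the `size` smallest hashes of the merged sketches and counts
    how many of them appear in both inputs (the bottom-k estimator).
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Under these assumptions, two institutions could exchange only the fixed-size sketches (at most 64 integers each here) and still estimate how much k-mer content their sequences share. A plain sketch like this carries no formal privacy guarantee on its own; the abstract points to multi-party computation for that.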

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants held by Yu, YunWilliam

Data compression for biomedical data analysis
  • Grant number:
    DGDND-2022-03074
  • Fiscal year:
    2022
  • Funding amount:
    $21,100
  • Program type:
    DND/NSERC Discovery Grant Supplement
Data compression for biomedical data analysis
  • Grant number:
    DGECR-2022-00353
  • Fiscal year:
    2022
  • Funding amount:
    $21,100
  • Program type:
    Discovery Launch Supplement

Similar overseas grants

Data compression for biomedical data analysis
  • Grant number:
    DGDND-2022-03074
  • Fiscal year:
    2022
  • Funding amount:
    $21,100
  • Program type:
    DND/NSERC Discovery Grant Supplement
Data compression for biomedical data analysis
  • Grant number:
    DGECR-2022-00353
  • Fiscal year:
    2022
  • Funding amount:
    $21,100
  • Program type:
    Discovery Launch Supplement
A Fast, Accurate and Cloud-based Data Processing Pipeline for High-Density, High-Site-Count Electrophysiology
  • Grant number:
    9905557
  • Fiscal year:
    2018
  • Funding amount:
    $21,100
  • Program type:
Meaningful Data Compression and Reduction of High-Throughput Sequencing Data
  • Grant number:
    9336154
  • Fiscal year:
    2015
  • Funding amount:
    $21,100
  • Program type:
Task-Specific Compression for Biomedical Big Data
  • Grant number:
    9265807
  • Fiscal year:
    2015
  • Funding amount:
    $21,100
  • Program type:
Task-Specific Compression for Biomedical Big Data
  • Grant number:
    9070666
  • Fiscal year:
    2015
  • Funding amount:
    $21,100
  • Program type:
Task-Specific Compression for Biomedical Big Data
  • Grant number:
    8874698
  • Fiscal year:
    2015
  • Funding amount:
    $21,100
  • Program type:
Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools
  • Grant number:
    9546755
  • Fiscal year:
    2013
  • Funding amount:
    $21,100
  • Program type:
Database-centric data analysis of molecular simulations
  • Grant number:
    8457122
  • Fiscal year:
    2010
  • Funding amount:
    $21,100
  • Program type:
Database-centric data analysis of molecular simulations
  • Grant number:
    8061700
  • Fiscal year:
    2010
  • Funding amount:
    $21,100
  • Program type: