CAREER: Towards Fast and Scalable Algorithms for Big Proteogenomics Data Analytics

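Basic information

  • Award number:
    1925960
  • Principal investigator:
  • Amount:
    $416,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2018
  • Funding country:
    United States
  • Start and end dates:
    2018-09-01 to 2023-09-30
  • Project status:
    Completed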

Project abstract

Proteogenomics studies require combining and integrating mass spectrometry (MS) data for proteomics with next-generation sequencing (NGS) data for genomics. This integration drastically increases the size of the data sets that must be analyzed to draw biological conclusions. However, existing tools yield low accuracy and scale poorly on big proteogenomics data. This CAREER grant is expected to lay a foundation for fast algorithmic and high-performance computing solutions suitable for analyzing big proteogenomics data sets. The project will pursue the design of accurate computational algorithms suitable for peta-scale data sets, and the software implementation will run on massively parallel supercomputers and graphics processing units. The direction of this CAREER proposal is towards designing and building infrastructure that would be useful for the broadest biological and ecological community. A comprehensive interdisciplinary education program will be carried out for K-12, undergraduate, and graduate students to ensure that the US retains its global leadership position in STEM fields. This project thus serves the national interest, as stated by NSF's mission: to promote the progress of science and to advance the national health, prosperity and welfare.
The goal of the proposed CAREER grant is to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. Integrating the MS and NGS data sets required for proteogenomics studies involves enormous data volume and velocity: NGS technologies such as ChIP-Seq can generate terabytes of DNA/RNA data, and mass spectrometers can generate millions of spectra (with thousands of peaks per spectrum). Current systems for analyzing MS data are mainly driven by heuristic practices and do not scale well. This CAREER proposal will explore a new class of reductive algorithms for the analysis of MS data that can allow peptide deductions in sublinear time, compression algorithms that operate in sublinear space, and de novo algorithms that operate on a lossy reduced form of the MS data. Novel low-complexity sampling and reductive algorithms that exploit the sparsity of MS data, such as convolution kernels based on the non-uniform FFT, can lead to superior similarity metrics that are not prone to spurious correlations.
The bottleneck in large systems-biology studies is the low scalability of coarse-grained parallel algorithms, which do not exploit MS-specific data characteristics and produce unbalanced loads because peptide deductions require non-uniform compute time. This project aims to explore the design and implementation of scalable algorithms for both NGS and MS data on multicore and GPU platforms, using domain decomposition techniques based on spectral clustering, MS-specific hybrid load balancing based on workload estimates, HPC dimensionality reduction strategies, and novel out-of-core sketching and streaming fine-grained parallel algorithms. These HPC solutions can enable previously impractical proteogenomics projects and allow biologists to perform computational experiments without needing expensive hardware. All of the implemented algorithms will be made available as open-source code interfaced with the Galaxy framework to ensure maximum impact in systems biology labs. The designed techniques will then be integrated so that spectra can be matched to RNA-Seq data without a reconstructed transcriptome. The proposed tools aim to reveal new biological insights such as novel genes, proteins, and post-translational modifications (PTMs), and are crucial steps towards understanding the genomic, proteomic, and evolutionary aspects of species in the tree of life.
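
The abstract names load balancing based on workload estimates as a remedy for the uneven compute time of peptide deductions. The following is a minimal sketch of that general idea only, not the project's actual method: it uses a standard greedy longest-processing-time heuristic, and the function name `balance_by_estimated_cost`, the use of peak counts as a cost estimate, and the sample numbers are all assumptions made for illustration.

```python
import heapq


def balance_by_estimated_cost(costs, n_workers):
    """Assign work items to workers with a greedy longest-processing-time rule.

    costs: hypothetical per-spectrum workload estimates (here: peak counts).
    Returns (assignment, loads), where assignment[w] lists the item indices
    given to worker w and loads[w] is that worker's total estimated cost.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    assignment = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    # Min-heap of (current load, worker id): the lightest worker is on top.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    # Place the most expensive items first so they do not pile up at the end.
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for idx in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(idx)
        load += costs[idx]
        loads[w] = load
        heapq.heappush(heap, (load, w))
    return assignment, loads


# Illustrative input: made-up peak counts standing in for workload estimates.
peak_counts = [900, 120, 450, 3000, 80, 1500, 600, 200]
assignment, loads = balance_by_estimated_cost(peak_counts, 3)
print(assignment)
print(loads)
```

A static split by item count would give each worker roughly the same number of spectra regardless of cost; weighting by an estimate lets one very expensive spectrum occupy a worker on its own while the cheaper ones are spread across the rest.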

Project outcomes

Journal articles (34)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data.
  • DOI:
    10.1038/s43588-021-00113-z
  • Publication date:
    2021-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    Haseeb M; Saeed F
  • Corresponding author:
    Saeed F
A Deep Learning-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets
  • DOI:
    10.1109/tbdata.2018.2805687
  • Publication date:
    2019
  • Journal:
  • Impact factor:
    7.2
  • Authors:
    Aledhari, Mohammed; Di Pierro, Marianne; Hefeida, Mohamed; Saeed, Fahad
  • Corresponding author:
    Saeed, Fahad
ASD-SAENet: A Sparse Autoencoder, and Deep-Neural Network Model for Detecting Autism Spectrum Disorder (ASD) Using fMRI Data.
A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets
Machine Learning Methods for Diagnosing Autism Spectrum Disorder and Attention-Deficit/Hyperactivity Disorder Using Functional and Structural MRI: A Survey.
  • DOI:
    10.3389/fninf.2020.575999
  • Publication date:
    2020
  • Journal:
  • Impact factor:
    3.5
  • Authors:
    Eslami T; Almuqhim F; Raiker JS; Saeed F
  • Corresponding author:
    Saeed F

Other publications by Fahad Saeed

Best convective parameterization scheme within RegCM4 to downscale CMIP5 multi-model data for the CORDEX-MENA/Arab domain
  • DOI:
    10.1007/s00704-015-1463-5
  • Publication date:
    2015-04-22
  • Journal:
  • Impact factor:
    2.700
  • Authors:
    Mansour Almazroui; Md. Nazrul Islam; A. K. Al-Khalaf; Fahad Saeed
  • Corresponding author:
    Fahad Saeed
The Dialysis De Facto Default Is Not for Everyone: The Palliative Care Clinician's Role for Older Patients with Kidney Failure and Comorbidities (TH137)
  • DOI:
    10.1016/j.jpainsymman.2022.02.220
  • Publication date:
    2022-05-01
  • Journal:
  • Impact factor:
  • Authors:
    Alvin Moss; Dale Lupu; Fahad Saeed; Christine Corbett
  • Corresponding author:
    Christine Corbett
Establishing Research Priorities in Geriatric Nephrology: A Delphi Study of Clinicians and Researchers
  • DOI:
    10.1053/j.ajkd.2024.09.012
  • Publication date:
    2025-03-01
  • Journal:
  • Impact factor:
    8.200
  • Authors:
    Catherine R. Butler; Akanksha Nalatwad; Katharine L. Cheung; Mary F. Hannan; Melissa D. Hladek; Emily A. Johnston; Laura Kimberly; Christine K. Liu; Devika Nair; Semra Ozdemir; Fahad Saeed; Jennifer S. Scherer; Dorry L. Segev; Anoop Sheshadri; Karthik K. Tennankore; Tiffany R. Washington; Dawn Wolfgram; Nidhi Ghildayal; Rasheeda Hall; Mara McAdams-DeMarco
  • Corresponding author:
    Mara McAdams-DeMarco
International Politics — Effects on the Training of International Medical Graduates
  • DOI:
    10.1007/bf03341760
  • Publication date:
    2014-01-17
  • Journal:
  • Impact factor:
    1.800
  • Authors:
    Fahad Saeed; Nadia Kousar; Jean L. Holley
  • Corresponding author:
    Jean L. Holley
Correction to: Hydrologic interpretation of machine learning models for 10-daily streamflow simulation in climate sensitive upper Indus catchments
  • DOI:
    10.1007/s00704-024-05121-3
  • Publication date:
    2024-08-07
  • Journal:
  • Impact factor:
    2.700
  • Authors:
    Haris Mushtaq; Taimoor Akhtar; Muhammad Zia ur Rahman Hashmi; Amjad Masood; Fahad Saeed
  • Corresponding author:
    Fahad Saeed

Other grants held by Fahad Saeed

OAC Core: High Performance Computing Algorithms and Software for large-scale Mass Spectrometry based Omics
  • Award number:
    2312599
  • Fiscal year:
    2023
  • Award amount:
    $416,000
  • Award type:
    Standard Grant
PFI-TT: Artificial Intelligence-enabled Real-time System for Early Epileptic Seizure Detection and Prediction
  • Award number:
    2213951
  • Fiscal year:
    2022
  • Award amount:
    $416,000
  • Award type:
    Standard Grant
I-Corps: Utilizing Machine learning and Artificial Intelligence (AI) for Early Detection and Identification of Mental Disorders
  • Award number:
    2143515
  • Fiscal year:
    2021
  • Award amount:
    $416,000
  • Award type:
    Standard Grant
CRII: SHF: HPC Solutions to Big NGS Data Compression
  • Award number:
    1855441
  • Fiscal year:
    2018
  • Award amount:
    $416,000
  • Award type:
    Standard Grant
CAREER: Towards Fast and Scalable Algorithms for Big Proteogenomics Data Analytics
  • Award number:
    1651724
  • Fiscal year:
    2017
  • Award amount:
    $416,000
  • Award type:
    Standard Grant
CRII: SHF: HPC Solutions to Big NGS Data Compression
  • Award number:
    1464268
  • Fiscal year:
    2015
  • Award amount:
    $416,000
  • Award type:
    Standard Grant

Similar overseas grants

CAREER: Towards a general recipe for fast high-dimensional scientific computing
  • Award number:
    2339439
  • Fiscal year:
    2024
  • Award amount:
    $416,000
  • Award type:
    Continuing Grant
Development of Rotating Anode Systems for Fast Electrorefining of Copper: Towards a Stronger Resource-Recycling Framework
  • Award number:
    23K17834
  • Fiscal year:
    2023
  • Award amount:
    $416,000
  • Award type:
    Grant-in-Aid for Challenging Research (Exploratory)
CAREER: NgOS: Towards Better Operating Systems: Fast, Secure, and Reliable
  • Award number:
    2239615
  • Fiscal year:
    2023
  • Award amount:
    $416,000
  • Award type:
    Continuing Grant
CAREER: Towards Harnessing the Motility of Microorganisms: Fast Algorithms, Data-Driven Models, and 3D Interactive Visual Computing
  • Award number:
    2408964
  • Fiscal year:
    2023
  • Award amount:
    $416,000
  • Award type:
    Continuing Grant
CAREER: Towards Harnessing the Motility of Microorganisms: Fast Algorithms, Data-Driven Models, and 3D Interactive Visual Computing
  • Award number:
    2146191
  • Fiscal year:
    2022
  • Award amount:
    $416,000
  • Award type:
    Continuing Grant
CAREER: Towards Efficient and Fast Hierarchical Federated Learning in Heterogeneous Wireless Edge Networks
  • Award number:
    2145031
  • Fiscal year:
    2022
  • Award amount:
    $416,000
  • Award type:
    Continuing Grant
Towards Glucose Transporter-Mediated Glucose-Responsive Insulin Delivery with Fast Response
  • Award number:
    10425401
  • Fiscal year:
    2018
  • Award amount:
    $416,000
  • Award type:
Towards the Fabrication of Functional Nanomaterials via Bioorthogonal Chemistry: It's fast, it's clean, it's biocompatible!
  • Award number:
    505109-2017
  • Fiscal year:
    2018
  • Award amount:
    $416,000
  • Award type:
    Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Development of a fast and modular approach towards the synthesis of unusual boron-containing molecules for serine hydrolase probe design.
  • Award number:
    489969-2016
  • Fiscal year:
    2018
  • Award amount:
    $416,000
  • Award type:
    Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Exploration and Fast Evaluation of Novel Electrode Materials Towards Development of Post-Lithium-Ion Batteries for Energy Storage Systems
  • Award number:
    18H01427
  • Fiscal year:
    2018
  • Award amount:
    $416,000
  • Award type:
    Grant-in-Aid for Scientific Research (B)