
Optimized workflows for structural variant analysis of the Kids First genomes using short and long reads


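Basic Information

Grant number:
10432507
Principal investigator:
MICHAEL SCHATZ
Amount:
$156,300
Host institution:
JOHNS HOPKINS UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2022
Funding country:
United States
Project period:
2022-04-01 to 2024-03-31
Project status:
Completed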

Source:
https://reporter.nih.gov/project-details/10432507
Keywords:
Address; Affect; Algorithms; Base Pairing; Biological Sciences; Code; Complex; Data Analyses; Data Set; Development; Disease; Ensure; Ethnic Origin; Etiology; Family member; Fostering; Genes; Genetic; Genetic Variation; Genome; Genomics; Genotype; Goals; Human Genome; Individual; Jasminum; Malignant Childhood Neoplasm; Maps; Medical; Mutation; Patients; Pediatric Research; Phase; Pilot Projects; Population; Proteins; Repetitive Sequence; Reproducibility; Research; Research Personnel; Resolution; Sampling; Structural Congenital Anomalies; Technology; Time; Variant; X Chromosome; autosome; cloud based; cohort; data resource; driver mutation; genetic analysis; genetic pedigree; genome analysis; genome annotation; genome-wide; human reference genome; improved; insertion/deletion mutation; nanopore; novel; open source; paralogous gene; power analysis; programs; reconstruction; reference genome; screening; software development; statistical and machine learning; telomere; variant detection

Project Summary

The overall goal of the Gabriella Miller Kids First Pediatric Research Program is to alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases. A recent addition to the program is the Kids First Long Read Pilot Projects, which leverage long-read sequencing technologies to further resolve the patients’ genomes. These technologies are already transforming genomics by allowing complete telomere-to-telomere (T2T) reconstructions of human genomes for the first time, and by allowing the discovery of structural variants (SVs) and other complex variants that were previously inaccessible using short-read sequencing. Here we will enhance the utility of the Kids First data sets by developing and applying optimized cloud-scale workflows for analyzing short- and long-read datasets with the new T2T-CHM13 human genome. Within the T2T consortium, we have led the effort to characterize how the CHM13 genome influences variant calling, and have found that the T2T reference universally improves the analysis of genetic variation using both short- and long-read sequencing. First, we will develop optimized workflows for analyzing short-read datasets with the T2T-CHM13 reference genome, using GATK for single-nucleotide variants (SNVs) and small indels, and Parliament2 for short-read SV discovery. Next, we will develop optimized workflows for long-read structural variant detection. Short reads struggle to detect many classes of mutations (e.g., SVs and repeat expansions) and cannot resolve many repetitive regions of the genome, including regions within many medically relevant genes. Long reads show great promise to address these challenges and discover new disease associations because of their increased mappability, variant resolution, and phasing capabilities.
To enable these technologies for Kids First, we will develop optimized workflows for accurately identifying and comparing SVs across long-read samples with Jasmine, as well as for genotyping SVs discovered by long reads within short-read datasets with Paragraph. This will enable us to analyze and prioritize variants found by long reads within the much larger number of short-read datasets. We will then apply these workflows to the Kids First data resource to develop improved variant calls and improved variant analysis of these precious samples. This will lead to the discovery of thousands of SVs that were previously missed, and will reduce the number of false variants that would otherwise confuse any downstream analysis. We will also develop new statistical and machine learning approaches for prioritizing the variants that are most likely to be related to the studied diseases, leveraging the available pedigree information and genome annotations, in support of our overall goal of identifying the driver mutations for these diseases. All workflows and software developments will be released open source for use in CAVATICA, the cloud-based analysis platform used by all Kids First researchers, ensuring scalability and reproducibility.
