Exact scalable inference for coalescent processes

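Basic information

  • Grant number:
    EP/R044732/1
  • Principal investigator:
  • Amount:
    $126,800
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2018
  • Funding country:
    United Kingdom
  • Period:
    2018 to (no data)
  • Project status:
    Completed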

Project abstract

Modern genetic data sets are vast, both in the number of sequenced individuals and in the length of the sequenced DNA segments. Patterns within these data sets carry information about the biological and demographic histories of the population, which cannot usually be observed directly.

The central tool connecting observed patterns to predictions and inference is the Kingman coalescent: a random tree that provides a model for the unobserved ancestry of the sampled DNA sequences. Since the ancestry is unobserved, inferences are made by averaging over all possible ancestries.

In simple cases the average over ancestries can be calculated analytically, but in most biologically relevant scenarios it has to be approximated. This is usually done by simulating an ensemble of possible ancestral trees and treating the ensemble average as an approximation of the true, unknown average. The quality of the approximation depends on the degree to which the ensemble is representative of the set of all possible ancestries. Ensuring that an ensemble is both representative and not infeasibly large is a challenging problem.

Existing methods for producing ensembles fall into two categories: importance sampling (IS) and Markov chain Monte Carlo (MCMC), of which the latter is typically more flexible and easier to implement. Both are known to scale poorly with the size and complexity of the data set. This proposal seeks to improve the scalability of state-of-the-art MCMC methods in three related ways:

1. Much work has been done to characterise optimal IS algorithms, which have been observed to perform roughly as well as naive implementations of the more flexible MCMC. Preliminary results for this project show that optimality results for IS can also be used to characterise optimal MCMC algorithms, but this has not previously been done. This work will investigate and thoroughly benchmark the performance of the resulting optimised MCMC algorithms.

2. The practical utility of MCMC algorithms has improved dramatically through so-called optimal scaling results, which provide a guide for how to tune the algorithm as the data set grows. However, these typically apply only to settings in which the distribution being simulated consists of independent, real-valued components. In genetics, the distributions of interest are distributions over trees, and are hence much more complicated. This project will investigate extensions of optimal scaling results to tree-valued settings using recently developed machinery of optimal scaling via Dirichlet forms, which are a natural way to analyse tree-valued algorithms.

3. A recently published algorithm called msprime uses a novel data structure, called a sparse tree, to improve the speed and memory consumption of naive coalescent simulation by many orders of magnitude. This does not immediately translate to improved inference algorithms, because naive simulation typically results in ensembles whose averages are poor approximations of the true average. The sparse tree structure cannot be directly inserted into an MCMC algorithm, but preliminary work has identified several ways in which MCMC can be modified to use data structures resembling sparse trees. This project will implement and benchmark all of the resulting algorithms to determine which of these ways is the most effective.

The end result of these three streams will be a highly optimised, flexible, open source algorithm for inference in genetics. It will have unprecedented performance on large data sets due to a combination of mathematical optimisation (objectives 1 and 2) and optimisation of the underlying data structure (objective 3). MCMC algorithms also provide automatic, rigorous uncertainty quantification for their estimates, which many state-of-the-art competitors are not able to provide. This makes MCMC particularly well suited to settings such as clinical practice, where understanding uncertainties is crucial for medical outcomes.
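
The idea of approximating an average over ancestries by an ensemble average can be illustrated with a minimal sketch. It assumes the simplest possible setting, which is not the project's own method: naive simulation of the Kingman coalescent for n lineages, where the quantity being averaged is the time to the most recent common ancestor (TMRCA). In this special case the exact mean, 2(1 - 1/n) in coalescent time units, is known, so the ensemble average can be checked against it. The function names are hypothetical and chosen for illustration only.

```python
import random


def simulate_tmrca(n, rng):
    """Draw one TMRCA for n lineages under the Kingman coalescent.

    While k lineages remain, the waiting time to the next merger is
    exponential with rate k * (k - 1) / 2 (coalescent time units).
    """
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # one pair of lineages merges
    return t


def ensemble_mean_tmrca(n, num_trees, seed=1):
    """Ensemble average of the TMRCA over num_trees simulated ancestries."""
    rng = random.Random(seed)
    total = sum(simulate_tmrca(n, rng) for _ in range(num_trees))
    return total / num_trees


def exact_mean_tmrca(n):
    """Analytic mean: sum over k = 2..n of 2 / (k (k - 1)) = 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


if __name__ == "__main__":
    n = 10
    print("ensemble:", ensemble_mean_tmrca(n, 20000))
    print("exact:   ", exact_mean_tmrca(n))
```

Here the trees are drawn without reference to any data, so every tree is equally useful. Once the average is conditioned on observed sequences, most naively simulated trees contribute very little, which is the motivation for IS and MCMC described in the abstract.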
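
The optimal scaling results mentioned in objective 2 can be sketched in the classical setting they were developed for: a target with independent, real-valued components. The example below is an illustration under that assumption, not the tree-valued extension the project proposes. It runs random-walk Metropolis on a standard normal target in `dim` dimensions with per-coordinate proposal standard deviation `ell / sqrt(dim)`, and reports the acceptance rate. The well-known guideline for this setting is that `ell` near 2.38 gives an acceptance rate of roughly 0.234 in high dimensions. The function name and parameters are hypothetical.

```python
import math
import random


def rwm_acceptance_rate(dim, ell, num_steps, seed=1):
    """Acceptance rate of random-walk Metropolis on a standard normal target.

    The proposal adds independent N(0, sigma^2) noise to every coordinate,
    with sigma = ell / sqrt(dim), i.e. the step size shrinks as dim grows.
    """
    rng = random.Random(seed)
    sigma = ell / math.sqrt(dim)
    # Start from a draw from the target so no burn-in is needed.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    log_pi_x = -0.5 * sum(v * v for v in x)
    accepted = 0
    for _ in range(num_steps):
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        log_pi_y = -0.5 * sum(v * v for v in y)
        # Metropolis accept/reject step on the log scale.
        if math.log(1.0 - rng.random()) < log_pi_y - log_pi_x:
            x, log_pi_x = y, log_pi_y
            accepted += 1
    return accepted / num_steps


if __name__ == "__main__":
    for ell in (0.5, 2.38, 6.0):
        rate = rwm_acceptance_rate(dim=50, ell=ell, num_steps=2000)
        print(f"ell={ell}: acceptance rate {rate:.3f}")
```

Steps that are too small are almost always accepted but move slowly, and steps that are too large are almost always rejected. The scaling rule picks the compromise, and the open question raised in the abstract is what the analogous rule looks like when the state is a tree rather than a vector.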

Project outputs

Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Weak convergence of non-neutral genealogies to Kingman's coalescent
Robust model selection between population growth and multiple merger coalescents.
  • DOI:
    10.1016/j.mbs.2019.03.004
  • Published:
    2019
  • Journal:
  • Impact factor:
    4.3
  • Authors:
    Koskela J
  • Corresponding author:
    Koskela J
Zig-zag sampling for discrete structures and non-reversible phylogenetic MCMC
  • DOI:
    10.48550/arxiv.2004.08807
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Koskela J
  • Corresponding author:
    Koskela J
Efficient ancestry and mutation simulation with msprime 1.0.
  • DOI:
    10.1093/genetics/iyab229
  • Published:
    2022-03-03
  • Journal:
  • Impact factor:
    3.3
  • Authors:
    Baumdicker, Franz;Bisschop, Gertjan;Goldstein, Daniel;Gower, Graham;Ragsdale, Aaron P.;Tsambos, Georgia;Zhu, Sha;Eldon, Bjarki;Ellerman, E. Castedo;Galloway, Jared G.;Gladstein, Ariella L.;Gorjanc, Gregor;Guo, Bing;Jeffery, Ben;Kretzschmar, Warren W.;Lohse, Konrad;Matschiner, Michael;Nelson, Dominic;Pope, Nathaniel S.;Quinto-Cortes, Consuelo D.;Rodrigues, Murillo F.;Saunack, Kumar;Sellinger, Thibaut;Thornton, Kevin;van Kemenade, Hugo;Wohns, Anthony W.;Wong, Yan;Gravel, Simon;Kern, Andrew D.;Koskela, Jere;Ralph, Peter L.;Kelleher, Jerome
  • Corresponding author:
    Kelleher, Jerome
Statistical tools for seed bank detection
  • DOI:
    10.48550/arxiv.1907.13549
  • Published:
    2019
  • Journal:
  • Impact factor:
    0
  • Authors:
    Blath J
  • Corresponding author:
    Blath J

Other publications by Jere Koskela

Consistency of Bayesian nonparametric inference for discretely observed jump diffusions
  • DOI:
  • Published:
    2015
  • Journal:
  • Impact factor:
    1.5
  • Authors:
    Jere Koskela;Dario Spanò;P. A. Jenkins
  • Corresponding author:
    P. A. Jenkins
Zig-Zag Sampling for Discrete Structures and Nonreversible Phylogenetic MCMC
Bayesian non-parametric inference for $\Lambda$-coalescents: Posterior consistency and a parametric method
  • DOI:
    10.3150/16-bej923
  • Published:
    2015
  • Journal:
  • Impact factor:
    1.5
  • Authors:
    Jere Koskela;P. A. Jenkins;Dario Spanò
  • Corresponding author:
    Dario Spanò
Multi-locus data distinguishes between population growth and multiple merger coalescents
Genealogical processes of non-neutral population models under rapid mutation
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jere Koskela;Paul A. Jenkins;A. M. Johansen;Dario Spanò
  • Corresponding author:
    Dario Spanò

Other grants held by Jere Koskela

Mathematical foundations of non-reversible MCMC for genome-scale inference
  • Grant number:
    EP/V049208/1
  • Fiscal year:
    2021
  • Funding amount:
    $126,800
  • Project type:
    Research Grant

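Similar grants from the National Natural Science Foundation of China (NSFC)

Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
  • Grant number:
  • Approval year:
    2024
  • Funding amount:
    not stated (unit: 10,000 CNY)
  • Project type:
    Collaborative Innovation Research Team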

Similar overseas grants

Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
  • Grant number:
    2412357
  • Fiscal year:
    2024
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
  • Grant number:
    2415562
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Collaborative Research: SHF: MEDIUM: General and Scalable Pluggable Type Inference
  • Grant number:
    2312263
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Continuing Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
  • Grant number:
    2243053
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Scalable Computational Methods for Genealogical Inference: from species level to single cells
  • Grant number:
    10889303
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
  • Grant number:
    2341725
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
  • Grant number:
    2243052
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Collaborative Research: SHF: MEDIUM: General and Scalable Pluggable Type Inference
  • Grant number:
    2312262
  • Fiscal year:
    2023
  • Funding amount:
    $126,800
  • Project type:
    Continuing Grant
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
  • Grant number:
    2212508
  • Fiscal year:
    2022
  • Funding amount:
    $126,800
  • Project type:
    Standard Grant
Collaborative Research: CDS&E: Scalable Inference for Spatio-Temporal Markov Random Fields
  • Grant number:
    2152777
  • Fiscal year:
    2022
  • Funding amount:
    $126,800
  • Project type:
    Continuing Grant