NIRG: FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics

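Basic information

  • Grant number:
    MR/M000370/1
  • Principal investigator:
  • Amount:
    $482,600
  • Host institution:
  • Host institution country:
    United Kingdom
  • Project type:
    Research Grant
  • Fiscal year:
    2015
  • Funding country:
    United Kingdom
  • Duration:
    2015 to (no data)
  • Project status:
    Completed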

Project abstract

In computational genetics, phasing is the modelling of the underlying haploid structure of diploid genotypes. It is important for many genetic studies because inheritance actually takes place at the haploid level, even though current mainstream technologies can only directly observe diploid genotypes. In many applications haplotypes provide richer and more useful information than genotypes alone. Applications of haplotype phase include understanding the interplay of genetic variation and disease, enabling identity-by-descent models for use in heritability analysis, gene association studies and genomic prediction, imputation of untyped genetic variation, prioritizing individuals for sequencing, calling genotypes, detecting genotype error, inferring human demographic history, inferring points of recombination, detecting recurrent mutation and signatures of selection, and modelling cis-regulation of gene expression.
Human genetics data sets that will likely be phased in the future can be categorised into: (i) huge populations of nominally unrelated individuals (e.g. 500,000 individuals, UK Biobank); (ii) smaller subsets of such populations (e.g. data collected in individual studies); (iii) large (e.g. 50,000 individuals) or small (e.g. 1,000 individuals) data sets collected from isolated populations with high degrees of relatedness within them (e.g. Orcades - Orkney, deCODE - Iceland, VIKING - Sweden); (iv) data sets with and without pedigree information; (v) data sets that combine several of these features (e.g. Generation Scotland); and (vi) data sets with different types of genomic information (e.g. single nucleotide polymorphisms, low- or high-coverage sequence, short or longer sequence reads, etc.).
There are many phasing methods for human genetics data, and these can be broadly classified into two groups: (i) heuristic methods (e.g. Long-Range Phasing (LRP)); and (ii) probabilistic methods (e.g. Hidden Markov Models (HMM)). Phasing is computationally intensive, and the size and features of different data sets make them more or less suited to particular methods. LRP is computationally fast in comparison to HMM, but is only applicable where individuals share relatively recent ancestry (e.g. within 10 generations) and thus share relatively long haplotypes (e.g. 5 to 10 cM in length). Isolated populations (e.g. Orcades, Orkney) are ideally suited to LRP, but huge populations with hundreds of thousands of nominally unrelated individuals may also be suitable (e.g. UK Biobank). Application of current HMM methods to such huge populations is computationally intractable. However, HMM methods are better suited than LRP to subsets of such populations, because they only require that individuals share short haplotypes (e.g. <1 cM) inherited from very distant common ancestors (e.g. 50 to 100 generations ago).
LRP and HMM methods are complementary in many ways. One models long haplotypes, the other short haplotypes. HMM methods are more flexible and can better model uncertainty in the data. LRP methods are computationally much more efficient and are also more accurate in the scenarios to which they are suited. LRP methods are also more amenable to the incorporation of pedigree information. A combined algorithm could exploit this complementarity.
The objective of this proposal is to develop FARSPhase: a Flexible, widely Applicable, Robust, and Scalable phasing algorithm for human genetics that combines the best features of LRP, other heuristics, and HMM methods into a single framework. As well as meeting the phasing needs of small data sets, this research will, if successful, enable huge data sets to be phased, thereby opening the possibility of more powerful analyses. The developed algorithm will be built into a user-friendly software package using best practices in software engineering, and its performance will be tested on a wide range of simulated and real data sets that reflect the likely future phasing scenarios for human genetics.

Project outputs

Journal articles (10)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A hybrid method for the imputation of genomic data in livestock populations.
  • DOI:
    10.1186/s12711-017-0300-y
  • Published:
    2017-03-03
  • Journal:
  • Impact factor:
    0
  • Authors:
    Antolín R;Nettelblad C;Gorjanc G;Money D;Hickey JM
  • Corresponding author:
    Hickey JM
MOESM3 of A hybrid method for the imputation of genomic data in livestock populations
  • DOI:
    10.6084/m9.figshare.c.3708046_d3
  • Published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Antolín R
  • Corresponding author:
    Antolín R
MOESM8 of A hybrid method for the imputation of genomic data in livestock populations
  • DOI:
    10.6084/m9.figshare.c.3708046_d8
  • Published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Antolín R
  • Corresponding author:
    Antolín R
A family-based phasing algorithm for sequence data
  • DOI:
    10.1101/504480
  • Published:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Battagin M
  • Corresponding author:
    Battagin M
Effect of manipulating recombination rates on response to selection in livestock breeding programs.
  • DOI:
    10.1186/s12711-016-0221-1
  • Published:
    2016-06-22
  • Journal:
  • Impact factor:
    0
  • Authors:
    Battagin M;Gorjanc G;Faux AM;Johnston SE;Hickey JM
  • Corresponding author:
    Hickey JM

Other publications by John Hickey

Spatial Dissection of the Bone Marrow Microenvironment in Multiple Myeloma By High Dimensional Multiplex Tissue Imaging
  • DOI:
    10.1182/blood-2023-189255
  • Published:
    2023-11-02
  • Journal:
  • Impact factor:
  • Authors:
    Marc-Andrea Baertsch;Alexander Brobeil;John Hickey;Maximilian Haist;Alexandra Maria Poos;Guolan Lu;Wilson Kuswanto;Christian Schuerch;Harald Voehringer;Wolfgang Huber;Gunhild Mechtersheimer;Carsten Mueller-Tidow;Peter Schirmacher;Katja Weisel;Roland Fenk;Hartmut Goldschmidt;Yury Goltsev;Marc S. Raab;Niels Weinhold;Garry P. Nolan
  • Corresponding author:
    Garry P. Nolan
Colonisation of clearfelled coupes by rainforest tree species from mature mixed forest edges, Tasmania, Australia
  • DOI:
    10.1016/j.foreco.2006.11.021
  • Published:
    2007-03-15
  • Journal:
  • Impact factor:
  • Authors:
    John Tabor;Chris McElhinny;John Hickey;Jeff Wood
  • Corresponding author:
    Jeff Wood

Other grants held by John Hickey

A general method for the imputation of genomic data in crop species
  • Grant number:
    BB/R002061/1
  • Fiscal year:
    2017
  • Funding amount:
    $482,600
  • Project type:
    Research Grant
Analysis of quantitative genetic traits in a huge data set
  • Grant number:
    BB/N006178/1
  • Fiscal year:
    2016
  • Funding amount:
    $482,600
  • Project type:
    Research Grant
15AGRITECHCAT3 Precision Breeding: Broilers from Sequence to Consequence
  • Grant number:
    BB/N004728/1
  • Fiscal year:
    2015
  • Funding amount:
    $482,600
  • Project type:
    Research Grant
Developing next generation genetic improvement tools from next generation sequencing
  • Grant number:
    BB/M009254/1
  • Fiscal year:
    2015
  • Funding amount:
    $482,600
  • Project type:
    Research Grant
15AGRITECHCAT3 Innovative NextGen pig breeding using DNA sequence data
  • Grant number:
    BB/N004736/1
  • Fiscal year:
    2015
  • Funding amount:
    $482,600
  • Project type:
    Research Grant
Next generation imputation for huge data sets
  • Grant number:
    BB/L020726/1
  • Fiscal year:
    2014
  • Funding amount:
    $482,600
  • Project type:
    Research Grant