Robust Methods for the Efficient Analysis and Integration of DNA Sequence Data

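Basic Information

Grant number:
8064557
Principal investigator:
ANDREW S ALLEN
Amount:
About $209,900
Host institution:
DUKE UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2008
Funding country:
United States
Project period:
2008-09-26 to 2013-06-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/8064557
Keywords: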
Accounting Address Base Sequence Communities Complex Computer software DNA Sequence DNA Sequence Analysis Data Data Set Development Disease Disease Association Disease Progression Documentation Evolution Future Genetic Genetic Research Genetic Variation Genome Genotype Goals Haplotypes Human Genetics Individual Information Networks Internet Investigation Localized Disease Major Depressive Disorder Methodology Methods Performance Population Population Genetics Procedures Production Property Research Research Personnel Research Project Grants Rest Role Sampling Scientist Signal Transduction Single Nucleotide Polymorphism Software Tools Source Code Statistical Methods Stratification Structure Testing Trust Variant Work base case control cost database of Genotypes and Phenotypes falls genetic association genetic variant genome sequencing genome wide association study human disease novel research study response simulation statistics tool user friendly software

Project Abstract

Human genetics research is on the cusp of a major transformation in how genetic variation is captured, from a marker-based approach to one based on a complete characterization of an individual's genome by sequencing. This is an exciting prospect but not without its challenges. The imminent production of large amounts of sequence data raises several issues on how best to use these data. For example, because of the sheer scale of the data, statistical approaches for associating sequence variants with human disease need to be efficient, both statistically and computationally. In addition, most genetic association experiments in the near term will not rely solely on sequence data but instead will have sub-samples of individuals with sequence data while the rest of the sample will remain unsequenced but will contain genotype information. Alternatively, sequence data may be available on a separate, external sample. Thus it will be important to develop statistical methods that can appropriately integrate these various types of data into a unified inferential framework. This research project will address these issues by proposing to develop a novel class of sequence-based haplotype sharing statistics that exploit the implications of DNA sequence evolution in testing for variant/disease association (specific aim 1). Further, we propose to develop a statistical framework that allows for the unified analysis of DNA sequence and genotype data (specific aim 2). Throughout we will leverage our previous work developing robust methods for haplotype inference to develop computationally and statistically efficient procedures that remain robust to population genetic assumptions. A stratified analytic approach will be emphasized to allow for adjustment for confounding due to population stratification. Efficient Monte Carlo procedures will be proposed to account for the large number of sequence variants investigated.

We will develop a suite of software tools that fully implement the methodology developed and make them freely available to the general research community (specific aim 3). Finally, using these tools, we will analyze a publicly available DNA sequence dataset with the goal of better localizing disease-associated sequence variants (specific aim 4). The methods developed through this proposal represent a unified and statistically rigorous framework for developing powerful tests that exploit evolutionary relationships between DNA sequences while allowing for disparate data types to be incorporated into a unified analysis. These procedures will give researchers the tools to more finely localize disease-associated sequence variants, allowing variants to be better prioritized for subsequent investigation via functional studies.

Human genetics research is on the cusp of a major transformation in how genetic variation is captured, from a marker-based approach to one based on a complete characterization of an individual's genome by sequencing. The imminent production of large amounts of sequencing data, however, leads to questions concerning their statistical analysis and incorporation into the larger experiment. We address these questions by proposing a unified and statistically rigorous framework for developing powerful tests that exploit evolutionary relationships between DNA sequences and that allow for disparate data types to be incorporated into a unified analysis.
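
The abstract mentions a stratified analytic approach and Monte Carlo procedures that account for the large number of variants tested, without giving details. As a rough illustration only, and not the project's actual method, the sketch below shows one generic way those two ideas can be combined: case/control labels are permuted within each population stratum, and the maximum statistic across all variants is compared with its permutation distribution. All function names, the simple stratum-centered score statistic, and the data layout (one list of allele counts per variant) are assumptions made for this example.

```python
import random


def group_indices(strata):
    """Map each stratum label to the list of sample indices in that stratum."""
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)
    return groups


def stratified_score(genotypes, status, groups):
    """Absolute stratum-centered score: |sum_k sum_{i in k} (y_i - mean_k(y)) * g_i|.

    This is a deliberately simple, unstandardized statistic chosen for
    illustration; it is not taken from the abstract.
    """
    total = 0.0
    for idx in groups.values():
        mean_status = sum(status[i] for i in idx) / len(idx)
        total += sum((status[i] - mean_status) * genotypes[i] for i in idx)
    return abs(total)


def permute_within_strata(status, groups, rng):
    """Shuffle case/control labels inside each stratum only.

    Keeping each stratum's case count fixed is what guards against
    confounding by stratum membership.
    """
    permuted = list(status)
    for idx in groups.values():
        labels = [status[i] for i in idx]
        rng.shuffle(labels)
        for i, label in zip(idx, labels):
            permuted[i] = label
    return permuted


def stratified_max_pvalue(variants, status, strata, n_perm=1000, seed=0):
    """Monte Carlo p-value for the largest score across all variants.

    Comparing the observed maximum with the permutation distribution of
    the maximum adjusts for testing many variants at once.
    """
    rng = random.Random(seed)
    groups = group_indices(strata)
    observed = max(stratified_score(g, status, groups) for g in variants)
    hits = 0
    for _ in range(n_perm):
        perm = permute_within_strata(status, groups, rng)
        if max(stratified_score(g, perm, groups) for g in variants) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Here `variants` would be a list of per-variant genotype lists (minor-allele counts 0, 1, or 2 per individual), `status` a list of 0/1 disease labels, and `strata` a list of population labels of the same length. A haplotype-sharing statistic of the kind the abstract proposes could in principle be substituted for `stratified_score` without changing the permutation scheme.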
