
Robust Methods for the Efficient Analysis and Integration of DNA Sequence Data


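Basic Information

Grant number:
7692191
Principal investigator:
ANDREW S ALLEN
Amount:
approx. $234,000
Host institution:
DUKE UNIVERSITY
Host institution country:
United States
Project type:
Fiscal year:
2008
Funding country:
United States
Project period:
2008-09-26 to 2011-06-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/7692191
Keywords: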
Accounting Address Base Sequence Communities Complex Computer software DNA Sequence DNA Sequence Analysis Data Data Set Development Disease Disease Association Disease Progression Documentation Evolution Future Genetic Genetic Research Genetic Variation Genome Genotype Goals Haplotypes Human Genetics Individual Information Networks Internet Investigation Localized Disease Major Depressive Disorder Methodology Methods Performance Population Population Genetics Procedures Production Property Research Research Personnel Research Project Grants Rest Role Sampling Scientist Signal Transduction Single Nucleotide Polymorphism Software Tools Source Code Statistical Methods Stratification Structure Testing Trust Variant Work base case control cost database of Genotypes and Phenotypes falls follow-up genetic association genetic variant genome sequencing genome wide association study human disease novel research study response simulation statistics tool user friendly software

Project Abstract

DESCRIPTION (provided by applicant): Human genetics research is on the cusp of a major transformation in how genetic variation is captured: from a marker-based approach to one based on a complete characterization of an individual's genome by sequencing. This is an exciting prospect but not without its challenges. The imminent production of large amounts of sequence data raises several issues on how best to use these data. For example, because of the sheer scale of the data, statistical approaches for associating sequence variants with human disease need to be efficient, both statistically and computationally. In addition, most genetic association experiments in the near term will not rely solely on sequence data. Instead, a sub-sample of individuals will have sequence data, while the rest of the sample will remain unsequenced but will have genotype information. Alternatively, sequence data may be available on a separate, external sample. Thus it will be important to develop statistical methods that can appropriately integrate these various types of data into a unified inferential framework.

This research project will address these issues by developing a novel class of sequence-based haplotype sharing statistics that exploit the implications of DNA sequence evolution in testing for variant/disease association (specific aim 1). Further, we propose to develop a statistical framework that allows for the unified analysis of DNA sequence and genotype data (specific aim 2). Throughout, we will leverage our previous work developing robust methods for haplotype inference to develop computationally and statistically efficient procedures that remain robust to population genetic assumptions. A stratified analytic approach will be emphasized to allow for adjustment for confounding due to population stratification. Efficient Monte Carlo procedures will be proposed to account for the large number of sequence variants investigated.
We will develop a suite of software tools that fully implement the methodology developed and make them freely available to the general research community (specific aim 3). Finally, using these tools, we will analyze a publicly available DNA sequence dataset with the goal of better localizing disease-associated sequence variants (specific aim 4). The methods developed through this proposal represent a unified and statistically rigorous framework for developing powerful tests that exploit evolutionary relationships between DNA sequences while allowing for disparate data types to be incorporated into a unified analysis. These procedures will give researchers the tools to more finely localize disease-associated sequence variants, allowing variants to be better prioritized for subsequent investigation via functional studies.

Human genetics research is on the cusp of a major transformation in how genetic variation is captured: from a marker-based approach to one based on a complete characterization of an individual's genome by sequencing. The imminent production of large amounts of sequencing data, however, leads to questions concerning their statistical analysis and incorporation into the larger experiment. We address these questions by proposing a unified and statistically rigorous framework for developing powerful tests that exploit evolutionary relationships between DNA sequences and that allow for disparate data types to be incorporated into a unified analysis.
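
The abstract mentions a stratified analysis to adjust for population stratification and Monte Carlo procedures to account for the many variants tested, but it does not spell out either procedure. As a rough illustration of the general idea only, the sketch below permutes case/control labels within each stratum and compares the observed maximum statistic across variants with its permutation distribution. This is a generic textbook-style max-statistic permutation test, not the applicant's method; the function names (`max_stat`, `stratified_permutation_pvalue`), the simple mean-difference statistic, and the toy data are all assumptions made for this example.

```python
import random


def max_stat(genotypes, status):
    """Largest absolute case/control difference in mean allele count over variants.

    genotypes: list of per-individual lists of allele counts (0, 1, or 2).
    status:    list of 1 (case) or 0 (control), one entry per individual.
    """
    cases = [g for g, s in zip(genotypes, status) if s == 1]
    controls = [g for g, s in zip(genotypes, status) if s == 0]
    best = 0.0
    for j in range(len(genotypes[0])):
        case_mean = sum(g[j] for g in cases) / len(cases)
        control_mean = sum(g[j] for g in controls) / len(controls)
        best = max(best, abs(case_mean - control_mean))
    return best


def stratified_permutation_pvalue(genotypes, status, strata, n_perm=1000, seed=0):
    """Monte Carlo p-value for the max statistic, shuffling labels within strata.

    Shuffling only within each stratum keeps the case/control ratio of every
    stratum fixed, so differences between strata cannot drive the result.
    Taking the maximum over variants adjusts for testing many variants at once.
    """
    rng = random.Random(seed)
    observed = max_stat(genotypes, status)

    # Group individual indices by stratum label.
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)

    exceed = 0
    for _ in range(n_perm):
        permuted = list(status)
        for members in groups.values():
            labels = [status[i] for i in members]
            rng.shuffle(labels)
            for i, lab in zip(members, labels):
                permuted[i] = lab
        if max_stat(genotypes, permuted) >= observed:
            exceed += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Toy data (invented for illustration): two strata of ten individuals each,
# two variants, with the first variant perfectly tracking case status.
toy_status = ([1] * 5 + [0] * 5) * 2
toy_strata = ["A"] * 10 + ["B"] * 10
toy_genotypes = [[2, 1] if s == 1 else [0, 1] for s in toy_status]
toy_p = stratified_permutation_pvalue(toy_genotypes, toy_status, toy_strata, n_perm=199)
```

The mean-difference statistic is chosen only for brevity; the haplotype sharing statistics described in the abstract would take its place in any real analysis, and a fixed `seed` is used here so repeated runs give the same estimate.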
