Efficient Methods for Genotype-Specific Distributions with Unobserved Genotypes.

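Basic Information

Grant number:
8663321
Principal investigator:
Yuanjia Wang
Amount:
$263,400 (approx.)
Host institution:
COLUMBIA UNIVERSITY HEALTH SCIENCES
Host institution country:
United States
Project category:
Fiscal year:
2011
Funding country:
United States
Project period:
2011-07-15 to 2016-06-30
Project status:
Completed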

Project Abstract

DESCRIPTION (provided by applicant): This proposal develops a series of new semiparametric efficient methods for genetic data in which subjects' genotypes are not observed, so that phenotype data arise from a mixture of genotype-specific subpopulations. One example is data collected in a kin-cohort study, where the scientific interest is in estimating the distribution function of a trait, or of the time to developing a disease, for carriers of a deleterious mutation (the penetrance function). In a kin-cohort study, index subjects (probands), possibly enriched with mutation carriers, are sampled and genotyped. Disease history in relatives of the probands is collected, but the relatives are not genotyped, so it may be unknown whether they carry a mutation. However, one can calculate the probability of each relative being a mutation carrier using the proband's genotype and Mendelian laws. Another example is interval mapping of quantitative trait loci (QTL). In such studies, the genotype at a QTL is unobserved, so the trait distribution takes the form of a mixture of QTL-genotype-specific distributions. The probability of the QTL having a specific genotype is computed from marker genotypes and recombination fractions between the marker and the QTL. Interest is in estimating the QTL-genotype-specific distributions. A common feature of these examples is that the scientific interest is in inference on genotype-specific subpopulations, but it is unknown which subpopulation each observation belongs to. The probability of each observation being in any subpopulation varies and can be estimated. Without making a prespecified, error-prone parametric assumption on these genotype-specific distributions, the only available statistical methods in the literature are two distinct nonparametric maximum likelihood estimators (NPMLE1, NPMLE2). However, we will show that NPMLE1 is not efficient and NPMLE2 is not consistent.
There is therefore a great need to develop valid and efficient statistical tools for such data. We use modern semiparametric theory to carry out a formal semiparametric analysis in which we define a rich class of estimators. We show that any least-squares-based estimator is a member of this estimation class. We construct an optimal member of this family that attains the minimum estimation variance and hence reaches the semiparametric efficiency bound. For censored outcomes, we propose a semiparametric efficient estimator given an influence function of the complete uncensored data. We propose an inverse probability weighting estimator and add an augmentation term to obtain optimal efficiency. We also construct an imputation estimator that is easy to implement and does not require additional model assumptions for the imputation step. Furthermore, we propose methods to handle other observed covariates, such as gender, and additional residual correlation among family members. We also develop a series of tests for equality of two distributions at single or multiple time points simultaneously, and an overall test of two distributions being equal at all time points. We will apply the developed methods to analyze a kin-cohort study on Parkinson's disease, a large family study on Huntington's disease, and two QTL studies.
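
As an illustration of the kind of least-squares-based estimator the abstract refers to, the following is a minimal sketch, not the applicants' implementation. It assumes uncensored outcomes and that each observation i comes with a known vector of genotype probabilities q_i, so that P(Y_i <= t) = sum_j q_ij F_j(t); regressing the indicators 1(Y_i <= t) on q_i then gives an estimate of the genotype-specific distribution functions F_j(t). The function names `ls_mixture_cdf` and `simulate_mixture`, the two-genotype setup, and the simulated normal outcomes are all hypothetical choices made for this example.

```python
import numpy as np


def ls_mixture_cdf(y, q, t_grid):
    """Least-squares estimate of genotype-specific CDFs from mixture data.

    y      : (n,) uncensored outcomes
    q      : (n, p) known genotype probabilities (each row sums to 1)
    t_grid : (m,) points at which to estimate the CDFs
    Returns an (m, p) array whose [k, j] entry estimates F_j(t_grid[k]).
    """
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    # indicators[i, k] = 1 if y_i <= t_k, else 0
    indicators = (y[:, None] <= t_grid[None, :]).astype(float)
    # Solve min_F ||indicators - q @ F||^2 for all grid points at once.
    coef, *_ = np.linalg.lstsq(q, indicators, rcond=None)
    return coef.T


def simulate_mixture(n, seed=0):
    """Hypothetical two-genotype data: column 0 = non-carrier, column 1 = carrier."""
    rng = np.random.default_rng(seed)
    # Carrier probabilities vary across observations (illustrative values only).
    q_carrier = rng.choice([0.05, 0.5, 0.95], size=n)
    carrier = rng.random(n) < q_carrier  # true genotype, never passed to the estimator
    y = rng.normal(loc=np.where(carrier, 1.0, 0.0), scale=1.0)
    q = np.column_stack([1.0 - q_carrier, q_carrier])
    return y, q


y, q = simulate_mixture(5000, seed=0)
F_hat = ls_mixture_cdf(y, q, t_grid=[-1.0, 0.0, 0.5, 1.0, 2.0])
print(F_hat)  # column 0: non-carrier CDF estimates, column 1: carrier CDF estimates
```

When every row of `q` is a one-hot vector (genotypes fully known), this estimator reduces to the ordinary empirical distribution function within each genotype group. The plain least-squares fit is not constrained to be monotone or to stay within [0, 1], and it does not handle censoring; those are among the issues the proposed efficient, weighted, and imputation estimators are meant to address.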