Non-uniform sampling of permutations and large scale hypothesis testing

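Basic information

Award number:
1521145
Principal investigator:
Art Owen
Amount:
$399,700
Institution:
Stanford University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2015
Funding country:
United States
Period:
2015-08-01 to 2019-07-31
Status:
Completed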

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1521145&HistoricalAwards=false
Keywords:
Non uniform sampling permutations large

Abstract

Modern scientific tools are delivering very large data sets. This is especially true in biology, where expression levels for thousands of genes, or even the specific DNA information at millions of locations on the genome, can be measured. Scientists would like to correlate these variables with other measured quantities, especially the presence or absence of a disease. When millions of hypotheses are investigated, it is possible that one of them will correlate with some genes just by chance. It is common to insist that the observed correlation for one test be so strong that it would happen by chance at most once in 20 million tries. The usual way to measure chance correlations is to shuffle the data at random and see how often a strong effect appears. If the event of interest is a one in 20 million outcome, we usually need about ten times that many random shuffles to be sure. This proposal is about finding more efficient random shuffling strategies to get a desired answer with fewer shuffles. The goal is to find important biological variables with much less computation and greater reliability. Finding the important genes is a first step for follow-up work that includes mining the literature and running experiments to understand the role of those genes and determine whether their relationship is useful or not. Part of the work will also involve adjusting for other factors, measured or otherwise, that could make the observed correlations misleading. New mathematical methods for finding and measuring rare and unusual outcomes can also be used in industrial problems where the rare phenomenon is an unusually effective product design as measured by computer simulations.

The usual way to test whether a gene or a gene set is associated with a phenotype (disease, height, etc.) or a treatment (diet, medicines, etc.) is to run a permutation test. From n data points, there are as many as n! permutations to run. Usually this number of permutations is beyond our budget, so we sample from the permutations instead. If we compute our test statistic M times, once on the original data and once for each of M-1 permutations, then the smallest p value we can possibly get is 1/M. That is, to attain a target p value we have to compute our statistic at least 1/p times. The standard threshold for genome-wide association studies translates into a bare minimum of 20,000,000 computations. Adequate power in a permutation test requires more like 10/p computations. When the phenotype/treatment is binary, the permutation test reduces to sampling without replacement, since only the choice of which data points receive each label matters. This project uses non-uniform sampling of permutations or combinations. The main method is importance sampling from mixtures of proposals using the mixture component probabilities as control variates. Markov chain Monte Carlo methods will also be investigated.
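To make the 1/M floor concrete, here is a minimal sketch of the ordinary uniform-shuffle permutation test that the abstract describes as the baseline; it is not the non-uniform sampling method the project proposes. The function name `permutation_p_value`, the absolute difference in group means as the test statistic, and the toy data are all assumptions chosen for illustration.

```python
import random


def permutation_p_value(x, labels, num_evals, seed=0):
    """Monte Carlo permutation p value for a binary label (illustrative).

    num_evals plays the role of M in the abstract: one evaluation on the
    original data plus M - 1 evaluations on uniformly shuffled labels.
    The statistic (absolute difference in group means) is an assumption.
    """
    rng = random.Random(seed)

    def statistic(lab):
        group1 = [v for v, l in zip(x, lab) if l == 1]
        group0 = [v for v, l in zip(x, lab) if l == 0]
        return abs(sum(group1) / len(group1) - sum(group0) / len(group0))

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 1  # the original data counts as one of the M evaluations
    for _ in range(num_evals - 1):
        # Shuffling binary labels is the same as drawing, without
        # replacement, which data points receive label 1.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / num_evals  # can never be smaller than 1 / num_evals


# Hypothetical toy data: two perfectly separated groups of four points.
values = [10, 11, 12, 13, 0, 1, 2, 3]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p_value(values, groups, num_evals=200))
```

Because `hits` starts at 1, the returned value is bounded below by 1/M no matter how extreme the observed statistic is. That bound is why a target of one in 20 million needs at least 20,000,000 evaluations with uniform shuffles, and why reducing that cost motivates the non-uniform sampling studied in this project.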