Statistical Methods for High Dimensional Discrete Data

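Basic Information

  • Award number:
    1007801
  • Principal investigator:
  • Amount:
    $200,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2010
  • Funding country:
    United States
  • Project period:
    2010-06-01 to 2015-05-31
  • Project status:
    Completed

Project Abstract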

Very high dimensional count and binary data are now common in many fields, including machine learning, imaging and marketing. In high-throughput biology, ultra-high-throughput sequencing technologies, which produce count and categorical data, are displacing microarrays and other "omics" measurement devices. The output of these measurement devices is counts per gene or other biological subunit, for tens of thousands of responses per sample, or presence/absence for features such as single nucleotide polymorphisms (SNPs), for possibly millions of responses per sample. Similar data can be derived for features on satellite images, medical scans, monitoring devices and other very high dimensional measurement devices. The investigator will extend highly multivariate and multiple testing methods developed for continuous (primarily normally distributed) data to discrete data. New methods will be developed in four areas: A) analyses for differences in distribution for discrete data that can accommodate complex experimental designs, using generalized linear mixed models with overdispersion and Bayesian or empirical Bayes shrinkage; B) methods for supervised clustering of samples and variables in the discrete data setting, taking into account the error structure of the discrete predictors; C) classical and sufficient dimension reduction methods, such as canonical correlation and sliced inverse regression, for discrete data; D) extension of concepts and methods in multiple testing, such as false discovery rate estimation, to the discrete setting, in which the p-values from independent or weakly dependent tests may have different null distributions, using conditional mixture modeling. The methods will be tested on genomics and imaging data.

Very highly multivariate data are now the norm in fields as diverse as cell biology, marketing, medical and satellite imaging, meteorology, epidemiology, fraud detection and cancer research.
These data may include thousands or millions of measurements on each item in the sample. For example, genotyping services provide individuals with information on hundreds of thousands of genetic variants in their cells, and retailer databases may have information on the sales of tens of thousands of items for each store in the chain. Many of these data come in the form of counts (such as the number of items of each type in inventory, or the number of mRNA molecules encoding a particular protein) or in the form of categories (such as on/off, present/absent, or genotype AA, aa or Aa). Methodology for highly multivariate continuous measurements such as blood pressure and temperature is well developed but does not apply directly to count and categorical data. The investigator will develop statistical methodology and software to improve analysis and summary of count and categorical data. Four main areas of research are proposed: A) statistical models and tests to determine if the variables are associated with differences among groups; B) statistical methods for prediction or classification of group membership; C) methods to summarize the data with a much smaller set of derived variables that preserve the predictive power of the full data; and D) multiple comparisons methods to estimate the error rates. For example, in a study of the genes associated with metastatic versus non-metastatic cancer, the methods could be used to determine which genes are expressed differently in tumors that did or did not advance to metastasis, select a smaller set of genes that could be used as a diagnostic tool, and then provide convenient summaries that can readily be interpreted by clinicians. In a study of stresses on a machine part, the pixels of scans of the part before and during the application of the stresses could be used to determine precise locations at which the part might fail, and differences among features of the scan between parts that fail at low versus high stress.
In studies in which a large number of models are fitted or tests conducted, it is necessary to tolerate a small percentage of errors. Concepts and methods in multiple testing which have been developed for continuous data will be extended to assist in estimating and controlling the number of false conclusions with count and categorical data.
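As an illustration of why multiple testing needs care in the discrete setting, the sketch below computes exact one-sided binomial p-values for a handful of features and applies the standard Benjamini-Hochberg step-up procedure. It is not the conditional mixture modeling approach proposed in the abstract; the function names and the (successes, trials) data are hypothetical, chosen only to show that each discrete test has its own finite set of attainable p-values, so the null distributions differ from test to test.

```python
from math import comb


def binom_upper_pvalue(k, n, p0=0.5):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(k, n + 1))


def attainable_pvalues(n, p0=0.5):
    """Every p-value the exact test can produce for a given n, ascending."""
    return sorted(binom_upper_pvalue(k, n, p0) for k in range(n + 1))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank whose p-value is below its BH threshold.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical data: (successes, trials) for six features.
features = [(5, 5), (9, 10), (3, 4), (18, 20), (2, 5), (6, 10)]
pvals = [binom_upper_pvalue(k, n) for k, n in features]
rejected = benjamini_hochberg(pvals, alpha=0.05)

# The smallest attainable p-value depends on the number of trials:
# with n = 4 it is 1/16, so that test can never reject at level 0.05.
min_p_n4 = min(attainable_pvalues(4))
min_p_n20 = min(attainable_pvalues(20))

print("p-values:", [round(p, 6) for p in pvals])
print("rejected feature indices:", rejected)
print("smallest attainable p-value, n=4 vs n=20:", min_p_n4, min_p_n20)
```

Tests with few trials contribute p-values that can never be small, which makes procedures designed for uniformly distributed null p-values conservative. That is the kind of heterogeneity the abstract's area D) sets out to address.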

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Naomi Altman

Points of Significance: Bayes' theorem
  • DOI:
  • Published:
    2015
  • Journal:
  • Impact factor:
    48
  • Authors:
    J. L. Puga;M. Krzywinski;Naomi Altman
  • Corresponding author:
    Naomi Altman
Machine learning: A primer
  • DOI:
  • Published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    D. Bzdok;M. Krzywinski;Naomi Altman
  • Corresponding author:
    Naomi Altman
Points of Significance: Regularization
  • DOI:
  • Published:
    2016
  • Journal:
  • Impact factor:
    48
  • Authors:
    Jake Lever;M. Krzywinski;Naomi Altman
  • Corresponding author:
    Naomi Altman
Points of Significance: Analyzing outliers: influential or nuisance?
  • DOI:
  • Published:
    2016
  • Journal:
  • Impact factor:
    48
  • Authors:
    Naomi Altman;Martin Krzywinski
  • Corresponding author:
    Martin Krzywinski
Neural networks primer
  • DOI:
    10.1038/s41592-022-01747-1
  • Published:
    2023
  • Journal:
  • Impact factor:
    48
  • Authors:
    Alexander Derry;M. Krzywinski;Naomi Altman
  • Corresponding author:
    Naomi Altman

Other grants held by Naomi Altman

Mathematical Sciences Computing Research Environments
  • Award number:
    9627207
  • Fiscal year:
    1996
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Mathematical Sciences: Semi-parametric Methods for Longitudinal Data Analysis
  • Award number:
    9625350
  • Fiscal year:
    1996
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Mathematical Sciences: Computationally Intensive Problems in Statistics
  • Award number:
    8916245
  • Fiscal year:
    1990
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant

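Similar grants from the National Natural Science Foundation of China

Computational Methods for Analyzing Toponome Data
  • Award number:
    60601030
  • Approval year:
    2006
  • Funding amount:
    CNY 170,000
  • Project type:
    Young Scientists Fund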

Similar overseas grants

CAREER: Next-Generation Methods for Statistical Integration of High-Dimensional Disparate Data Sources
  • Award number:
    2422478
  • Fiscal year:
    2024
  • Funding amount:
    $200,000
  • Project type:
    Continuing Grant
Deepening and Expanding Research for Efficient Methods of Function Estimation in High Dimensional Statistical Analysis
  • Award number:
    23H03353
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Grant-in-Aid for Scientific Research (B)
CAREER: Practical algorithms and high dimensional statistical methods for multimodal haplotype modelling
  • Award number:
    2239870
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
    Standard Grant
Statistical methods for analysis of high-dimensional mediation pathways
  • Award number:
    10582932
  • Fiscal year:
    2023
  • Funding amount:
    $200,000
  • Project type:
Statistical Challenges and Methods in the Analysis of High Dimensional and Complex Structured Data
  • Award number:
    RGPIN-2018-05475
  • Fiscal year:
    2022
  • Funding amount:
    $200,000
  • Project type:
    Discovery Grants Program - Individual
Statistical theory and methods for high-dimensional data
  • Award number:
    RGPIN-2016-03890
  • Fiscal year:
    2022
  • Funding amount:
    $200,000
  • Project type:
    Discovery Grants Program - Individual
Statistical Challenges and Methods in the Analysis of High Dimensional and Complex Structured Data
  • Award number:
    RGPIN-2018-05475
  • Fiscal year:
    2021
  • Funding amount:
    $200,000
  • Project type:
    Discovery Grants Program - Individual
Innovative statistical methods for analysing high-dimensional counts
  • Award number:
    DP210101923
  • Fiscal year:
    2021
  • Funding amount:
    $200,000
  • Project type:
    Discovery Projects
Statistical theory and methods for high-dimensional data
  • Award number:
    RGPIN-2016-03890
  • Fiscal year:
    2021
  • Funding amount:
    $200,000
  • Project type:
    Discovery Grants Program - Individual
Statistical Methods for High-Dimensional Administrative Data
  • Award number:
    RGPIN-2017-04363
  • Fiscal year:
    2021
  • Funding amount:
    $200,000
  • Project type:
    Discovery Grants Program - Individual