Robust Estimation and Inference
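Basic information
- Grant number: RGPIN-2014-05227
- Principal investigator:
- Amount: $20,400
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2017
- Funding country: Canada
- Duration: 2017-01-01 to 2018-12-31
- Project status: Completed
- Source:
- Keywords: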
Project abstract
Errors and perturbations that must be filtered out to obtain useful inferences and predictions arise from several sources, including: (1) random fluctuations, e.g. observations are affected by measurement errors, natural fluctuations and sampling variability; (2) data contamination, e.g. data often include measurements of uneven quality, outliers, gross errors and cases from populations other than the target one; and (3) missing data. Most traditional statistical procedures deal with (1), and there are many papers dealing with (2) and (3) separately. However, few papers deal with (1), (2) and (3) simultaneously. Some of my proposed research will aim at filling this gap. I wish to develop procedures able to deal with all the above-mentioned sources of uncertainty, using computationally efficient and scalable algorithms.
Consider a data table with n rows (one for each case) and p columns (one for each variable or feature). With the advent of cheap computing and storage, many modern datasets are variable-rich and case-poor. This is referred to as the "small n, large p" problem in the literature. This phenomenon is also related to the so-called curse of dimensionality in statistics. Given a certain goal (e.g. prediction of future values for some response variable(s) in the data table), it is common to find that a large number of variables (which I call noise variables) hurt rather than help this task. Hence noise variables constitute a fourth type of perturbation, which needs to be filtered out to better extract the information contained in the remaining signal variables. In addition, signal variables themselves may be partially redundant, and subsets of signal variables (which we call phalanxes) may have better predictive power than the full set of signal variables. Phalanxes can be used to construct statistical models whose results can then be ensembled to provide a single prediction/classification.
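As a rough illustration of the ensembling idea described above, the following sketch fits one toy model per phalanx and averages their scores. Everything in it is an assumption made for illustration: the nearest-centroid scorer, the function names, the toy data and the hand-picked column groups are hypothetical, and the sketch does not implement phalanx formation itself.

```python
from statistics import mean


def fit_centroid_scorer(rows, labels, columns):
    """Fit a toy nearest-centroid scorer that uses only the given columns."""
    def centroid(target):
        members = [r for r, y in zip(rows, labels) if y == target]
        return [mean(r[c] for r in members) for c in columns]

    c0, c1 = centroid(0), centroid(1)

    def score(row):
        # Squared distances to each class centroid, restricted to this phalanx.
        d0 = sum((row[c] - m) ** 2 for c, m in zip(columns, c0))
        d1 = sum((row[c] - m) ** 2 for c, m in zip(columns, c1))
        # Score in [0, 1]; larger means "closer to class 1".
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

    return score


def ensemble_score(scorers, row):
    """Combine the per-phalanx models by averaging their scores."""
    return mean(s(row) for s in scorers)


# Toy data: columns 0, 1 and 2 carry signal; column 3 is a noise variable.
rows = [
    [0.0, 0.1, 0.2, 5.0],
    [0.2, 0.0, 0.1, -3.0],
    [1.0, 0.9, 1.1, 4.0],
    [0.9, 1.1, 0.8, -2.0],
]
labels = [0, 0, 1, 1]
phalanxes = [[0, 1], [2]]  # hypothetical, hand-picked groups of signal variables

scorers = [fit_centroid_scorer(rows, labels, cols) for cols in phalanxes]
print(ensemble_score(scorers, [0.95, 1.0, 0.9, 100.0]))
```

Because the noise column belongs to no phalanx, its value has no effect on the ensembled score, which is the filtering behaviour the abstract motivates.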
The problem of selecting phalanxes (phalanx formation) is a generalization of model selection in which we allow different groups of variables to form cooperating models that perform a single task. There are many practical and theoretical questions regarding this model-building approach which I would like to address. Our former PhD student Jabed Tomal did some groundbreaking work on this topic in the context of drug discovery. Prof. Welch and I now wish to enroll a new PhD student to expand this work, which has potential for application in many areas of industry and science.
The classical robustness model is based on the paradigm that the vast majority of cases (rows in the data table) are free of contamination and useful for performing the given task. Hence, only a minority of contaminated cases may need to be identified and filtered (downweighted). Unfortunately, this paradigm is not fully satisfactory in the case of very high-dimensional data tables. If there is a small and independent probability, d, that a cell (an individual entry in the data table) is contaminated, then the probability that a case (a row in the data table) is contaminated is e = 1 - (1 - d)^p, which can quickly become larger than 0.5. For example, if d = 0.01 and p = 100 we have e = 0.63397. Alqallaf, Van Aelst, Yohai and Zamar (2009) bring attention to this problem, called "propagation of outliers", and propose some possible approaches to address it. I wish to further study this problem. My former PhD student Mike Danilov constructed robust S-estimates of multivariate location and scatter that can efficiently deal with cells that are missing at random. This was an important building block for constructing estimates that are robust against outlier propagation. My current PhD student Andy Leung is pursuing this research direction.
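The row-contamination formula above can be checked numerically. The short sketch below evaluates e = 1 - (1 - d)^p for the abstract's example and finds the smallest p at which e exceeds 0.5; the function names are illustrative assumptions and not part of the original proposal.

```python
import math


def row_contamination_prob(d, p):
    """Probability that a row of p cells contains at least one contaminated
    cell, assuming each cell is contaminated independently with probability d."""
    return 1.0 - (1.0 - d) ** p


def smallest_p_exceeding(d, threshold=0.5):
    """Smallest number of columns p for which the row probability exceeds threshold."""
    p = math.ceil(math.log(1.0 - threshold) / math.log(1.0 - d))
    # Guard against the boundary case where equality holds exactly.
    while row_contamination_prob(d, p) <= threshold:
        p += 1
    return p


print(round(row_contamination_prob(0.01, 100), 5))  # 0.63397
print(smallest_p_exceeding(0.01))                   # 69
```

With d = 0.01, a table needs only 69 columns before more than half of its rows are expected to contain at least one contaminated cell, which is why row-wise downweighting becomes unsatisfactory in high dimensions.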
Project outputs
Journal articles: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
Other publications by Zamar, Ruben
Robust estimation of error scale in nonparametric regression models
- DOI: 10.1016/j.jspi.2008.01.005
- Published: 2008-10-01
- Journal:
- Impact factor: 0.9
- Authors: Ghement, Isabella Rodica; Ruiz, Marcelo; Zamar, Ruben
- Corresponding author: Zamar, Ruben
RSKC: An R Package for a Robust and Sparse K-Means Clustering Algorithm
- DOI: 10.18637/jss.v072.i05
- Published: 2016-08-01
- Journal:
- Impact factor: 5.8
- Authors: Kondo, Yumi; Salibian-Barrera, Matias; Zamar, Ruben
- Corresponding author: Zamar, Ruben
Other grants held by Zamar, Ruben
Robust Estimation and Model Ensemble Selection
- Grant number: RGPIN-2019-04201
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Robust Estimation and Model Ensemble Selection
- Grant number: RGPIN-2019-04201
- Fiscal year: 2021
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Robust Estimation and Model Ensemble Selection
- Grant number: RGPIN-2019-04201
- Fiscal year: 2020
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Robust Estimation and Model Ensemble Selection
- Grant number: RGPIN-2019-04201
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Application of robust statistical models to measure data quality for improved use of sensors and diagnostics in an active mine setting
- Grant number: 532134-2018
- Fiscal year: 2018
- Amount: $20,400
- Program: Engage Grants Program
Robust Estimation and Inference
- Grant number: RGPIN-2014-05227
- Fiscal year: 2018
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Design and development of machine learning and data optimization processes for detection of biogenic patterns
- Grant number: 500801-2016
- Fiscal year: 2016
- Amount: $20,400
- Program: Engage Plus Grants Program
Robust Estimation and Inference
- Grant number: RGPIN-2014-05227
- Fiscal year: 2016
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Robust Estimation and Inference
- Grant number: RGPIN-2014-05227
- Fiscal year: 2015
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Design and development of machine learning and data optimization processes for detection of biogenic patterns
- Grant number: 490833-2015
- Fiscal year: 2015
- Amount: $20,400
- Program: Engage Grants Program
Similar overseas grants
Identification, estimation, and inference of the discount factor in dynamic discrete choice models
- Grant number: 24K04814
- Fiscal year: 2024
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)
Partitioning-Based Learning Methods for Treatment Effect Estimation and Inference
- Grant number: 2241575
- Fiscal year: 2023
- Amount: $20,400
- Program: Standard Grant
Applying causal inference methods to improve estimation of the real-world benefits and harms of lung cancer screening
- Grant number: 10737187
- Fiscal year: 2023
- Amount: $20,400
- Program:
Inference and Model Building for Vision-based Estimation of Transmissive Objects
- Grant number: RGPIN-2017-05638
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Estimation and inference in directed acyclic graphical models for biological networks
- Grant number: 10330130
- Fiscal year: 2022
- Amount: $20,400
- Program:
Estimation and Inference with High-Dimensional Data
- Grant number: 2210850
- Fiscal year: 2022
- Amount: $20,400
- Program: Standard Grant
Improvement of nonparametric inference based on kernel type estimation and resampling method, and its application
- Grant number: 22K11939
- Fiscal year: 2022
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)
Non-parametric identification, estimation and inference: generalized functions approach
- Grant number: RGPIN-2020-05444
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Essential and incidental measurement error: Bayesian estimation and inference when sample measurements are random-variable-valued
- Grant number: RGPIN-2021-04357
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Estimation and Inference via Computational Statistics Algorithms
- Grant number: RGPIN-2019-04142
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual