
Robust Estimation and Inference


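Basic information

Grant number:
RGPIN-2014-05227
Principal investigator:
Zamar, Ruben
Amount:
$20,400
Host institution:
University of British Columbia
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2017
Funding country:
Canada
Start and end dates:
2017-01-01 to 2018-12-31
Project status:
Completed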

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=636031
Keywords:
Robust Estimation Inference

Project abstract

Errors and perturbations which must be filtered to obtain useful inferences and predictions arise from several sources, including: (1) random fluctuations, e.g. observations are affected by measurement errors, natural fluctuations and sampling variability, (2) data contamination, e.g. data often include measurements of uneven quality, outliers, gross errors and cases from populations other than the target one, and (3) missing data. Most traditional statistical procedures deal with (1), and there are many papers dealing with (2) and (3) separately. However, there are few papers dealing with (1), (2) and (3) simultaneously. Some of my proposed research will aim at filling this gap. I wish to develop procedures able to deal with all the above-mentioned sources of uncertainty, using computationally efficient and scalable algorithms.

Consider a data table with n rows (one for each case) and p columns (one for each variable or feature). With the advent of cheap computing and storage, many modern datasets are variables-rich and cases-poor. This is referred to as the "small n, large p" problem in the literature. This phenomenon is also related to the so-called curse of dimensionality in Statistics. Given a certain goal (e.g. prediction of future values for some response variable(s) in the data table), it is common to find that a large number of variables (which I call noise variables) hurt rather than help this task. Hence noise variables constitute a fourth type of perturbation which needs to be filtered to better extract the information contained in the remaining signal variables. In addition, signal variables themselves may be partially redundant, and subsets of signal variables (which we call phalanxes) may have better predictive power than the full set of signal variables. Phalanxes can be used to construct statistical models whose results can then be ensembled to provide a single prediction/classification.

The problem of selecting phalanxes (phalanx formation) is a generalization of model selection where we allow for different groups of variables to form cooperating models to perform a single task. There are many practical and theoretical questions regarding this model-building approach which I would like to address. Our former PhD student Jabed Tomal did some groundbreaking work on this topic in the context of drug discovery. Prof. Welch and I now wish to enroll a new PhD student to expand this work, which has potential for application in many areas of industry and science.

The classical robustness model is based on the paradigm that the vast majority of cases (rows in the data table) are free of contamination and useful to perform the given task. Hence, only a minority of contaminated cases may need to be identified and filtered (downweighted). Unfortunately, this paradigm is not fully satisfactory in the case of very high-dimensional data tables. If there is a small and independent probability, d, that a cell (individual entry in the data table) is contaminated, then the probability that a case (a row in the data table) is contaminated is e = 1 - (1-d)^{p}, which can quickly become larger than 0.5. For example, if d=0.01 and p=100 we have e=0.63397. Alqallaf, Van Aelst, Yohai and Zamar (2009) bring attention to this problem, called "propagation of outliers", and propose some possible approaches to address it. I wish to further study this problem. My former PhD student Mike Danilov constructed robust S-estimates of multivariate location and scatter that can efficiently deal with cells that are missing at random. This was an important building block for constructing estimates that are robust against outlier propagation. My current PhD student Andy Leung is pursuing this research direction.
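
The row-contamination formula in the abstract, e = 1 - (1-d)^{p}, can be checked numerically. The following is a minimal sketch, not part of the original proposal: the function names `row_contamination_probability` and `min_columns_for_majority` are hypothetical and introduced only for illustration, and the code assumes cells are contaminated independently with the same probability d, as the abstract states.

```python
def row_contamination_probability(d: float, p: int) -> float:
    """Probability that a row with p cells has at least one contaminated cell.

    Assumes each cell is contaminated independently with probability d,
    so a row is clean with probability (1 - d) ** p.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be a probability in [0, 1]")
    if p < 0:
        raise ValueError("p must be a non-negative integer")
    return 1.0 - (1.0 - d) ** p


def min_columns_for_majority(d: float, threshold: float = 0.5) -> int:
    """Smallest p for which the row-contamination probability exceeds threshold.

    Hypothetical helper: a plain linear search, adequate for illustration.
    """
    if not 0.0 < d <= 1.0:
        raise ValueError("d must be in (0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    p = 1
    while row_contamination_probability(d, p) <= threshold:
        p += 1
    return p


if __name__ == "__main__":
    # Reproduces the abstract's example: d = 0.01, p = 100.
    print(f"{row_contamination_probability(0.01, 100):.5f}")
    # Number of columns at which most rows are expected to be contaminated.
    print(min_columns_for_majority(0.01))
```

With d = 0.01, the search shows that rows are more likely contaminated than clean once the table has 69 or more columns, which illustrates why the case-wise contamination paradigm is strained in high dimensions.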
