
Variable Selection via Measurement Error Modeling


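Basic Information

Award number:
1406456
Principal investigator:
Leonard Stefanski
Amount:
$300,000
Host institution:
North Carolina State University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2014
Funding country:
United States
Period of performance:
2014-07-01 to 2019-06-30
Project status:
Completed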

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1406456&HistoricalAwards=false
Keywords:
Variable Selection via Measurement Error

Project Abstract

Technological advances make it possible to collect and store enormous amounts of data. The implications for how businesses run (online retailing, precision manufacturing), how science is conducted (environmental science, climate monitoring and modeling, astrophysics), and how governments operate (health care delivery, public safety, homeland security) are comparably enormous. However, for many particular uses of massive data sets, not all of the available information is relevant; and a key first step in many big-data explorations is the identification of the most relevant subset of information required to address the particular question at hand. For example, when studying certain diseases, it is essential to first identify the most relevant risk factors and precursors. The more information that is available, the more difficult it is to identify the most relevant subset for a particular purpose, akin to the problem of finding a needle in a haystack. Just as a threshing machine separates the wheat from the chaff, the research in this project will develop statistical methods that separate the relevant information (the wheat) from the information that is not relevant (the chaff), thereby enabling more focused and productive analyses of large data sets.
More specifically, the research in this project will develop methods for identifying the subset of information that is most relevant when the data are used to derive a regression/prediction model or algorithm. In this case the problem of separating the wheat from the chaff is the often-studied problem of variable selection. This project will develop a new approach to variable selection that differs conceptually from existing approaches and promises to offer new insights as well as new methodologies.
The new approach is based on the intuitive and universally relevant idea that a non-informative variable can be contaminated with noise without a subsequent loss of predictive power, whereas any amount of contamination of an informative predictor necessarily entails a loss of predictive power. Starting from the noise-contamination idea of variable informativeness, the project shows how the theory, methods, and algorithms from the field of measurement error modeling can be used to develop new methods of variable selection applicable across the full spectrum of model- and algorithm-based prediction methods. Instances of the general strategy will be studied and refined for several particular prediction methods, such as: nonparametric regression (based on splines, kernels, etc.); classification/regression trees; dimension reduction methods (principal components, partial least squares, sliced inverse regression (SIR), etc.); bagged or model-averaged predictors of any type; and ridge regression.
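
The noise-contamination idea can be illustrated with a small simulation. The sketch below is not the project's methodology. It is a minimal illustration that assumes ordinary least squares as the prediction method, Gaussian contamination, and in-sample mean squared error as the measure of predictive power. The function name `contamination_loss`, its parameters, and the simulated data are all hypothetical.

```python
import numpy as np


def contamination_loss(X, y, noise_sd=1.0, n_rep=20, seed=0):
    """Average increase in OLS mean squared error when one column is noised.

    For each predictor j, Gaussian noise with standard deviation `noise_sd`
    is added to column j only, the linear model is refit, and the increase
    in in-sample MSE over the uncontaminated fit is averaged over `n_rep`
    independent noise draws. (Illustrative assumption: OLS with intercept.)
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def fit_mse(Z):
        # Least-squares fit with an intercept; return in-sample MSE.
        A = np.column_stack([np.ones(n), Z])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(np.mean(resid ** 2))

    base = fit_mse(X)
    loss = np.zeros(p)
    for j in range(p):
        for _ in range(n_rep):
            Z = X.copy()
            Z[:, j] += rng.normal(0.0, noise_sd, size=n)  # contaminate column j only
            loss[j] += fit_mse(Z) - base
    return loss / n_rep


# Simulated example: columns 0 and 1 are informative, columns 2 and 3 are not.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(500, 4))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + demo_rng.normal(0.0, 0.5, size=500)
print(np.round(contamination_loss(X_demo, y_demo), 3))
```

In this setup, contaminating an informative column should raise the error noticeably, while contaminating a non-informative column should leave it essentially unchanged, which is the contrast the abstract describes. Ranking predictors by this loss is only one simple way to turn the idea into a selection rule.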
