Ensemble Methods for Classification/Prediction With High-Dimensional Explanatory Variables

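Basic Information

Grant number:
RGPIN-2014-04962
Principal investigator:
Welch, William
Amount:
$13,100 (approx.)
Host institution:
University of British Columbia
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2014
Funding country:
Canada
Start and end dates:
2014-01-01 to 2015-12-31
Project status:
Completed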

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=566853
Keywords:
Ensemble Methods Classification Prediction Dimensional

Project Abstract

Advances in science and engineering have vastly increased the number of variables available to predict or classify a response outcome of interest. At the same time, the information in the data may be sparse. Novel methods based on ensembles of models are proposed for higher prediction accuracy. Methodology will be developed for two problems with these characteristics: prediction of complex computer codes, and prediction / classification in the analysis of drug discovery data.

Deterministic computer models can have complex relationships with high-dimensional input (explanatory) variables. For instance, the Community Land Model of the carbon cycle and vegetation dynamics has hundreds of inputs for the ecosystem, climate, hydrology, etc. Experiments with about 100 variables are aimed at sensitivity analysis, i.e., finding the inputs that have the most impact on an output such as a measure of total vegetation. It is feasible to make thousands of computer model runs, yet work to date shows that the input-output relationships are hard to model with useful accuracy. Most likely, there are complex interaction effects between the inputs, and identifying them is a challenge because of the high dimensionality.

In drug discovery, the input variables are "chemical descriptors" from computational chemistry that characterize drug-like molecules. Many descriptor sets are available, and each can have thousands of variables. The response variable, or output, comes from a physical assay of activity against a biological target implicated in a disease. A statistical model relating biological activity to the chemical inputs can be used to predict the activity of molecules that have not been assayed yet, increasing the efficiency of the search for candidate drugs. Unfortunately, active molecules are rare, so there is a paucity of information in the response data to fit a model.

Gaussian Processes (GPs) are widely used to model the deterministic input-output relationship of a computer code. They have also been used in the analysis of drug discovery data. The proposed approach to high-dimensional input and limited data information is based on ensembles of GPs, either by building separate models and averaging them, or by ensembles of correlation functions (which are key to the GP approach). Ensembles have well-known general advantages in prediction accuracy and are established as among the best methods for the drug discovery problem, for example. They typically generate multiple prediction models by perturbing the data (bootstrapping) or dividing the data observations, and then fitting a model to each data set created. The models are then averaged when making predictions. With high-dimensional input, however, sparse information in the response data means that most of the input variables are unused in a model when it is fit to data.

In contrast, the proposed approach is to build an ensemble of models over distinct subsets of input variables. A subset of inputs with interaction effects should be in the same model; variables that do not interact can be in separate models. It is easier for a data set to fill the input space densely a few variables at a time, which increases prediction accuracy. Furthermore, by attributing variables to different models, more inputs have a chance to contribute to prediction accuracy.

The challenges and goals of the research program are how to identify subsets of high-dimensional input variables that should be together in the same model, how to combine models for high overall prediction accuracy, and how to develop efficient algorithms that overcome the computational demands of GP models. The overarching goal is to understand how a statistical model like a GP should be tuned to the complexities of relationships involving high-dimensional input.
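The idea of an ensemble of GP models over distinct subsets of inputs can be illustrated with a minimal sketch. This is not the project's method. The names `SubsetGP`, `fit_subset_ensemble`, and `predict_subset_ensemble` are hypothetical, and the sketch rests on several assumptions: the variable subsets are supplied by hand (identifying them is one of the stated research questions), the length scale and nugget are fixed rather than estimated, and the models are combined additively by summing each model's predicted deviation from the overall mean. That rule is only one possible choice, since the abstract leaves open how models should be combined.

```python
import numpy as np


def rbf_correlation(A, B, length_scale):
    """Squared-exponential correlation between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


class SubsetGP:
    """GP regression restricted to a chosen subset of input columns.

    Assumptions of this sketch: fixed length scale, unit process variance,
    and a fixed nugget that absorbs variation due to inputs outside the subset.
    """

    def __init__(self, columns, length_scale=0.25, nugget=0.1):
        self.columns = list(columns)
        self.length_scale = length_scale
        self.nugget = nugget

    def fit(self, X, y):
        self.X_train = X[:, self.columns]
        self.y_mean = y.mean()
        K = rbf_correlation(self.X_train, self.X_train, self.length_scale)
        K += self.nugget * np.eye(len(y))
        # Weights of the GP posterior mean for the centered response.
        self.weights = np.linalg.solve(K, y - self.y_mean)
        return self

    def predict_centered(self, X_new):
        """Predicted deviation of the response from its overall training mean."""
        k_new = rbf_correlation(
            X_new[:, self.columns], self.X_train, self.length_scale
        )
        return k_new @ self.weights


def fit_subset_ensemble(X, y, subsets):
    """Fit one GP per subset of input columns (subsets are assumed given)."""
    return [SubsetGP(cols).fit(X, y) for cols in subsets]


def predict_subset_ensemble(models, X_new):
    """Additive combination (an assumption): mean plus summed deviations."""
    return models[0].y_mean + sum(m.predict_centered(X_new) for m in models)


# Toy usage: inputs 0 and 1 interact, input 2 acts on its own.
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 3))
y = np.sin(2 * np.pi * X[:, 0]) * X[:, 1] + X[:, 2] ** 2
models = fit_subset_ensemble(X, y, subsets=[[0, 1], [2]])
predictions = predict_subset_ensemble(models, rng.uniform(size=(5, 3)))
```

In the toy usage, the two interacting inputs share one model and the third input has its own, mirroring the grouping principle described in the abstract. Each model works in only one or two dimensions, so 400 runs fill its input space far more densely than they would fill the full three-dimensional space.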
