Advancing Theory and Methodology for Tree-Based Algorithms in High Dimensions

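Basic information

  • Award number:
    2209975
  • Principal investigator:
  • Amount:
    $330,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2022
  • Funding country:
    United States
  • Project period:
    2022-07-15 to 2025-06-30
  • Project status:
    Ongoing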

Project abstract

Predictive statistical modeling has long been part of the backbone of science and engineering. In recent years, the proliferation of big data has created a need to go beyond traditional linear models toward flexible models that can exploit complicated nonlinear relationships. Models based on decision trees have emerged as an easy-to-use and high-performing class of models, especially for unstructured tabular datasets such as electronic health records, on which they have typically been found to outperform neural networks. Furthermore, because decision trees can be easily visualized and simulated by non-experts, they are easier to audit than black-box machine learning models, which is especially important when predictions are used to guide high-stakes decisions in the clinic or the courtroom. Unfortunately, models based on decision trees are not well understood statistically, and it is still unclear when and why various models obtain better relative predictive performance. The project plans to bridge this gap by identifying structural properties of real-world datasets that make them either amenable or not amenable to current tree-based models. This understanding will then be used to develop better algorithms based on decision trees, as well as methodology to extract reproducible scientific insights from these models. Over the duration of the project, graduate students will be trained in theory, domain-driven data science, and open-source software development. Research results will further be disseminated through courses, an upcoming book, and presentations at workshops and conferences.

The project plans two thrusts to develop relevant theory for decision trees and random forests. First, it will analyze the generalization performance of tree-based algorithms on a range of different generative regression models in order to elicit their inductive bias. Inductive bias is a well-known concept from machine learning, defined as the assumptions an algorithm makes when generalizing to new data. Since real-world datasets often present some structure that can be exploited using the right inductive bias, results of this project will allow better identification of which algorithm to choose in a given application, thus improving on classical nonparametric regression analysis of decision trees and random forests. Second, the project will study a new general framework for obtaining model-agnostic nonlinear feature significance measures using mean decrease in impurity (MDI) feature importance. This framework makes use of a novel interpretation of MDI in terms of r-squared values from linear regression, and is asymptotically valid even if the decision tree used to generate MDI is not necessarily a good model for the underlying regression function.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
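
For readers unfamiliar with MDI, the following is a minimal sketch of how the standard quantity can be computed for a single small regression tree. It is not the project's framework: the greedy variance-reduction splitting rule, the function names (`sum_sq`, `best_split`, `mdi`), the depth and sample-size limits, and the synthetic data are all assumptions chosen for illustration. The sketch also shows a well-known identity: the MDI values of one tree sum to the reduction in training variance, so dividing their sum by the total variance gives the tree's training R-squared.

```python
import random

def sum_sq(ys):
    """Sum of squared deviations from the mean (n times the variance)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)

def best_split(X, y, idx):
    """Best axis-aligned split of the samples in idx.

    Returns (decrease in sum of squares, feature index, threshold),
    or (0.0, None, None) when no split reduces the impurity.
    """
    parent = sum_sq([y[i] for i in idx])
    best = (0.0, None, None)
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in idx if X[i][j] <= t]
            right = [y[i] for i in idx if X[i][j] > t]
            decrease = parent - sum_sq(left) - sum_sq(right)
            if decrease > best[0]:
                best = (decrease, j, t)
    return best

def mdi(X, y, max_depth=3, min_split=10):
    """Grow a greedy regression tree and accumulate MDI per feature.

    Each split adds (node share of samples) * (variance decrease),
    which equals (decrease in sum of squares) / n.
    """
    n = len(X)
    importance = [0.0] * len(X[0])
    stack = [(list(range(n)), 0)]
    while stack:
        idx, depth = stack.pop()
        if depth >= max_depth or len(idx) < min_split:
            continue
        decrease, j, t = best_split(X, y, idx)
        if j is None:
            continue
        importance[j] += decrease / n
        stack.append(([i for i in idx if X[i][j] <= t], depth + 1))
        stack.append(([i for i in idx if X[i][j] > t], depth + 1))
    return importance

# Hypothetical data: only feature 0 drives the response.
rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + rng.gauss(0.0, 0.1) for row in X]

importance = mdi(X, y)
r2_train = sum(importance) / (sum_sq(y) / len(y))
print("MDI per feature:", importance)
print("Training R-squared of the tree:", r2_train)
```

In this toy setting the signal feature should receive almost all of the importance. The values are computed on the training data, so any importance assigned to the two noise features reflects splits that fit noise.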

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Bin Yu

Does ceruloplasmin differential express in the brain of Ts65Dn: a mouse mode of Down syndrome?
  • DOI:
  • Publication date:
    2014
  • Journal:
  • Impact factor:
    3.3
  • Authors:
    Bin Yu;Jing Kong;Bao;Ziqi Zhu;Bin Zhang;Qiu;S. Shao
  • Corresponding author:
    S. Shao
A PILOT STUDY IN AN APPLICATION OF TEXT MINING TO LEARNING SYSTEM EVALUATION by NITSAWAN KATERATTANAKUL
  • DOI:
  • Publication date:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Bin Yu
  • Corresponding author:
    Bin Yu
Lamellar gel containing emulsions as an effective carrier for stabilization and transdermal delivery of retinyl propionate
Verifiable Visual Cryptography Based on Iterative Algorithm
  • DOI:
    10.3724/sp.j.1146.2010.00270
  • Publication date:
    2011
  • Journal:
  • Impact factor:
    0
  • Authors:
    Bin Yu;Jin;Liguo Fang
  • Corresponding author:
    Liguo Fang
Loc680254 regulates Schwann cell proliferation through Psrc1 and Ska1 as a microRNA sponge following sciatic nerve injury
  • DOI:
    10.1002/glia.24045
  • Publication date:
    2021-06
  • Journal:
  • Impact factor:
    6.2
  • Authors:
    Chun Yao;Qihui Wang;Yaxian Wang;Jiancheng Wu;Xuemin Cao;Yan Lu;Yanping Chen;Wei Feng;Xiaosong Gu;Xin‐Peng Dun;Bin Yu
  • Corresponding author:
    Bin Yu

Other grants by Bin Yu

Understanding Complexity and the Bias-Variance Tradeoff in High Dimensions: Theory and Data Evidence
  • Award number:
    2015341
  • Fiscal year:
    2020
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Parallel Ensemble Learning and Feature Interaction Discovery: High Volume Dynamic Data
  • Award number:
    1953191
  • Fiscal year:
    2020
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Understand the functional mechanism of the DSP1 complex in the 3' end maturation of plant small nuclear RNAs
  • Award number:
    1818082
  • Fiscal year:
    2018
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
BIGDATA: F: Scalable and Interpretable Machine Learning: Bridging Mechanistic and Data-Driven Modeling in the Biological Sciences
  • Award number:
    1741340
  • Fiscal year:
    2017
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Canonical Linear Methods and Hierarchical Non-Linear Methods in High-Dimensional Statistics
  • Award number:
    1613002
  • Fiscal year:
    2016
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
Smart Nanofabrication via Rational Assembly of Two-Dimensional Heterosystems
  • Award number:
    1434689
  • Fiscal year:
    2014
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Collaborative Research: Leverage Subsampling for Regression and Dimension Reduction
  • Award number:
    1228246
  • Fiscal year:
    2012
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Direct Self-Assembly of Large Area, High Crystallinity 2D Graphene on Insulator: An Integratable Carbon Platform
  • Award number:
    1162312
  • Fiscal year:
    2012
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Understanding DAWDLE Function in miRNA and siRNA Biogenesis
  • Award number:
    1121193
  • Fiscal year:
    2011
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
Ultra-Low-Power Complementary Logic with On-Chip Directly Assembled, Highly Adaptive 2-D Graphitic Platform
  • Award number:
    1002228
  • Fiscal year:
    2010
  • Award amount:
    $330,000
  • Project type:
    Standard Grant

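Similar grants from the National Natural Science Foundation of China

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Project type:
    Provincial or municipal project
Microscopic dynamical mechanisms of physical quantities in dusty plasmas, studied using isomorph theory
  • Award number:
    12247163
  • Approval year:
    2022
  • Award amount:
    CNY 180,000
  • Project type:
    Special project
Toward a general theory of intermittent aeolian and fluvial nonsuspended sediment transport
  • Award number:
  • Approval year:
    2022
  • Award amount:
    CNY 550,000
  • Project type:
Translation of the English monograph "FRACTIONAL INTEGRALS AND DERIVATIVES: Theory and Applications"
  • Award number:
    12126512
  • Approval year:
    2021
  • Award amount:
    CNY 120,000
  • Project type:
    Tianyuan Mathematics Fund project
Research on and application of a fuzzy semantic theory of natural language based on Restriction-Centered Theory
  • Award number:
    61671064
  • Approval year:
    2016
  • Award amount:
    CNY 650,000
  • Project type:
    General Program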

Similar overseas grants

Methodology of Argument Construction in Medieval Indian Argumentation Theory
  • Award number:
    23K18636
  • Fiscal year:
    2023
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Research Activity Start-up
Theory, Methodology and Tools for Tailoring CAPT Feedback
  • Award number:
    23K00679
  • Fiscal year:
    2023
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Scientific Research (C)
Technology-Driven and Scalable Regression Methodology, Computing and Theory
  • Award number:
    DP230101179
  • Fiscal year:
    2023
  • Award amount:
    $330,000
  • Project type:
    Discovery Projects
Test analysis methodology connecting classical test theory and item response theory
  • Award number:
    22K18633
  • Fiscal year:
    2022
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Challenging Research (Exploratory)
Development of a graph-theory methodology for the design of vibration suppression systems
  • Award number:
    2765808
  • Fiscal year:
    2022
  • Award amount:
    $330,000
  • Project type:
    Studentship
A multimodal semiotic analysis of online "prepper" communities through visual grounded theory methodology, combined with a quantitative hierarchal clu
  • Award number:
    2750561
  • Fiscal year:
    2022
  • Award amount:
    $330,000
  • Project type:
    Studentship
Development and innovation of statistical theory and methodology of network meta-analysis
  • Award number:
    22H03554
  • Fiscal year:
    2022
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Establishment of an applied methodology for geographical and economic impacts of transportation infrastructure projects based on spatial economic theory
  • Award number:
    22H01617
  • Fiscal year:
    2022
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Scientific Research (B)
Development of a General Framework for Nonlinear Prediction Using Auto-Cumulants: Theory, Methodology, and Computation
  • Award number:
    2131233
  • Fiscal year:
    2021
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
Building a Methodology for Multicultural and Interdisciplinary Collaborative Education: Based on Knowledge Theory and SciTS
  • Award number:
    20K02970
  • Fiscal year:
    2020
  • Award amount:
    $330,000
  • Project type:
    Grant-in-Aid for Scientific Research (C)