CAREER: Statistical Learning with Recursive Partitioning: Algorithms, Accuracy, and Applications


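Basic information

  • Award number:
    2239448
  • Principal investigator:
  • Amount:
    $450,000
  • Institution:
  • Institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Period:
    2023-06-01 to 2028-05-31
  • Project status:
    Ongoing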

Project abstract

As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been greater. Decision trees, a hierarchically organized data structure, are one such candidate, and they are increasingly used to build predictive or causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, or business contexts: the tree structure mimics the sequential way a human user may think and reason, thereby facilitating human-machine interaction. To make them fast to compute, decision trees are commonly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning and, in doing so, to gain insight into how to improve their performance in practice. In addition, high-school, undergraduate, and graduate research assistants will be vertically integrated into the research and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.

This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, examined in two fundamental contexts. The first thrust of the project will develop a theoretical framework for the analysis of oblique decision trees, where, in contrast to conventional axis-aligned splits involving only a single covariate, the splits at each decision node occur at linear combinations of the covariates. While oblique trees have garnered significant attention from the computer science and optimization communities since the mid-1980s, the advantages they offer over their axis-aligned counterparts remain only empirically justified, and explanations for their success are largely based on heuristics. Filling this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline for a statistician to compare and contrast decision trees with other, less interpretable methods, such as projection pursuit regression and neural networks, that target similar model forms. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI’s framework can accommodate many existing computational tools in the literature. A major component of the research is derived from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms). The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings where decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation. While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for multi-step semi-parametric settings, where preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools, as well as for conditional quantile regression; both require estimators with high pointwise accuracy.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
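The greedy, top-down recursive partitioning mentioned in the abstract can be sketched as follows. This is a minimal illustration only, assuming axis-aligned splits chosen by minimizing the children's sum of squared errors; it is not the PI's method, and the names `Node`, `sse`, `best_split`, `grow`, and `predict` are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    """A tree node: a leaf if `feature` is None, otherwise an internal split."""
    prediction: float
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sse(ys: Sequence[float]) -> float:
    """Sum of squared errors of `ys` around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X: List[List[float]], y: List[float]) -> Optional[Tuple[int, float]]:
    """Greedy step: the (feature, threshold) pair minimizing the children's total SSE.

    Returns None when no candidate split improves on the parent's SSE.
    """
    best, best_err = None, sse(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            err = sse(left) + sse(right)
            if err < best_err - 1e-12:
                best, best_err = (j, t), err
    return best


def grow(X: List[List[float]], y: List[float], depth: int = 0, max_depth: int = 2) -> Node:
    """Top-down recursion: split the current cell, then recurse on each child."""
    node = Node(prediction=sum(y) / len(y))
    if depth >= max_depth or len(y) < 2:
        return node
    split = best_split(X, y)
    if split is None:
        return node
    j, t = split
    left_idx = [i for i, row in enumerate(X) if row[j] <= t]
    right_idx = [i for i, row in enumerate(X) if row[j] > t]
    node.feature, node.threshold = j, t
    node.left = grow([X[i] for i in left_idx], [y[i] for i in left_idx], depth + 1, max_depth)
    node.right = grow([X[i] for i in right_idx], [y[i] for i in right_idx], depth + 1, max_depth)
    return node


def predict(node: Node, x: Sequence[float]) -> float:
    """Follow the splits from the root to a leaf and return the leaf mean."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy data (made up for illustration): the response depends only on the first covariate.
X = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 1.0]]
y = [1.0, 1.0, 5.0, 5.0]
tree = grow(X, y)
```

On this toy data the root split falls on the first covariate, and `predict(tree, [0.5, 9.0])` returns the left leaf's mean. An oblique variant would replace the single-covariate test `x[j] <= t` with a test on a linear combination of the covariates; searching over such hyperplanes is the combinatorial difficulty the abstract refers to.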

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Jason Klusowski

Deep Learning and Random Forests for High-Dimensional Regression
  • Award number:
    2054808
  • Fiscal year:
    2020
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
Deep Learning and Random Forests for High-Dimensional Regression
  • Award number:
    1915932
  • Fiscal year:
    2019
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant

Similar overseas grants

CAREER: New Frameworks for Ethical Statistical Learning: Algorithmic Fairness and Privacy
  • Award number:
    2340241
  • Fiscal year:
    2024
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Domain-aware Statistical Learning
  • Award number:
    2143695
  • Fiscal year:
    2022
  • Award amount:
    $450,000
  • Award type:
    Standard Grant
CAREER: Statistical Learning from a Modern Perspective: Over-parameterization, Regularization, and Generalization
  • Award number:
    2143215
  • Fiscal year:
    2022
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Understanding metal/support interactions in catalysis with statistical learning
  • Award number:
    2143941
  • Fiscal year:
    2022
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Federated Learning: Statistical Optimality and Provable Security
  • Award number:
    2144593
  • Fiscal year:
    2022
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Designing Meaningful Learning Experiences for Statistical Literacy in Secondary Mathematics
  • Award number:
    2143816
  • Fiscal year:
    2022
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Fast and Accurate Statistical Learning and Inference from Large-Scale Data: Theory, Methods, and Algorithms
  • Award number:
    2046874
  • Fiscal year:
    2021
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: New Statistical Paradigms Reconciling Empirical Surprises in Modern Machine Learning
  • Award number:
    2042473
  • Fiscal year:
    2021
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Nonconvex Optimization for Statistical Estimation and Learning: Conditioning, Dynamics, and Nonsmoothness
  • Award number:
    2047637
  • Fiscal year:
    2021
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant
CAREER: Smooth statistical distances for a scalable learning theory
  • Award number:
    2046018
  • Fiscal year:
    2021
  • Award amount:
    $450,000
  • Award type:
    Continuing Grant