CAREER: Statistical Learning with Recursive Partitioning: Algorithms, Accuracy, and Applications

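Basic Information

Award number:
2239448
Principal investigator:
Jason Klusowski
Amount:
$450,000
Host institution:
Princeton University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2023
Funding country:
United States
Project period:
2023-06-01 to 2028-05-31
Project status:
Ongoing

Source: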
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2239448&HistoricalAwards=false
Keywords:
CAREER; Statistical Learning; Recursive Partitioning

Project Abstract

As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been more important. Decision trees, a hierarchically organized data structure, are one such candidate and are increasingly used to build predictive or causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, or business contexts, as the tree structure mimics the sequential way a human user may think and reason, thereby facilitating human-machine interaction. To make them fast to compute, decision trees are popularly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning, and, in doing so, gain insights into how to improve their performance in practice. In addition to this impact, high-school, undergraduate, and graduate research assistants will be vertically integrated into the research and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.

This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, as demonstrated in two fundamental contexts. The first thrust of the project will develop a theoretical framework for the analysis of oblique decision trees, where, in contrast to conventional axis-aligned splits involving only a single covariate, the splits at each decision node occur at linear combinations of the covariates.
While oblique trees have garnered significant attention from the computer science and optimization communities since the mid-1980s, the advantages they offer over their axis-aligned counterparts remain only empirically justified, and explanations for their success are largely based on heuristics. Filling this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline for a statistician to compare and contrast decision trees with other less interpretable methods, such as projection pursuit regression and neural networks, that target similar model forms. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI’s framework can accommodate many existing computational tools in the literature. A major component of the research is derived from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms). The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings where decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation.
While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for multi-step semi-parametric settings, where preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools, as well as conditional quantile regression, both of which require estimators with high pointwise accuracy.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
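
To make the phrase "greedy, top-down" concrete, here is a minimal sketch of axis-aligned recursive partitioning for regression, with each split chosen by minimizing squared error. It illustrates the general technique and is not the PI's method. The names `Node`, `best_split`, `grow_tree`, and `predict`, and the stopping rules `max_depth` and `min_samples`, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    """A leaf stores only a prediction; an internal node also stores a split."""
    prediction: float
    feature: Optional[int] = None      # index of the covariate used to split
    threshold: Optional[float] = None  # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def _sse(ys: Sequence[float]) -> float:
    """Sum of squared errors of ys around their mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X: Sequence[Sequence[float]],
               y: Sequence[float]) -> Optional[Tuple[int, float]]:
    """Search all axis-aligned splits; return (feature, threshold) or None."""
    best = None
    best_cost = _sse(y)  # a split must strictly reduce the squared error
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            cost = _sse(left) + _sse(right)
            if cost < best_cost - 1e-12:
                best, best_cost = (j, t), cost
    return best


def grow_tree(X: List[List[float]], y: List[float],
              max_depth: int = 3, min_samples: int = 2) -> Node:
    """Greedy, top-down construction; assumes X and y are non-empty."""
    node = Node(prediction=sum(y) / len(y))
    if max_depth == 0 or len(y) < min_samples:
        return node
    split = best_split(X, y)
    if split is None:  # no split improves the fit, so stop here
        return node
    j, t = split
    left_idx = [i for i, row in enumerate(X) if row[j] <= t]
    right_idx = [i for i, row in enumerate(X) if row[j] > t]
    node.feature, node.threshold = j, t
    # Each child is fit only on its own subset; earlier splits are never revisited.
    node.left = grow_tree([X[i] for i in left_idx], [y[i] for i in left_idx],
                          max_depth - 1, min_samples)
    node.right = grow_tree([X[i] for i in right_idx], [y[i] for i in right_idx],
                           max_depth - 1, min_samples)
    return node


def predict(node: Node, x: Sequence[float]) -> float:
    """Follow the splits from the root down to a leaf."""
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Toy data: a step function of a single covariate.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0.0, 0.0, 1.0, 1.0]
tree = grow_tree(X_demo, y_demo)
print(tree.feature, tree.threshold)  # prints: 0 1.5
```

The exhaustive search in `best_split` is tractable only because each candidate split involves one covariate. An oblique split would instead search over directions and thresholds jointly, which is the combinatorial difficulty the abstract refers to.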

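
The objects named in the first thrust can be written in symbols as follows. The abstract defines no symbols, so the notation is an assumption of this note: $x \in \mathbb{R}^p$ is the covariate vector, $(x_i, y_i)$ are the training pairs, $t$ is the region of covariate space belonging to the current node, $a_k$ and $a$ are direction vectors, $b$ is a threshold, $c_k$ are coefficients, and $g_k$ are univariate functions.

```latex
\begin{align*}
% Regression model: a linear combination of ridge functions
f(x) &= \sum_{k=1}^{K} c_k \, g_k\!\left(a_k^{\top} x\right), \\[4pt]
% Oblique split of node t (axis-aligned is the special case a = e_j)
t_L &= \{x \in t : a^{\top} x \le b\}, \qquad
t_R = \{x \in t : a^{\top} x > b\}, \\[4pt]
% Greedy squared-error criterion for choosing the split
(\hat a, \hat b) &\in \operatorname*{arg\,min}_{a,\, b}
  \sum_{i:\, x_i \in t_L} \left(y_i - \bar y_{L}\right)^2
  + \sum_{i:\, x_i \in t_R} \left(y_i - \bar y_{R}\right)^2,
\end{align*}
```

Here $\bar y_L$ and $\bar y_R$ denote the mean responses of the training points falling in $t_L$ and $t_R$. Restricting $a$ to a standard basis vector $e_j$ gives the conventional axis-aligned split on the single covariate $x_j$.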