Understanding Complexity and the Bias-Variance Tradeoff in High Dimensions: Theory and Data Evidence
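Basic information
- Award number: 2015341
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-07-01 to 2024-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract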
The past decade has seen a marked rise in the use of very large machine-learning models for modern data problems; these models have succeeded at a variety of tasks, such as image classification, language translation, and speech recognition. More recently, machine learning has been entering new fields, such as robotics, autonomous driving, and medicine. However, these models are often not robust to perturbations and are vulnerable to adversarial attacks. These shortcomings call for an urgent and insightful understanding of the "black-box" nature of these models. The principal investigator plans to understand these models by characterizing their "complexity" in a technical manner. A new complexity measure, based on the principle of minimum description length (MDL), sheds light on classical statistical foundations and informs how and when these new high-dimensional models will work. This novel complexity measure promises to enable applications in mission-critical fields like precision medicine, where collecting a labeled dataset is expensive, through sample-size calculations and improved model selection with limited data. This research has both theoretical and applied impacts in statistics and machine learning, including deep learning. Over the duration of the project, graduate students will be trained in theory, domain-driven data science, and open-source software development. The research will be further disseminated through courses, an upcoming book, and presentations at workshops and conferences.
Deep neural networks (DNNs) in many cases generalize well, in the sense that a DNN trained on one task often performs well on similar unseen data for the same task. They can do so despite being highly overparameterized, i.e., having many more parameters than training samples. Occam's razor and the conventional wisdom of the bias-variance trade-off suggest preferring a simple model when choosing among models of varying complexity with similar performance. The good performance of DNNs, despite the overparameterization, has led many researchers to question the validity of the classical statistical principle of bias-variance trade-off (and of preferring a simple model) in the high-dimensional settings common in modern machine learning (ML) and statistical tasks. In this project, the principal investigator begins by reconsidering the definition of a valid complexity measure, which forms the basis of Occam's razor and the bias-variance trade-off principle, for high-dimensional models. Finding such a measure for high-dimensional models has remained a difficult task. Merely counting the number of parameters is not a valid complexity measure, especially when the number of training examples is small. The principle of minimum description length will be used to provide a systematic approach to understanding the complexity of high-dimensional linear models, kernel methods, and finally DNNs. The complexity measure will serve as a basis for understanding key concepts such as the bias-variance trade-off and for further analysis of high-dimensional models. The theoretical results will be augmented with an extensive set of data-inspired experiments.
After the bias-variance trade-off has been established with the new complexity measures, these measures will be investigated for (i) selecting a simple model from among a set of competitive models, where "simple" is defined via the MDL-based complexity and not the number of parameters, and (ii) regularizing or pruning a large (pre-trained) model, for example in a transfer learning setting with a limited dataset, by trading off training performance against the complexity of the model.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (6)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
MDI+: A Flexible Random Forest-Based Feature Importance Framework
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Agarwal, Abhineet;Kenney, Ana M.;Tan, Yan Shuo;Tang, Tiffany M.;Yu, Bin
- Corresponding author: Yu, Bin
Fast Interpretable Greedy-Tree Sums (FIGS)
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Tan, Yan Shuo;Singh, Chandan;Nasseri, Keyan;Agarwal, Abhineet;Duncan, James;Ronen, Omer;Epland, Matthew;Kornblith, Aaron;Yu, Bin
- Corresponding author: Yu, Bin
The Three Stages of Learning Dynamics in High-dimensional Kernel Methods
- DOI:
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Nikhil Ghosh;Song Mei
- Corresponding author: Nikhil Ghosh;Song Mei
An investigation into the effects of pre-training data distributions for pathology report classification
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Hsu, Aliyah R.;Cherapanamjeri, Yeshwanth;Park, Briton;Naumann, Tristan;Odisho, Anobel Y.;Yu, Bin
- Corresponding author: Yu, Bin
Other publications by Bin Yu
Does ceruloplasmin differential express in the brain of Ts65Dn: a mouse mode of Down syndrome?
- DOI:
- Published: 2014
- Journal:
- Impact factor: 3.3
- Authors: Bin Yu;Jing Kong;Bao;Ziqi Zhu;Bin Zhang;Qiu;S. Shao
- Corresponding author: S. Shao
A PILOT STUDY IN AN APPLICATION OF TEXT MINING TO LEARNING SYSTEM EVALUATION by NITSAWAN KATERATTANAKUL
- DOI:
- Published: 2010
- Journal:
- Impact factor: 0
- Authors: Bin Yu
- Corresponding author: Bin Yu
Lamellar gel containing emulsions as an effective carrier for stabilization and transdermal delivery of retinyl propionate
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Yuyan Yang;Shaowei Yan;Bin Yu;Chang Gao;Kuan Chang;Jing Wang
- Corresponding author: Jing Wang
Verifiable Visual Cryptography Based on Iterative Algorithm
- DOI: 10.3724/sp.j.1146.2010.00270
- Published: 2011
- Journal:
- Impact factor: 0
- Authors: Bin Yu;Jin;Liguo Fang
- Corresponding author: Liguo Fang
Loc680254 regulates Schwann cell proliferation through Psrc1 and Ska1 as a microRNA sponge following sciatic nerve injury
- DOI: 10.1002/glia.24045
- Published: 2021-06
- Journal:
- Impact factor: 6.2
- Authors: Chun Yao;Qihui Wang;Yaxian Wang;Jiancheng Wu;Xuemin Cao;Yan Lu;Yanping Chen;Wei Feng;Xiaosong Gu;Xin‐Peng Dun;Bin Yu
- Corresponding author: Bin Yu
Other grants by Bin Yu
Advancing Theory and Methodology for Tree-Based Algorithms in High Dimensions
- Award number: 2209975
- Fiscal year: 2022
- Funding amount: $300,000
- Project type: Standard Grant
Parallel Ensemble Learning and Feature Interaction Discovery: High Volume Dynamic Data
- Award number: 1953191
- Fiscal year: 2020
- Funding amount: $300,000
- Project type: Standard Grant
Understand the functional mechanism of the DSP1 complex in the 3' end maturation of plant small nuclear RNAs
- Award number: 1818082
- Fiscal year: 2018
- Funding amount: $300,000
- Project type: Standard Grant
BIGDATA: F: Scalable and Interpretable Machine Learning: Bridging Mechanistic and Data-Driven Modeling in the Biological Sciences
- Award number: 1741340
- Fiscal year: 2017
- Funding amount: $300,000
- Project type: Standard Grant
Canonical Linear Methods and Hierarchical Non-Linear Methods in High-Dimensional Statistics
- Award number: 1613002
- Fiscal year: 2016
- Funding amount: $300,000
- Project type: Continuing Grant
Smart Nanofabrication via Rational Assembly of Two-Dimensional Heterosystems
- Award number: 1434689
- Fiscal year: 2014
- Funding amount: $300,000
- Project type: Standard Grant
Collaborative Research: Leverage Subsampling for Regression and Dimension Reduction
- Award number: 1228246
- Fiscal year: 2012
- Funding amount: $300,000
- Project type: Standard Grant
Direct Self-Assembly of Large Area, High Crystallinity 2D Graphene on Insulator: An Integratable Carbon Platform
- Award number: 1162312
- Fiscal year: 2012
- Funding amount: $300,000
- Project type: Standard Grant
Understanding DAWDLE Function in miRNA and siRNA Biogenesis
- Award number: 1121193
- Fiscal year: 2011
- Funding amount: $300,000
- Project type: Continuing Grant
Ultra-Low-Power Complementary Logic with On-Chip Directly Assembled, Highly Adaptive 2-D Graphitic Platform
- Award number: 1002228
- Fiscal year: 2010
- Funding amount: $300,000
- Project type: Standard Grant
Similar overseas grants
Conference: 17th International Conference on Computability, Complexity and Randomness (CCR 2024)
- Award number: 2404023
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Standard Grant
Addressing the complexity of future power system dynamic behaviour
- Award number: MR/S034420/2
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Fellowship
Addressing the complexity of future power system dynamic behaviour
- Award number: MR/Y00390X/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Fellowship
CAREER: Complexity Theory of Quantum States: A Novel Approach for Characterizing Quantum Computer Science
- Award number: 2339116
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Continuing Grant
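Development of spectroscopic analysis methods for phase-separated droplets of low-complexity sequences (Low-complexity配列の相分離液滴の分光学的解析法の開発)
- Award number: 23K23857
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Grant-in-Aid for Scientific Research (B)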
Building Molecular Complexity Through Enzyme-Enabled Synthesis
- Award number: DE240100502
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Discovery Early Career Researcher Award
Data Complexity and Uncertainty-Resilient Deep Variational Learning
- Award number: DP240102050
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Discovery Projects
Taming the complexity of the law: modelling and visualisation of dynamically interacting legal systems [RENEWAL].
- Award number: MR/X023028/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Fellowship
Career: The Complexity of Quantum Tasks
- Award number: 2339711
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Continuing Grant
22-BBSRC/NSF-BIO Building synthetic regulatory units to understand the complexity of mammalian gene expression
- Award number: BB/Y008898/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project type: Research Grant