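Divide-and-Conquer Strategies for High-Dimensional Data Analysis
Final report
Approval number: 11871411
Project category: General Program
Funding amount: RMB 530,000
Principal investigator: Heng Lian (练恒)
Discipline classification: A0402. Statistical inference and statistical computing
Completion year: 2022
Approval year: 2018
Project status: Completed
Project participants: Jun Zhang (张君), Yan Zhou (周彦), Xin He (贺莘), Ben Dai (戴奔)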
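Chinese abstract
Because large data sets are often too big to load into the memory of a single machine, divide-and-conquer methods have received wide attention in recent years. That is, different subsets of the data are assigned to multiple machines, a statistical model is fitted on each machine separately, and the resulting estimates are finally pooled on a central machine and averaged.

For many models, the simple divide-and-conquer method described above can in theory achieve the same estimation performance as analyzing the entire data set on a single machine; this is the so-called oracle property of the divide-and-conquer method. However, in high-dimensional models where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, variable selection with a penalty function introduces estimation bias, so debiasing before aggregation is essential. In this project, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector machine classifiers. The aim of this research is to propose debiasing methods for these LASSO-penalized models, rigorously establish optimal convergence rates, and study their finite-sample properties through numerical simulations.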
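For orientation, the averaging and debiasing steps can be written out for the simplest case, a sparse linear model with a LASSO penalty. This is only an illustration of the general idea in its standard form. The project itself targets partially linear, quantile regression, and support vector machine models, whose corrections differ. The symbols $m$, $n_k$, $X_k$, $y_k$, and $\hat{\Theta}^{(k)}$ are notation assumed here, not taken from the abstract.

```latex
% Machine k (k = 1, ..., m) holds n_k observations (X_k, y_k)
% and computes a local LASSO estimate \hat{\beta}^{(k)}.
\begin{align*}
  \bar{\beta}
    &= \frac{1}{m} \sum_{k=1}^{m} \hat{\beta}^{(k)}
    && \text{(naive averaging; inherits the LASSO shrinkage bias)} \\
  \hat{\beta}^{(k),d}
    &= \hat{\beta}^{(k)}
       + \frac{1}{n_k}\, \hat{\Theta}^{(k)} X_k^{\top}
         \bigl( y_k - X_k \hat{\beta}^{(k)} \bigr)
    && \text{(local debiasing)} \\
  \bar{\beta}^{d}
    &= \frac{1}{m} \sum_{k=1}^{m} \hat{\beta}^{(k),d}
    && \text{(averaging after debiasing)}
\end{align*}
% \hat{\Theta}^{(k)} is an approximate inverse of the local sample
% covariance \hat{\Sigma}^{(k)} = X_k^{\top} X_k / n_k.
```

Averaging reduces variance but not bias, which is why the correction is applied on each machine before the estimates are pooled.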
English abstract
Given the recent rapid increase in the availability of extremely large data sets, the storage, access, and analysis of such data have become critical.

Because data sets are often too large to load into the memory of a single machine, let alone to analyze all at once, divide-and-conquer methodology has received significant attention. Conceptually, it involves distributing the data across multiple machines, fitting a standard statistical model on each local machine separately to obtain multiple estimates of the same parameters of interest, and finally pooling the estimates into a single estimate on a central machine by simple averaging.

For many models, the simple divide-and-conquer method described above can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine, which is called the oracle property of the divide-and-conquer method. However, for high-dimensional models in which the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, naive averaging fails because the bias induced by the penalty that makes high-dimensional estimation feasible propagates to the aggregate. Debiasing is therefore critical before aggregation. In this proposal, we plan to study the divide-and-conquer method for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classification. The purpose of this study is to propose debiasing methods for these penalized models and to establish rigorously the optimal convergence rate or even, in some cases, the asymptotic distribution of the aggregated estimates. Once achieved, this will deepen our understanding of the divide-and-conquer strategy and significantly expand its applicability.
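The contrast between naive averaging and averaging after debiasing can be sketched numerically. The example below is a minimal illustration under stated assumptions, not the project's method: it uses a sparse linear model with a LASSO penalty (rather than the partially linear, quantile regression, or support vector models studied in the project), simulated data, a simple proximal-gradient LASSO solver, and hypothetical helper names (`lasso_ista`, `debias`). It also keeps the dimension below the local sample size so that the inverse sample covariance can serve as the debiasing matrix. In that regime the correction reduces to local least squares. When the dimension exceeds the local sample size, a regularized estimate of the precision matrix would be needed instead.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=300):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    step_inv = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / step_inv, lam / step_inv)
    return beta


def debias(X, y, beta):
    """One-step bias correction: beta + Theta X^T (y - X beta) / n.

    Assumption: p < n, so Theta is taken as the inverse sample covariance.
    """
    n = X.shape[0]
    theta = np.linalg.inv(X.T @ X / n)
    return beta + theta @ X.T @ (y - X @ beta) / n


# Simulated data (illustrative sizes): n observations, p covariates, m machines.
rng = np.random.default_rng(0)
n, p, m = 4000, 20, 8
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
lam = 0.2

# Each "machine" fits a local LASSO on its own subset, then debiases locally.
naive, debiased = [], []
for X_k, y_k in zip(np.array_split(X, m), np.array_split(y, m)):
    b_k = lasso_ista(X_k, y_k, lam)
    naive.append(b_k)
    debiased.append(debias(X_k, y_k, b_k))

# The central machine only averages the m local vectors.
beta_naive = np.mean(naive, axis=0)
beta_debiased = np.mean(debiased, axis=0)

err_naive = np.linalg.norm(beta_naive - beta_true)
err_debiased = np.linalg.norm(beta_debiased - beta_true)
print(f"L2 error, naive average:    {err_naive:.3f}")
print(f"L2 error, debiased average: {err_debiased:.3f}")
```

In this setup the shrinkage bias of each local LASSO fit has the same sign on every machine, so averaging does not cancel it. The debiased estimates have no such systematic offset, and averaging them reduces their variance.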
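In this project, we studied the statistical properties of divide-and-conquer strategies, and related methods, for several complex models, including partially linear models, quantile regression models, and nonparametric models. The statistical theory we established clarifies some important theoretical properties of these popular methods in big data analysis. In particular, we derived the (often optimal) convergence rates of various estimators in different models under reasonable mathematical assumptions. Our results have been published in several top international journals.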
Journal articles
DOI: --
Published: 2022
Journal: Journal of Machine Learning Research
Impact factor: 6
Authors: Yingying Zhang; Yanyong Zhao; Heng Lian
Corresponding author: Heng Lian

Randomized sketches for kernel CCA
DOI: 10.1016/j.neunet.2020.04.006
Published: 2020-04
Journal: Neural Networks
Impact factor: 7.8
Authors: Heng Lian; Fode Zhang; Wenqi Lu
Corresponding author: Wenqi Lu

DOI: --
Published: 2020
Journal: Analysis and Applications
Impact factor: 2.2
Authors: Lei Wang; Heng Lian
Corresponding author: Heng Lian

DOI: 10.1214/18-aos1769
Published: 2019-10
Journal: Annals of Statistics
Impact factor: 4.5
Authors: Heng Lian; Kaifeng Zhao; Shaogao Lv
Corresponding author: Shaogao Lv

Approximate nonparametric quantile regression in reproducing kernel Hilbert spaces via random projection
DOI: 10.1016/j.ins.2020.08.039
Published: 2021-02
Journal: Information Sciences
Impact factor: 8.1
Authors: Fode Zhang; Rui Li; Heng Lian
Corresponding author: Heng Lian
Distributed random projection estimation methods in reproducing kernel Hilbert spaces