High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems
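Basic information
- Grant number: RGPIN-2020-05011
- Principal investigator:
- Amount: $17,500
- Host institution:
- Host institution country: Canada
- Project category: Discovery Grants Program - Individual
- Fiscal year: 2021
- Funding country: Canada
- Duration: 2021-01-01 to 2022-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract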
Data science has become the center of attention in a wide range of scientific disciplines, thanks to ever-expanding means of data collection in today's world. The unprecedented size and structural complexity of current data in many applications call for computationally efficient and statistically sound methodologies for extracting useful information from such data. Toward this goal, the general theme of my research program is the analysis of high-dimensional data. More specifically, over the five years of this proposal, my short-term objectives are:
I) Statistical modeling of heterogeneous high-dimensional data: In applications such as health sciences, engineering and environment, social sciences, and financial econometrics, high-dimensional data often arise from heterogeneous populations consisting of multiple hidden homogeneous sub-populations. Finite mixture of regressions (FMR) and Markov regime-switching autoregressive (MSAR) models provide flexible tools for capturing unobserved heterogeneity in data. The latter models are used for modeling time series data. In practice, when fitting such models to a dataset, one faces three inferential problems: order selection, that is, estimation of the number of hidden sub-populations or regimes; variable selection; and so-called post-selection statistical inference, such as hypothesis testing or confidence intervals for the parameters of a data-driven selected model. Despite the wide application of these models, rigorous methodological developments addressing these problems in the growing literature on high-dimensional statistics have been very limited. In my short-term objectives, I will investigate new likelihood-based regularization techniques for order selection in FMR and MSAR, and for variable selection in sparse dynamic FMR and vector MSAR with fixed order and in high-dimensional settings. Establishing such results will pave the way toward post-selection inference problems, which are the subject of my long-term objectives.
II) High-dimensional imbalanced classification problems: In applications such as fraud detection, medical diagnosis, or equipment malfunction detection, classification tasks often suffer from both high dimensionality and imbalance in the observed frequency of some classes in the training data. The latter arises either from the data collection process or because some classes are indeed rare in the population. Due to data scarcity in the minority class(es), conventional discriminative methods are often biased toward the majority class(es), resulting in much higher misclassification rates for the minority class(es). Imbalanced classification problems are generally hard, so I begin by studying imbalanced linear binary cases. I will investigate the utility of divide-and-conquer techniques coupled with hard-thresholding variable selection methods for bias correction toward the minority class in standard linear discriminant analysis in high dimensions. I will also study multi-class problems.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Khalili, Abbas
Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space
- DOI: 10.1093/biostatistics/kxq048
- Published: 2011-01-01
- Journal:
- Impact factor: 2.1
- Authors: Khalili, Abbas; Chen, Jiahua; Lin, Shili
- Corresponding author: Lin, Shili
Disseminated Intravascular Coagulation Associated with Large Deletion of Immunoglobulin Heavy Chain
- DOI: 10.18502/ijaai.v20i6.8030
- Published: 2021-12-01
- Journal:
- Impact factor: 1.5
- Authors: Khalili, Abbas; Yadegari, Amir Hosein; Abolhassani, Hassan
- Corresponding author: Abolhassani, Hassan
Autosomal Recessive Agammaglobulinemia: A Novel Non-sense Mutation in CD79a
- DOI: 10.1007/s10875-014-9989-3
- Published: 2014-02-01
- Journal:
- Impact factor: 9.1
- Authors: Khalili, Abbas; Plebani, Alessandro; Aghamohammadi, Asghar
- Corresponding author: Aghamohammadi, Asghar
Order Selection in Finite Mixture Models With a Nonsmooth Penalty
- DOI: 10.1198/016214508000001075
- Published: 2008-12-01
- Journal:
- Impact factor: 3.7
- Authors: Chen, Jiahua; Khalili, Abbas
- Corresponding author: Khalili, Abbas
Order Selection in Finite Mixture Models With a Nonsmooth Penalty
- DOI: 10.1198/jasa.2009.0103
- Published: 2009-03-01
- Journal:
- Impact factor: 3.7
- Authors: Chen, Jiahua; Khalili, Abbas
- Corresponding author: Khalili, Abbas
Other grants by Khalili, Abbas
High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems
- Grant number: RGPIN-2020-05011
- Fiscal year: 2022
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
High-dimensional Data Analysis: Modeling Unobserved Heterogeneity in Data, and Studying Imbalanced Classification Problems
- Grant number: RGPIN-2020-05011
- Fiscal year: 2020
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
- Grant number: RGPIN-2015-03805
- Fiscal year: 2019
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
- Grant number: RGPIN-2015-03805
- Fiscal year: 2018
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
- Grant number: RGPIN-2015-03805
- Fiscal year: 2017
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
- Grant number: RGPIN-2015-03805
- Fiscal year: 2016
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Statistical inference in finite mixture of regressions and mixture-of-experts models in high-dimensional spaces, and varying coefficient finite mixture of regression models
- Grant number: RGPIN-2015-03805
- Fiscal year: 2015
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
- Grant number: 386578-2010
- Fiscal year: 2014
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
- Grant number: 386578-2010
- Fiscal year: 2013
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
Model selection and statistical inference in mixture distributions and hidden Markov (regression) models
- Grant number: 386578-2010
- Fiscal year: 2012
- Amount: $17,500
- Project category: Discovery Grants Program - Individual
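Similar National Natural Science Foundation of China (NSFC) grants
Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
- Grant number:
- Approval year: 2024
- Amount:
- Project category: Collaborative Innovation Research Team
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
- Grant number:
- Approval year: 2024
- Amount:
- Project category: Research Fund for International Young Scientists
Development of a Linear Stochastic Model for Wind Field Reconstruction from Limited Measurement Data
- Grant number:
- Approval year: 2020
- Amount: 400,000 CNY
- Project category:
Key technologies for semantic interoperability of Web services based on Linked Open Data
- Grant number: 61373035
- Approval year: 2013
- Amount: 770,000 CNY
- Project category: General Program
Molecular Interaction Reconstruction of Rheumatoid Arthritis Therapies Using Clinical Data
- Grant number: 31070748
- Approval year: 2010
- Amount: 340,000 CNY
- Project category: General Program
Functional data analysis methods for high-dimensional data
- Grant number: 11001084
- Approval year: 2010
- Amount: 160,000 CNY
- Project category: Young Scientists Fund
Role of datA, a negative regulator of chromosome replication, in the cell cycle
- Grant number: 31060015
- Approval year: 2010
- Amount: 250,000 CNY
- Project category: Regional Science Fund
Computational Methods for Analyzing Toponome Data
- Grant number: 60601030
- Approval year: 2006
- Amount: 170,000 CNY
- Project category: Young Scientists Fund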
Similar overseas grants
I-Corps: Vision analysis system using inferred three-dimensional data to analyze and correct a user’s pose in relation to 3D space
- Grant number: 2403992
- Fiscal year: 2024
- Amount: $17,500
- Project category: Standard Grant
Robust Three-Dimensional Pattern Recognition based on Object Oriented Data Analysis
- Grant number: 23K16900
- Fiscal year: 2023
- Amount: $17,500
- Project category: Grant-in-Aid for Early-Career Scientists
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
- Grant number: 10462257
- Fiscal year: 2023
- Amount: $17,500
- Project category:
IMAT-ITCR Collaboration: Combining FIBI and topological data analysis: Synergistic approaches for tumor structural microenvironment exploration
- Grant number: 10884028
- Fiscal year: 2023
- Amount: $17,500
- Project category:
Tensor decomposition methods for multi-omics immunology data analysis
- Grant number: 10655726
- Fiscal year: 2023
- Amount: $17,500
- Project category:
IMAT-ITCR Collaboration: Combining FIBI and topological data analysis: Synergistic approaches for tumor structural microenvironment exploration
- Grant number: 10885376
- Fiscal year: 2023
- Amount: $17,500
- Project category:
A web-based platform for robust single-cell analysis, bulk data deconvolution and system-level analysis
- Grant number: 10766073
- Fiscal year: 2023
- Amount: $17,500
- Project category:
Machine learning methods for the analysis and modeling of spatial proteomics data
- Grant number: 10576681
- Fiscal year: 2023
- Amount: $17,500
- Project category: