
RUI: Predictive models with Incomplete and Fragmented Observations, and New Advances in Virtual Re-sampling for Big Data


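Basic Information

Award number:
2310504
Principal investigator:
Majid Mojirsheibani
Amount:
$200,000
Institution:
The University Corporation, Northridge
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2023
Funding country:
United States
Period:
2023-09-01 to 2026-08-31
Status:
Ongoing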

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2310504&HistoricalAwards=false
Keywords:
RUI Predictive models Incomplete Fragmented

Abstract

A major focus of this project is on the development of new procedures to carry out statistical modeling, prediction, and inference in the presence of missing data. Incomplete, missing, censored, and partially observed data are prevalent in many areas of medical sciences, engineering, economics and social sciences, which can in turn complicate the task of prediction and inference in data-driven decision-making processes. The investigator will study and explore the effectiveness of several new methods for handling missing values in complex data structures without imposing unrealistic or unnecessarily stringent conditions on the underlying mechanisms that cause the absence of information. Another major aim of this research project is to develop efficient data re-sampling methods to alleviate the formidable computational cost of computer-intensive statistical methods in big-data scenarios, where the data analyst must deal with, and sort through, massive amounts of data. The advent of such efficient methods is timely as the wave of ultra-large datasets has taken over many data-analytic initiatives in medicine, agriculture, and environmental protection. Additionally, this project embraces research experiences for graduate and undergraduate students, many of whom will then be persuaded to move on to further studies and research careers in STEM disciplines.

This research project deals with two broad classes of problems related to predictive models and inference. The first part focuses on selected topics in predictive models such as regression and classification for a number of nonstandard realistic setups. Specifically, the investigator will develop several local-averaging-type regression estimators in general metric spaces for incomplete and fractionally observed data with applications to statistical classification and the related problem of unsupervised machine learning. The aim is to carry out a rigorous study of the convergence properties of these estimators in various norms, which is necessary for correct prediction and inference. In particular, this project will study and develop new exponential performance bounds for the Lp norms of the proposed estimators. The problem of bandwidth estimation for incomplete and fragmented functional data will also be studied; this is particularly important as the optimal bandwidth minimizing quantities such as the MISE or ISE is not necessarily optimal in classification. The second part of this research plan considers new objectives in virtual re-sampling as a method to reduce the formidable computational cost of big-data bootstrap in a number of important and challenging problems, while still retaining the benefits of bootstrap methodology. In particular, the investigator will develop virtual re-sampling strategies to (i) approximate the distribution of several refined higher criticism statistics for multiple testing problems in big-data scenarios, and (ii) speed up the logarithmically slow rates of convergence of important functionals of density and regression estimators in two-sample problems such as those based on deconvolution density estimators and their sup-functionals for errors-in-variables models in big-data scenarios. To achieve the objectives under (i) and (ii), the investigator will use adaptations of the methodologies used in the strong approximations of bootstrap empirical processes in the literature.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
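
For readers unfamiliar with the term, the "local-averaging-type regression estimators" mentioned in the abstract can be illustrated by the textbook Nadaraya-Watson kernel estimator. The sketch below is not the investigator's method: it is a standard baseline that simply drops pairs with a missing response (a complete-case analysis), which is generally justified only when responses are missing completely at random, precisely the kind of stringent condition the project aims to avoid. The function names, the Gaussian kernel, the real-valued covariates, and the toy data are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def nw_complete_case(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[Y | X = x0] with bandwidth h.

    Hypothetical baseline: a response equal to None is treated as missing
    and the pair is skipped (complete-case analysis, an illustrative
    assumption rather than the approach described in the abstract).
    """
    num = 0.0
    den = 0.0
    for x, y in zip(xs, ys):
        if y is None:  # missing response: skip this pair
            continue
        w = gaussian_kernel((x - x0) / h)
        num += w * y
        den += w
    if den == 0.0:  # no observed response carries any weight
        return float("nan")
    return num / den


# Toy data: two of the five responses are missing.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.0, None, 2.0, None, 4.0]
estimate = nw_complete_case(1.0, xs, ys, h=0.5)
print(estimate)
```

The bandwidth `h` controls how local the average is: smaller values weight only the nearest observed points, larger values smooth over more of the sample. The abstract's remark on bandwidth estimation concerns how to choose this quantity when the data are incomplete.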
