
Sparsity, thresholding and regularization in data science


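Basic information

Grant number:
RGPIN-2022-04531
Principal investigator:
DiazRodriguez, Jairo
Amount:
$13.8 thousand
Host institution:
York University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed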

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=758949
Keywords:
Sparsity; thresholding; regularization; data science

Project abstract

Data science has brought a breakthrough in the way decisions are made in real-life problems such as fraud detection, healthcare, targeted advertising, website recommendations, and speech recognition, among others. Practical implementations of classic and new statistical learning techniques are the common denominator of these advances. However, most such applications are still poorly understood by their creators, and solutions are mainly reached by trial and error. One of the most useful techniques in machine learning is regularization. It helps to cope with overfitting and also imposes structure on the solutions of the optimization algorithms. Thresholding estimators are a particular class of regularization estimators that impose sparse structure on the solution. Sparsity assumes that only a few covariates are needed in the model to explain a given response. For instance, just a few genes are relevant to explaining a given disease. Moreover, sparsity can give interpretability or physical meaning to the result. The objective of this proposal is to develop theory and innovative methodologies for solving and understanding machine and statistical learning models by using sparse regularization and thresholding estimators.

The proposal consists of the following three lines of research. First, high-dimensional data routinely arise in econometrics, machine learning, neuroscience, and social science. I will extend my previous work on thresholding estimators to other methodologies for high-dimensional data, with particular interest in applications involving categorical data in the social sciences. Second, I am interested in predicting the risk of indirectly transmitted diseases. This can be seen as a high-dimensional tomographic inverse problem. The objective is to perform an epidemiologic tomography of a region by reconstructing the areas of high and low disease risk from non-invasive measurements such as GPS animal movements, by imposing a sparse total variation spatial structure. The resulting methodology will be implemented in a full data science framework. Finally, I propose to use thresholding estimators to impose sparse structures on deep learning methodologies. Current deep autoencoders tend to force the architecture of a neural network. I propose to impose thresholding regularizers to jointly estimate the network architecture. I also propose a new dropout framework based on L1 regularization: instead of randomly dropping units, I propose to randomly select the regularization parameter. These sparsity-based methodologies might lead to better interpretation of the methods and might facilitate the derivation of mathematical properties. The success of this research program will contribute greatly to the understanding of high-dimensional data, machine learning, and big data, and will promote the application of interpretable sparse regularization in many fields in the natural sciences, social sciences, and engineering.
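
As a minimal illustration of how a thresholding estimator imposes sparsity, the sketch below applies the standard soft- and hard-thresholding rules to a handful of coefficient estimates. It is not the proposal's own methodology: the function names, the example coefficients, and the threshold `lam` are all hypothetical, chosen only to show small estimates being set exactly to zero. Soft thresholding is the closed-form minimizer of 0.5 * (z - b)^2 + lam * |b| over b, which is why it is associated with L1 regularization.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def hard_threshold(z, lam):
    """Keep z unchanged when |z| > lam; otherwise return exactly 0.0."""
    return z if abs(z) > lam else 0.0


# Hypothetical noisy coefficient estimates: two large effects, three near zero.
coefs = [3.0, -0.2, 0.1, -1.5, 0.05]
lam = 0.5  # hypothetical threshold (regularization parameter)

soft = [soft_threshold(c, lam) for c in coefs]
hard = [hard_threshold(c, lam) for c in coefs]

print(soft)  # [2.5, 0.0, 0.0, -1.0, 0.0]
print(hard)  # [3.0, 0.0, 0.0, -1.5, 0.0]
```

Both rules zero out the three small coefficients, leaving a sparse vector. Soft thresholding also shrinks the surviving coefficients by `lam`, while hard thresholding leaves them unchanged.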
