Bayesian Learning for Sparse High-Dimensional Data

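Basic information

Grant number: 2889818
Principal investigator:
Amount: --
Host institution: University of Liverpool
Host institution country: United Kingdom
Project category: Studentship
Fiscal year: 2023
Funding country: United Kingdom
Period: 2023 to (no data)
Project status: Ongoing

Source: https://gtr.ukri.org/projects?ref=studentship-2889818
Keywords: Bayesian Learning Sparse Dimensional Data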

Project abstract

This project focuses on understanding uncertainty in machine learning models trained on limited datasets. In many problems the number of data points is small relative to the number of features. Typical solutions assume independence of features or use dimensionality reduction to learn a maximum likelihood projection of the data. For small data sets, learnt models are critically dependent on the actual data points used. The project will investigate whether Bayesian methods can be used to characterise the uncertainty of estimated parameters efficiently when developing machine learning models for sensor signal time series.

Much recent progress in machine learning has relied on the availability of large datasets, which allows the development of complex models. However, many problems in defence and security do not have access to such data, either because they require the use of less widely studied sensors (such as sonar) or because they relate to adversaries, who strive to limit data about their activities. Most published models rely on point estimates of parameters, obtained through algorithms such as maximum likelihood or stochastic gradient descent. However, when this type of model is applied in situations with limited data, the uncertainty associated with parameter estimates is usually not taken into account, either when integrating machine learning models into wider systems or when assessing performance to predict how the model might behave in operational scenarios. Even when other approaches to dealing with limited datasets are used, such as transfer learning, uncertainty characterisation is still important, as there is often a mismatch between the distributions of the pre-training and training datasets.

This project aims to investigate to what extent Bayesian methods can be used to characterise the uncertainty of estimated parameters when dealing with sparse but potentially high-dimensional data sets, and how this can be implemented in a distributed computing setting. The expected outcome of the project is the development of suitable Bayesian algorithms, along with a software implementation, and an analysis of algorithm performance on relevant datasets.

The research will start with a literature review of appropriate approaches, which could include Variational Bayesian methods, Markov Chain Monte Carlo (MCMC), Sequential Monte Carlo (SMC), Approximate Bayesian Computation (ABC), and other approximate methods. Consideration will be given to the computational feasibility of the algorithms, including the extent to which computing can be distributed to multiple processors or virtual machines in a cloud infrastructure, and to the transparency (confidence) and performance improvements the various approaches could provide. Suitable innovative techniques will be developed, assessed, and compared against baseline approaches. Bayesian Neural Networks (BNN) will also be researched, with implementations using techniques such as SMC and MCMC methods, amongst others. The algorithms will be applied to a number of sponsor-supplied datasets, such as sonar sensor or electrical device measurement time series. The research will determine the extent to which the uncertainty representation accommodates operational data that may not have the same distribution as the training data. Based on discussions with the sponsor and an analysis of the results, industrially relevant scenarios where the algorithms can be used will be identified.
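
The contrast the abstract draws between point estimates and a characterisation of parameter uncertainty can be illustrated with a minimal sketch. It is not the project's method: it assumes a conjugate Gaussian linear model, in which the posterior over the weights is available in closed form, with fewer data points than features. The function names, the prior precision `alpha`, the noise precision `beta`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def bayes_linreg_posterior(X, y, alpha=1.0, beta=25.0):
    """Gaussian posterior over weights w for y = X w + noise.

    Assumes the prior w ~ N(0, I / alpha) and Gaussian noise with
    precision beta; both are treated as known for illustration only.
    """
    p = X.shape[1]
    precision = alpha * np.eye(p) + beta * X.T @ X
    cov = np.linalg.inv(precision)
    mean = beta * cov @ X.T @ y
    return mean, cov

def predictive(x_new, mean, cov, beta=25.0):
    """Predictive mean and variance at a single input x_new."""
    mu = x_new @ mean
    var = 1.0 / beta + x_new @ cov @ x_new
    return mu, var

# Synthetic "small n, large p" problem: 10 data points, 50 features.
rng = np.random.default_rng(0)
n, p = 10, 50
w_true = np.zeros(p)
w_true[:3] = [1.5, -2.0, 0.7]
X = rng.normal(size=(n, p))
y = X @ w_true + rng.normal(scale=0.2, size=n)

mean, cov = bayes_linreg_posterior(X, y)

# A training input versus an input far from the training data,
# standing in for operational data with a different distribution.
x_seen = X[0]
x_far = 5.0 * rng.normal(size=p)
mu_seen, var_seen = predictive(x_seen, mean, cov)
mu_far, var_far = predictive(x_far, mean, cov)

print("predictive variance at a training input:", var_seen)
print("predictive variance at a shifted input: ", var_far)
```

A point estimate would return only `mean`. The posterior covariance `cov` additionally records which directions in feature space the ten data points leave unconstrained, so the predictive variance grows for inputs that lie outside the span of the training data. For models without a closed-form posterior, such as Bayesian neural networks, approximate methods like those listed in the abstract (MCMC, SMC, variational methods) would take the place of the analytic formula.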
