BIGDATA: F: DKA: Collaborative Research: Theory and Algorithms for Parallel Probabilistic Inference with Big Data, via Big Model, in Realistic Distributed Computing Environments

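Basic Information

Award number:
1447721
Principal investigator:
Sinead Williamson
Amount:
$300,000
Host institution:
University of Texas at Austin
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2014
Funding country:
United States
Project period:
2014-09-01 to 2018-08-31
Project status:
Completed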

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1447721&HistoricalAwards=false
Keywords:
BIGDATA DKA Collaborative Research Theory

Abstract

This project develops a new framework that enables machine learning (ML) systems to automatically comprehend and mine massive and complex data via parallel Bayesian inference on large computer clusters. The research has a profound impact on the practice and direction of Big Learning. The developed technologies have a catalytic effect on both ML research and applications: ML scientists are able to rapidly experiment on novel, cutting-edge ML models with minimal programming effort, unhindered by the limitations of single machines. Researchers from other fields, like biology and social sciences, are able to run contemporary advanced ML methods that transcend the capabilities of simple models, yielding new scientific insights on data whose size would otherwise be daunting. Data scientists at small start-ups are able to conduct ML analytics with complex models, putting their capabilities on par with huge companies possessing dedicated engineering and infrastructure teams. Students and beginners are able to witness distributed ML in action with just a few lines of code, driving ML education to new heights. Technically, this research focuses on scaling up and parallelizing Bayesian machine learning, which provides a powerful, elegant and theoretically justified framework for modeling a wide variety of datasets. The research team develops a suite of complementary distributed inference algorithms for hierarchical Bayesian models, which cover most commonly used Bayesian ML methods. The project focuses on combining speed and scalability with theoretical guarantees that allow us to assess the accuracy of the resulting methods, and allow practitioners to make trade-offs between speed and accuracy. 
Rather than focus on a few disconnected models, the project develops techniques applicable to a broad spectrum of hierarchical Bayesian models, resulting in a toolkit of building blocks that can be combined as needed for arbitrary probabilistic models, whether parametric or nonparametric, discriminative or generative. This is in contrast to much existing work on parallel inference, which tends to focus on parallelization in a specific model and cannot be easily extended. The project provides a solid algorithmic foundation for learning on Big Data with powerful models. The research contributes to democratizing advanced and large-scale ML methods for broad applications, by offering the user and developer community a library of general-purpose parallelizable algorithms for working on diverse problems using computer clusters and the cloud, bridging the gap between practical needs from data and basic research in ML.
