
Computational Foundations of Machine Learning in the Era of Big Data

Basic Information

Grant number:
RGPIN-2017-05032
Principal investigator:
Yu, Yaoliang
Amount:
$40,800
Host institution:
University of Waterloo
Host institution country:
Canada
Project category:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=750238
Keywords:
Computational Foundations Machine Learning Era

Project Abstract

Machine learning (ML), a field that develops software that can improve itself through learning and experience, has been largely driven by the availability of historical data, and by the need to develop efficient and scalable algorithms and supporting theories. Conversely, the success of ML in science, engineering, and commerce, along with technological innovations, has led to unprecedented growth and enthusiasm in big data collection, thereby redefining computational efficiency and inviting system solutions. For example, DeepMind's recent AlphaGo system, which beat top human Go players, needed 1900 CPUs and 280 GPUs to carry out the computation. How can computation be balanced with communication in such a vast distributed cluster without compromising system throughput or correctness? On the other hand, a small startup developing a mobile app may not be able to afford the same computational power as Google, and hence often has to turn to primitive solutions. How can we build an algorithmic framework for ML that provides "knobs" to adjust the computational load, with an explicit, controllable loss in accuracy? Meeting such diverse computational needs in the big data era has thus been a grand challenge for the ML field.

We attempt to address this computational challenge in ML and big data through three complementary objectives.

(1) Real problems are hard, but also structured. Over the years the importance of designing statistical methodologies and computational algorithms that can exploit certain structure in data and model has become evident. Encouraged by our previous work on sparsity and low-rankness, we propose to investigate two additional structures that are common in ML applications, monotonicity and multi-modality (in the tensor format), and to develop efficient algorithms that benefit from the presence of such structures.

(2) Data is always noisy and full of random fluctuations, which diminishes the need to obtain exact or even high-precision solutions in ML. Approximate computation, if done properly, can significantly reduce the computation time in ML. We initiate a systematic study of the tradeoffs of approximate computation in ML, from "downgrading" computationally expensive programs to simpler and cheaper ones, to "optimally" smoothing nondifferentiable functions, to attaching measures of nonconvexity to nonconvex functions.

(3) Distributed computation has become the norm in handling big datasets. We propose the Bounded Asynchronous Protocol (BAP) to better balance communication and computation in distributed ML systems, and we continue to investigate the speedups and convergence guarantees of typical iterative ML algorithms under BAP and possibly less stringent convexity or smoothness assumptions.

Our work will further advance the computational theory and practice of ML, and the resulting algorithms and systems will be fundamental for analyzing big datasets using ML methodologies.

Project Outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Yu, Yaoliang

DEVIATE: A Deep Learning Variance Testing Framework

DOI:
10.1109/ase51524.2021.9678540
Publication year:
2021
Journal:
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE
Impact factor:
0
Authors:
Pham, Hung Viet;Kim, Mijung;Tan, Lin;Yu, Yaoliang;Nagappan, Nachiappan
Corresponding author:
Nagappan, Nachiappan