CRII: SHF: Optimizing Deep Learning Training through Modeling and Scheduling Support

Basic information

Project abstract

Deep learning models trained on large amounts of data with substantial computing resources have recently achieved state-of-the-art performance on important yet challenging artificial intelligence tasks. The success of deep learning has attracted significant research interest from the hardware and software communities in improving training speed and efficiency. Despite the great efforts and rapid progress made, one important bridge connecting software and hardware support with deep learning domain knowledge is still missing: efficient configuration exploration and runtime scheduling. Both the quality of deep learning models and the training time are very sensitive to many adjustable parameters that are set before and during the training process, including hyperparameter configurations (such as learning rate, momentum, and the number and size of hidden layers) and system configurations (such as thread parallelism, model parallelism, and data parallelism). Efficient exploration of hyperparameter configurations and judicious selection of system configurations are of great importance for finding high-quality models at affordable time and cost. This is, however, a challenging problem due to a huge search space, expensive training runtime, sparsity of good configurations, and scarcity of time and resources.

The objective of this research work is to systematically study the unique properties of deep learning systems and workloads, and to establish new modeling and scheduling methodologies for improving deep learning training. The PI aims to improve the efficiency of discovering high-performing models through a dynamic scheduling methodology driven by a novel hyperparameter configuration classification approach. The PI also aims to develop an accuracy- and efficiency-aware hybrid scheduling methodology that makes judicious scheduling decisions based on a global view of both the time dimension (accuracy potential) and the spatial dimension (efficiency potential). This research work integrates techniques in workload characterization, performance modeling, resource management, and scheduling to dramatically speed up the training process while significantly reducing the cost in time and resources. More broadly, this project will gain foundational knowledge about the interaction between software-hardware support and deep learning domain knowledge. This knowledge can help design next-generation deep learning systems and frameworks, making deep learning training accessible to researchers and practitioners with limited system and machine learning domain expertise. This research will help enhance curriculum and provide research topics for both undergraduate and graduate students, especially students from underrepresented groups.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (37)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CEDULE: A Scheduling Framework for Burstable Performance in Cloud Computing
InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache
  • DOI:
  • Published:
    2020-01
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ao Wang;Jingyuan Zhang;Xiaolong Ma;Ali Anwar;Lukas Rupprecht;Dimitrios Skourtis;Vasily Tarasov
  • Corresponding author:
    Ao Wang;Jingyuan Zhang;Xiaolong Ma;Ali Anwar;Lukas Rupprecht;Dimitrios Skourtis;Vasily Tarasov
BatchCrypt: Efficient Homomorphic Encryption for Cross-Silo Federated Learning
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chengliang Zhang;Suyi Li;Junzhe Xia;Wei Wang;Feng Yan;Yang Liu
  • Corresponding author:
    Chengliang Zhang;Suyi Li;Junzhe Xia;Wei Wang;Feng Yan;Yang Liu
Gradient Compression Supercharged High-Performance Data Parallel DNN Training
It's not a Sprint, it's a Marathon: Stretching Multi-resource Burstable Performance in Public Clouds

Other publications by Feng Yan

Spatial and temporal variations of annual precipitation during 1960–2010 in China
  • DOI:
    10.1016/j.quaint.2014.12.047
  • Published:
    2015-09
  • Journal:
  • Impact factor:
    2.2
  • Authors:
    Yanjiao Wang;Xianyan Chen;Feng Yan
  • Corresponding author:
    Feng Yan
Viscosity of two-dimensional strongly coupled dusty plasma modified by a perpendicular magnetic field
  • DOI:
    10.1103/physreve.96.053208
  • Published:
    2017
  • Journal:
  • Impact factor:
    2.4
  • Authors:
    Feng Yan;Lin Wei;Murillo M. S.
  • Corresponding author:
    Murillo M. S.
Separative extended-gate AlGaAs/GaAs HEMT biosensors based on capacitance change strategy
  • DOI:
    10.1063/5.0001786
  • Published:
    2020-03
  • Journal:
  • Impact factor:
    4
  • Authors:
    Jiahuan Yu;Mengke Xu;Lingyan Liang;Min Guan;Yang Zhang;Feng Yan;Hongtao Cao
  • Corresponding author:
    Hongtao Cao
Fluctuation theorem convergence in a viscoelastic medium demonstrated experimentally using a dusty plasma
  • DOI:
    10.1103/physreve.104.035207
  • Published:
    2021
  • Journal:
  • Impact factor:
    2.4
  • Authors:
    Huang Dong;Lu Shaoyu;Shi Xia-qing;Goree J.;Feng Yan
  • Corresponding author:
    Feng Yan
Structure, Magnetism and Spin Coupling Mechanism of Cyano-Bridged LnIII–FeIII Binuclear Metal Complexes
  • DOI:
    10.1023/a:1015143113847
  • Published:
    2002
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xianru Sun;Zhi;Feng Yan;Song Gao;K. Cheung;C. Che;Xi
  • Corresponding author:
    Xi

Other grants by Feng Yan

CAREER: Photovoltaic Devices with Earth-Abundant Low Dimensional Chalcogenides
  • Award number:
    2413632
  • Fiscal year:
    2024
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: Machine Learning-assisted Ultrafast Physical Vapor Deposition of High Quality, Large-area Functional Thin Films
  • Award number:
    2226918
  • Fiscal year:
    2023
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
Collaborative Research: Photomechanical Behavior in Photovoltaic Semiconductors
  • Award number:
    2330728
  • Fiscal year:
    2023
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
PFI-TT: Highly Efficient, Scalable, and Stable Carbon-based Perovskite Solar Modules
  • Award number:
    2329871
  • Fiscal year:
    2023
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: DMREF: AI-enabled Automated design of ultrastrong and ultraelastic metallic alloys
  • Award number:
    2323766
  • Fiscal year:
    2023
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
Collaborative Research: Design and Discovery of Entropy-Stabilized Perovskite Halide Materials for Optoelectronics
  • Award number:
    2330738
  • Fiscal year:
    2023
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
CAREER: Automated and Efficient Machine Learning as a Service
  • Award number:
    2305491
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: Design and Discovery of Entropy-Stabilized Perovskite Halide Materials for Optoelectronics
  • Award number:
    2127640
  • Fiscal year:
    2021
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
CAREER: Automated and Efficient Machine Learning as a Service
  • Award number:
    2048044
  • Fiscal year:
    2021
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
I-Corps: Printable Carbon-based Perovskite Thin Film Solar Cells
  • Award number:
    2039883
  • Fiscal year:
    2020
  • Award amount:
    $175,000
  • Award type:
    Standard Grant

Similar National Natural Science Foundation of China (NSFC) grants

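Structure-activity analysis and antibacterial mechanism of peptides derived from the natural ultrashort antimicrobial peptide Temporin-SHf
  • Award number:
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Award type:
    Provincial/municipal-level project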
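Function and mechanism of the adaptor protein SHF in negatively regulating EGFR/EGFRvIII recycling and stability in glioblastoma
  • Award number:
    82302939
  • Approval year:
    2023
  • Award amount:
    CNY 300,000
  • Award type:
    Young Scientists Fund project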
Mechanism of the EGFR/GRβ/Shf regulatory loop in glioma
  • Award number:
    81572468
  • Approval year:
    2015
  • Award amount:
    CNY 600,000
  • Award type:
    General Program project

Similar overseas grants

Collaborative Research: SHF: Medium: Co-optimizing Spectral Algorithms and Systems for High-Performance Graph Learning
  • Award number:
    2212370
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: SHF: Medium: Co-Optimizing Computation and Data Transformations for Sparse Tensors
  • Award number:
    2107556
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: SHF: Medium: Co-Optimizing Computation and Data Transformations for Sparse Tensors
  • Award number:
    2106621
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: SHF: Medium: Co-Optimizing Computation and Data Transformations for Sparse Tensors
  • Award number:
    2107135
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
Collaborative Research: SHF: Medium: Co-optimizing Spectral Algorithms and Systems for High-Performance Graph Learning
  • Award number:
    2212371
  • Fiscal year:
    2022
  • Award amount:
    $175,000
  • Award type:
    Continuing Grant
SHF: Small: Characterizing and Optimizing 3D NAND Flash
  • Award number:
    1908793
  • Fiscal year:
    2019
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
SHF: Small: Optimizing Consolidation Efficiency of Emerging Virtualized Cloud Applications on Contemporary Server Architecture
  • Award number:
    1527535
  • Fiscal year:
    2015
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
SHF: Medium: Collaborative Research: Principled Optimizing Compilation of Dependently Typed Languages
  • Award number:
    1559983
  • Fiscal year:
    2015
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
CRII: SHF: Optimizing Program Executions on Non-uniform Threaded Architectures
  • Award number:
    1464157
  • Fiscal year:
    2015
  • Award amount:
    $175,000
  • Award type:
    Standard Grant
SHF: Medium: Collaborative Research: Principled Optimizing Compilation of Dependently Typed Languages
  • Award number:
    1407790
  • Fiscal year:
    2014
  • Award amount:
    $175,000
  • Award type:
    Standard Grant