Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers

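Basic Information

  • Award number:
    2401246
  • Principal investigator:
  • Amount:
    $226,400
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Period:
    2023-10-01 to 2024-10-31
  • Status:
    Completed

Project Abstract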

Today's deep learning (DL) revolution is enabled by efficient deep neural network (DNN) training methods that capture important patterns within large quantities of data in compact, easily usable DNN models. DL methods are applied routinely to tasks like natural language translation and image labeling--and, in science and engineering, to problems as diverse as drug design, environmental monitoring, and fusion energy. Yet as data sizes increase and DL methods grow in sophistication, the time required to train new models often emerges as a major challenge. The Scalable Deep Learning (ScaDL) project will address this challenge by making it possible to use specialized high-performance computing (HPC) systems to train bigger models more rapidly. Efficient use of the thousands of powerful processors in modern HPC systems for DNN training has previously been stymied by communication costs that grow rapidly with the number of processors used. ScaDL will overcome this obstacle by developing new DNN training methods that reduce communication requirements by performing additional computation, by validating the effectiveness of these new methods in a range of scientific applications that use DL in different ways, and by integrating the new methods into scalable DL software for use by domain scientists, computer scientists, and engineers supporting DL application in HPC centers. By permitting the use of powerful HPC systems to train DNN models thousands of times faster than on a single computer, ScaDL will enable advances in many areas of science and engineering. The project will also contribute to educational outcomes by engaging PhD students in project goals, by using ScaDL tools in a new DL systems engineering class at the University of Chicago, and by enlisting participants in summer schools at the Texas Advanced Computing Center (TACC) and U. Chicago, both of which target recruitment of students from underserved communities at the graduate, undergraduate, and high-school levels, to apply the tools to scientific problems. ScaDL's focus on science applications and education aligns the project with NSF's mission of promoting the progress of science.

The ScaDL project contributes to science in two ways. First, it explores new techniques for enhancing the speed and scalability of commonly used optimization methods without losing model performance, by: 1) exploiting scalable algorithms for second-order information approximation; 2) developing methods for adapting to different computer hardware by tuning computation and communication to maximize training speed; 3) exploring compression techniques to reduce communication overheads; 4) using well-known benchmark applications to evaluate the convergence of ScaDL; and 5) applying its new algorithms and systems to science applications. Second, it will release an open-source implementation of the proposed algorithms and system. The implementation will be available on a variety of hardware platforms and capable of choosing the ratio of computation and communication required to make efficient use of the computation and communication hardware on a particular HPC system. The resulting algorithms and system will help disseminate ScaDL research results to a wide spectrum of research domains and users, and promote the adoption of the new methods in practical settings.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
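
The abstract mentions compression techniques that spend extra computation to reduce communication. As a rough illustration only, and not the project's actual method, the sketch below shows top-k gradient sparsification, one widely used compression scheme: each worker would send only the k largest-magnitude gradient entries and their indices. The function names `top_k_sparsify` and `densify` and the sample gradient are hypothetical.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k largest-magnitude entries; return their indices and values."""
    # Rank positions by absolute value, largest first (extra local computation).
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(order[:k])  # ascending indices make the message easy to merge
    return idx, [grad[i] for i in idx]


def densify(idx: List[int], vals: List[float], n: int) -> List[float]:
    """Rebuild a length-n dense gradient, filling dropped entries with zero."""
    out = [0.0] * n
    for i, v in zip(idx, vals):
        out[i] = v
    return out


# Hypothetical 6-element gradient; only 2 of 6 entries would be communicated.
grad = [0.1, -2.0, 0.03, 1.5, -0.2, 0.0]
idx, vals = top_k_sparsify(grad, k=2)
restored = densify(idx, vals, len(grad))
print(idx, vals, restored)
```

In this toy case the message shrinks from six values to two index-value pairs, at the cost of a local sort and of discarding the small entries.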

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Zhao Zhang

A Novel Technique for Dressing Fixed Abrasive Lapping Pad with Abrasive Water Jet
Tunable nonuniform sampling method for fast calculation and intensity modulation in 3D dynamic holographic display
  • DOI:
    10.1364/ol.38.002676
  • Year published:
    2013
  • Journal:
  • Impact factor:
    3.6
  • Authors:
    Zhao Zhang;Juan Liu;Jia Jia;Xin Li;Jian Han;Bin Hu;Yongtian Wang
  • Corresponding author:
    Yongtian Wang
Identification of Collective Viewpoints on Microblogs
Uncertainty analysis and robust design optimization for the heat-assisted bending of high-strength titanium tube
Traffic Congestion Pricing Based on Decision Tree
  • DOI:
    10.1061/41039(345)200
  • Year published:
    2009
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yamin Huo;Jian Chen;Zhao Zhang
  • Corresponding author:
    Zhao Zhang

Other grants held by Zhao Zhang

CAREER: Efficient and Scalable Large Foundational Model Training on Supercomputers for Science
  • Award number:
    2340011
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: Frameworks: hpcGPT: Enhancing Computing Center User Support with HPC-enriched Generative AI
  • Award number:
    2411294
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2312689
  • Fiscal year:
    2023
  • Funding amount:
    $226,400
  • Award type:
    Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2401244
  • Fiscal year:
    2023
  • Funding amount:
    $226,400
  • Award type:
    Continuing Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
  • Award number:
    2311766
  • Fiscal year:
    2023
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
  • Award number:
    2401245
  • Fiscal year:
    2023
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers
  • Award number:
    2106661
  • Fiscal year:
    2021
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Small: Efficient and Policy-driven Burst Buffer Sharing
  • Award number:
    2008388
  • Fiscal year:
    2020
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
  • Award number:
    1643271
  • Fiscal year:
    2016
  • Funding amount:
    $226,400
  • Award type:
    Continuing Grant
SHF: Medium: Collaborative Research: Architectural and System Support for Building Versatile Memory Systems
  • Award number:
    1514229
  • Fiscal year:
    2015
  • Funding amount:
    $226,400
  • Award type:
    Continuing Grant

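Similar NSFC (National Natural Science Foundation of China) grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Year approved:
    2024
  • Funding amount:
    CNY 0
  • Award type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Year approved:
    2012
  • Funding amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Year approved:
    2010
  • Funding amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Year approved:
    2008
  • Funding amount:
    CNY 240,000
  • Award type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Year approved:
    2007
  • Funding amount:
    CNY 450,000
  • Award type:
    General Program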

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
  • Award number:
    2403399
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Funding amount:
    $226,400
  • Award type:
    Standard Grant