Collaborative Research: OAC Core: ScaDL: New Approaches to Scaling Deep Learning for Science Applications on Supercomputers
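Basic information
- Award number: 2107511
- Principal investigator:
- Amount: $271,600
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2021
- Funding country: United States
- Period: 2021-10-01 to 2024-09-30
- Project status: Completed
- Source:
- Keywords:
Project abstract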
Today's deep learning (DL) revolution is enabled by efficient deep neural network (DNN) training methods that capture important patterns within large quantities of data in compact, easily usable DNN models. DL methods are applied routinely to tasks like natural language translation and image labeling--and, in science and engineering, to problems as diverse as drug design, environmental monitoring, and fusion energy. Yet as data sizes increase and DL methods grow in sophistication, the time required to train new models often emerges as a major challenge. The Scalable Deep Learning (ScaDL) project will address this challenge by making it possible to use specialized high-performance computing (HPC) systems to train bigger models more rapidly. Efficient use of the thousands of powerful processors in modern HPC systems for DNN training has previously been stymied by communication costs that grow rapidly with the number of processors used. ScaDL will overcome this obstacle by developing new DNN training methods that reduce communication requirements by performing additional computation, by validating the effectiveness of these new methods in a range of scientific applications that use DL in different ways, and by integrating the new methods into scalable DL software for use by domain scientists, computer scientists, and engineers supporting DL application in HPC centers. By permitting the use of powerful HPC systems to train DNN models thousands of times faster than on a single computer, ScaDL will enable advances in many areas of science and engineering. The project will also contribute to educational outcomes by engaging PhD students in project goals, by using ScaDL tools in a new DL systems engineering class at the University of Chicago, and by enlisting participants in summer schools at the Texas Advanced Computing Center (TACC) and U. Chicago, both of which target recruitment of students from underserved communities at the graduate, undergraduate, and high-school levels, to apply the tools to scientific problems. ScaDL's focus on science applications and education aligns the project with NSF's mission of promoting the progress of science.
The ScaDL project contributes to science in two ways. First, it explores new techniques for enhancing the speed and scalability of commonly used optimization methods without losing model performance, by: 1) exploiting scalable algorithms for second-order information approximation; 2) developing methods for adapting to different computer hardware by tuning computation and communication to maximize training speed; 3) exploring compression techniques to reduce communication overheads; 4) using well-known benchmark applications to evaluate the convergence of ScaDL; and 5) applying its new algorithms and systems to science applications. Second, it will release an open-source implementation of the proposed algorithms and system. The implementation will be available on a variety of hardware platforms and capable of choosing the ratio of computation and communication required to make efficient use of the computation and communication hardware on a particular HPC system. The resulting algorithms and system will help disseminate ScaDL research results to a wide spectrum of research domains and users, and promote the adoption of the new methods in practical settings.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Ian Foster
GreenFaaS: Maximizing Energy Efficiency of HPC Workloads with FaaS
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Alok V. Kamatar;Valerie Hayot;Y. Babuji;André Bauer;Gourav Rattihalli;Ninad Hogade;D. Milojicic;Kyle Chard;Ian Foster
- Corresponding author: Ian Foster
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: S. Song;Bonnie Kruft;Minjia Zhang;Conglong Li;Shiyang Chen;Chengming Zhang;Masahiro Tanaka;Xiaoxia Wu;Jeff Rasley;A. A. Awan;Connor Holmes;Martin Cai;Adam Ghanem;Zhongzhu Zhou;Yuxiong He;Christopher Bishop;Max Welling;Tie;Christian Bodnar;Johannes Brandsetter;W. Bruinsma;Chan Cao;Yuan Chen;Peggy Dai;P. Garvan;Liang He;E. Heider;Pipi Hu;Peiran Jin;Fusong Ju;Yatao Li;Chang Liu;Renqian Luo;Qilong Meng;Frank Noé;Tao Qin;Janwei Zhu;Bin Shao;Yu Shi;Wen;Gregor Simm;Megan Stanley;Lixin Sun;Yue Wang;Tong Wang;Zun Wang;Lijun Wu;Yingce Xia;Leo Xia;Shufang Xie;Shuxin Zheng;Jianwei Zhu;Pete Luferenko;Divya Kumar;Jonathan Weyn;Ruixiong Zhang;Sylwester Klocek;V. Vragov;Mohammed Alquraishi;Gustaf Ahdritz;C. Floristean;Cristina Negri;R. Kotamarthi;V. Vishwanath;Arvind Ramanathan;Sam Foreman;Kyle Hippe;T. Arcomano;R. Maulik;Max Zvyagin;Alexander Brace;Bin Zhang;Cindy Orozco Bohorquez;Austin R. Clyde;B. Kale;Danilo Perez;Heng Ma;Carla M. Mann;Michael Irvin;J. G. Pauloski;Logan Ward;Valerie Hayot;M. Emani;Zhen Xie;Diangen Lin;Maulik Shukla;Thomas Gibbs;Ian Foster;James J. Davis;M. Papka;Thomas Brettin;Prasanna Balaprakash;Gina Tourassi;John P. Gounley;Heidi Hanson;T. Potok;Massimiliano Lupo Pasini;Kate Evans;Dan Lu;D. Lunga;Junqi Yin;Sajal Dash;Feiyi Wang;M. Shankar;Isaac Lyngaas;Xiao Wang;Guojing Cong;Peifeng Zhang;Ming Fan;Siyan Liu;A. Hoisie;Shinjae Yoo;Yihui Ren;William Tang;K. Felker;Alexey Svyatkovskiy;Hang Liu;Ashwin Aji;Angela Dalton;Michael Schulte;Karl Schulz;Yuntian Deng;Weili Nie;Josh Romero;Christian Dallago;Arash Vahdat;Chaowei Xiao;Anima Anandkumar;R. Stevens
- Corresponding author: R. Stevens
An optical microscopy system for 3D dynamic imaging
- DOI:
- Published: 2007
- Journal:
- Impact factor: 0
- Authors: R. Hudson;John N. Aarsvold;Chin;Jie Chen;Peter Davies;T. Disz;Ian Foster;Melvin Griem;Man K Kwong;B. Lin
- Corresponding author: B. Lin
Review of low-cost self-driving laboratories in chemistry and materials science: the “frugal twin” concept
- DOI: 10.1039/d3dd00223c
- Published: 2024-05-15
- Journal:
- Impact factor: 5.600
- Authors: Stanley Lo;Sterling G. Baird;Joshua Schrier;Ben Blaiszik;Nessa Carson;Ian Foster;Andrés Aguilar-Granda;Sergei V. Kalinin;Benji Maruyama;Maria Politi;Helen Tran;Taylor D. Sparks;Alán Aspuru-Guzik
- Corresponding author: Alán Aspuru-Guzik
Exploring Benchmarks for Self-Driving Labs using Color Matching
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Tobias Ginsburg;Kyle Hippe;Ryan Lewis;Aileen Cleary;D. Ozgulbas;Rory Butler;Casey Stone;Abraham Stroka;Rafael Vescovi;Ian Foster
- Corresponding author: Ian Foster
Other grants by Ian Foster
Collaborative Research: NSF Workshop on Automated, Programmable and Self Driving Labs
- Award number: 2335910
- Fiscal year: 2023
- Award amount: $271,600
- Project type: Standard Grant
Frameworks: Garden: A FAIR Framework for Publishing and Applying AI Models for Translational Research in Science, Engineering, Education, and Industry
- Award number: 2209892
- Fiscal year: 2022
- Award amount: $271,600
- Project type: Standard Grant
NSF Convergence Accelerator Track D: The Data Hypervisor: Orchestrating Data and Models
- Award number: 2040718
- Fiscal year: 2020
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: Frameworks: funcX: A Function Execution Service for Portability and Performance
- Award number: 2004894
- Fiscal year: 2020
- Award amount: $271,600
- Project type: Standard Grant
Virtual Data Set Services Enabling New Science at NSF Facilities
- Award number: 1841531
- Fiscal year: 2018
- Award amount: $271,600
- Project type: Standard Grant
Framework: Software: HDR Globus Automate: A Distributed Research Automation Platform
- Award number: 1835890
- Fiscal year: 2018
- Award amount: $271,600
- Project type: Standard Grant
EAGER: Designing the OSN Software Platform
- Award number: 1836357
- Fiscal year: 2018
- Award amount: $271,600
- Project type: Standard Grant
BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate
- Award number: 1636950
- Fiscal year: 2017
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: CyberSEES:Type 2: Framework to Advance Climate, Economics, and Impact Investigations with Information Technology (FACE-IT)
- Award number: 1331922
- Fiscal year: 2013
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: Managing Cloud Usage Allocation and Accounting for the NSF Community
- Award number: 1250555
- Fiscal year: 2012
- Award amount: $271,600
- Project type: Standard Grant
Similar NSFC grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Award amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Award amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Award amount: CNY 240,000
- Project type: Special fund project
Cell Research (细胞研究)
- Award number: 30824808
- Approval year: 2008
- Award amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Award amount: CNY 450,000
- Project type: General Program
Similar overseas grants
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Award amount: $271,600
- Project type: Standard Grant