ASCENT: Collaborative Research: Scaling Distributed AI Systems based on Universal Optical I/O
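Basic information
- Award number: 2023861
- Principal investigator:
- Amount: $650,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Duration: 2020-08-15 to 2023-07-31
- Status: Completed
- Source:
- Keywords:
Abstract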
Our society is rapidly becoming reliant on neural-network-based artificial intelligence computation. New algorithms are invented daily, increasing the memory and computational requirements for both inference and training. This explosive growth has created an enormous demand for distributed machine learning (ML) training and inference. Estimates by OpenAI show computational requirements growing steadily by 100x every two years since 2012, which is 50x faster than the rate of improvement previously delivered by Moore's Law in the semiconductor industry over the last half-century. This new computation demand has been partly met by the rapid development of hardware accelerators and software stacks that support these specialized computations. Hardware accelerators have provided significant speed-up, but today's training tasks can still take days or even weeks. The reason is that as the number of workers (e.g., compute nodes) increases, the computation time per worker decreases but the communication requirements between the nodes increase, creating a bottleneck in the interconnect between the compute nodes. Future distributed ML systems will require 1-2 orders of magnitude higher interconnect bandwidth per node, creating a pressing need for entirely new ways to build interconnects for distributed ML systems. This proposal aims to create a new paradigm for scaling distributed ML computation by developing a scalable interconnect solution based on advances in integrated electronic and photonic technology that enable direct node-to-node optical fiber connectivity. The proposed cross-stack, collaborative, multi-disciplinary work will enable the education and training of a unique cohort of engineers and scientists who cross the boundaries of machine learning, networking, and electronic-photonic systems and devices, and who are in severe demand.
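The scaling bottleneck described above can be illustrated with a toy cost model. The sketch below assumes data-parallel training with a bandwidth-bound ring all-reduce of the gradients; that choice is an assumption made here for illustration and is not stated in the abstract. The function names and all numeric values (8 s of single-worker compute, a 4 GB gradient, 100 Gbps links) are hypothetical.

```python
def step_time(num_workers, compute_s, grad_bytes, link_bytes_per_s):
    """Return (compute_seconds, comm_seconds) for one training step.

    Toy model, not a measurement:
      - compute is split evenly, so each worker spends compute_s / N;
      - communication follows the bandwidth term of a ring all-reduce,
        2 * (N - 1) / N * grad_bytes / link_bytes_per_s (assumed here).
    """
    compute = compute_s / num_workers
    if num_workers == 1:
        comm = 0.0  # a single worker has nothing to exchange
    else:
        comm = 2 * (num_workers - 1) / num_workers * grad_bytes / link_bytes_per_s
    return compute, comm


def comm_fraction(num_workers, compute_s, grad_bytes, link_bytes_per_s):
    """Fraction of the step spent communicating (no compute/comm overlap assumed)."""
    compute, comm = step_time(num_workers, compute_s, grad_bytes, link_bytes_per_s)
    return comm / (compute + comm)


if __name__ == "__main__":
    # Hypothetical workload: 8 s of compute on one worker, 4 GB of gradients,
    # 100 Gbps links (12.5e9 bytes per second).
    for n in (1, 8, 64, 512):
        frac = comm_fraction(n, 8.0, 4e9, 12.5e9)
        print(f"workers={n:4d}  communication share of step={frac:.2f}")
```

Under these assumptions the compute term shrinks as 1/N while the communication term approaches a constant set by link bandwidth, so the communication share of each step grows with the worker count. Raising `link_bytes_per_s` by one or two orders of magnitude is the lever the abstract argues for.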
The principal investigators have an established track record of direct engagement with high-school students through summer internships at the Berkeley Wireless Research Center and MIT's Women's Technology Program, as well as exemplary undergraduate research activities at Boston University. The educational and outreach activities the PIs have put in place will ensure early exposure and continued training of a new generation of leaders in this field, from K-12 through undergraduate and graduate studies to continuing workforce education, with a special focus on underrepresented students.
The interconnect has emerged as the key bottleneck in realizing the full potential of distributed ML. Future ML workloads are likely to require tens of Tbps of bandwidth per device. Ubiquitous deployment of logically connected, physically distributed computation across shelf, rack, and row scale can only be enabled by a new universal I/O that provides socket-to-socket communication at the energy, latency, and bandwidth density of in-package interconnects. No such technology currently exists. Silicon-photonics-based optical I/O has the potential to address this critical challenge, but fundamental advances, from chip manufacturing to routing algorithms, are still needed to ensure the scalability of these interconnect systems. To achieve high bandwidth density and energy efficiency, dense wavelength division multiplexing must be used. High-efficiency ring-resonator-based modulators and comb laser sources are needed to enable Tbps rates over each fiber connection and socket bandwidth scaling from 10s to 100s of Tbps. New link architectures, such as the proposed laser-forwarded coherent link, are needed to enable high-efficiency external centralized comb laser sources with modest (sub-mW) power per wavelength per fiber port.
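The bandwidth targets above follow from simple multiplication: wavelengths per fiber times data rate per wavelength gives the per-fiber rate, and fibers per socket scale that to the socket total. The sketch below works through the arithmetic. The class name and every numeric value (64 wavelengths, 16 Gbps per wavelength, 0.5 mW per wavelength, 16 or 128 fibers) are hypothetical placeholders chosen to land in the ranges the abstract mentions; they are not figures from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DwdmSocket:
    """Back-of-the-envelope model of a DWDM optical I/O socket (illustrative only)."""
    fibers: int                  # fiber ports per socket (hypothetical)
    wavelengths_per_fiber: int   # DWDM channels carried on each fiber (hypothetical)
    gbps_per_wavelength: float   # modulation rate per channel (hypothetical)
    mw_per_wavelength: float     # laser power per wavelength per fiber port (hypothetical)

    def fiber_tbps(self) -> float:
        """Aggregate rate of one fiber in Tbps."""
        return self.wavelengths_per_fiber * self.gbps_per_wavelength / 1000.0

    def socket_tbps(self) -> float:
        """Aggregate rate of the whole socket in Tbps."""
        return self.fibers * self.fiber_tbps()

    def laser_mw_per_fiber(self) -> float:
        """Total comb laser power delivered to one fiber port in mW."""
        return self.wavelengths_per_fiber * self.mw_per_wavelength


if __name__ == "__main__":
    for fibers in (16, 128):
        s = DwdmSocket(fibers=fibers, wavelengths_per_fiber=64,
                       gbps_per_wavelength=16.0, mw_per_wavelength=0.5)
        print(f"fibers={fibers:3d}  per-fiber={s.fiber_tbps():.3f} Tbps  "
              f"socket={s.socket_tbps():.1f} Tbps  "
              f"laser per fiber={s.laser_mw_per_fiber():.1f} mW")
```

With these placeholder values one fiber carries about 1 Tbps, and moving from 16 to 128 fibers takes the socket from the 10s to the 100s of Tbps, while sub-mW power per wavelength keeps the laser power per fiber port in the tens of mW.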
The proposed work will also develop new scheduling algorithms, network architectures, and workload parallelism strategies to leverage the bandwidth density and low latency of the universal optical I/O, mapping large AI workloads with massive datasets onto a scalable distributed compute system.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Vladimir Stojanovic
End-to-end multi-scale residual network with parallel attention mechanism for fault diagnosis under noise and small samples
- DOI: 10.1016/j.isatra.2024.12.023
- Published: 2025-02-01
- Journal:
- Impact factor: 6.500
- Authors: Yawei Sun; Hongfeng Tao; Vladimir Stojanovic
- Corresponding author: Vladimir Stojanovic
Fault-tolerant control of a hydraulic servo actuator via adaptive dynamic programming
- DOI: 10.3934/mmc.2023016
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Vladimir Stojanovic
- Corresponding author: Vladimir Stojanovic
Blood pressure cut-offs to diagnose impending hypertensive emergency depend on previous hypertension-mediated organ damage and comorbid conditions.
- DOI: 10.25259/nmji_160_21
- Published: 2024
- Journal:
- Impact factor: 0.4
- Authors: Goran Koraćević; Milovan Stojanovic; D. Lovic; Tomislav Kostić; Miloje Tomasevic; S. S. Martinovic; S. C. Zdravkovic; M. Koraćević; Vladimir Stojanovic
- Corresponding author: Vladimir Stojanovic
Quantized control for interconnected PDE systems via mobile measurement and control strategies
- DOI: 10.1016/j.jfranklin.2024.107070
- Published: 2024-09-01
- Journal:
- Impact factor:
- Authors: Danjing Zheng; Xiaona Song; Shuai Song; Vladimir Stojanovic
- Corresponding author: Vladimir Stojanovic
Finite-time asynchronous dissipative filtering of conic-type nonlinear Markov jump systems
- DOI: 10.1007/s11432-020-2913-x
- Published: 2021-03
- Journal:
- Impact factor: 0
- Authors: Xiang Zhang; Shuping He; Vladimir Stojanovic; Xiaoli Luan; Fei Liu
- Corresponding author: Fei Liu
Other grants by Vladimir Stojanovic
FuSe-TG: Electronic-Photonic Systems-on-Chip for Computation, Communication and Sensing
- Award number: 2235466
- Fiscal year: 2023
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: FuSe: Collaborative Optically Disaggregated Arrays of Extreme-MIMO Radio Units (CODAeMIMO)
- Award number: 2328945
- Fiscal year: 2023
- Amount: $650,000
- Award type: Continuing Grant
OuSense: Electronic-Photonic System-on-Chip for Real-time Endoscopic Ultrasound 3D Imaging
- Award number: 2128402
- Fiscal year: 2021
- Amount: $650,000
- Award type: Standard Grant
OP: Collaborative Research: Coherent Integrated Si-Photonic Links
- Award number: 1611296
- Fiscal year: 2016
- Amount: $650,000
- Award type: Standard Grant
Energy-Efficient Compressed Sensing: A joint Algorithmic/Implementation Approach Using Deterministic Sensing
- Award number: 1363447
- Fiscal year: 2013
- Amount: $650,000
- Award type: Standard Grant
Energy-Efficient Compressed Sensing: A joint Algorithmic/Implementation Approach Using Deterministic Sensing
- Award number: 1128226
- Fiscal year: 2011
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: Energy-efficient communication with optimized ECC decoders: Connecting Algorithms and Implementations
- Award number: 0725555
- Fiscal year: 2007
- Amount: $650,000
- Award type: Continuing Grant
Similar overseas grants
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
- Award number: 2245910
- Fiscal year: 2022
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: How faithfully are melt embayments wedded to magma ascent?
- Award number: 2221896
- Fiscal year: 2022
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
- Award number: 2128540
- Fiscal year: 2021
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: SWIFT: Context-aware Spectrum Coexistence dEsign aNd implemenTation in satellite bands (ASCENT)
- Award number: 2128584
- Fiscal year: 2021
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: Volatile sources, eruption triggers, and magma ascent rates for mafic alkaline magmas at Nyiragongo and Nyamulagira volcanoes, DR Congo, East African Rift
- Award number: 2043067
- Fiscal year: 2021
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: Volatile sources, eruption triggers, and magma ascent rates for mafic alkaline magmas at Nyiragongo and Nyamulagira volcanoes, DR Congo, East African Rift
- Award number: 2043066
- Fiscal year: 2021
- Amount: $650,000
- Award type: Standard Grant
ASCENT: Collaborative Research: Scaling Distributed AI Systems based on Universal Optical I/O
- Award number: 2023468
- Fiscal year: 2020
- Amount: $650,000
- Award type: Standard Grant
Collaborative Research: How faithfully are melt embayments wedded to magma ascent?
- Award number: 2015424
- Fiscal year: 2020
- Amount: $650,000
- Award type: Standard Grant
ASCENT: Collaborative Research: Scaling Distributed AI Systems based on Universal Optical I/O
- Award number: 2023751
- Fiscal year: 2020
- Amount: $650,000
- Award type: Standard Grant
ASCENT: Collaborative Research: Programmable Photonic Computation Accelerators (PPCA)
- Award number: 2023780
- Fiscal year: 2020
- Amount: $650,000
- Award type: Standard Grant













