Scalable Hybrid-parallelism Design for Mega-Size Deep Learning Model


Basic Information

Project Abstract

This year, we developed new methods that reduce computing time by eliminating unimportant samples during training (submitted to ICML 2023). In our previous work (IPDPS 2022), we found that local shuffling could not achieve good accuracy in large-scale training because of non-IID data and overfitting. We address the non-IID problem by dynamically assigning an impact factor to the models from different workers, and we use knowledge distillation to address overfitting. This work was a Best Paper Award finalist at CCGRID 2023. We also studied how to reduce communication time by co-designing the collective communication algorithm with the intra-node network architecture (accepted by JPDC, a Q1 journal) and with the inter-node network architecture (poster at HPCA-Asia 2023).

Project Outcomes

Journal papers (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Effective Switchless Inter-FPGA Memory Networks
Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning
Efficient Allreduce Algorithm for Large-Scale Deep Learning on Distributed Loop Networks
  • DOI:
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Truong Thao Nguyen; Peng Chen; Yusuke Tanimura
  • Corresponding author:
    Yusuke Tanimura
Scalable Low-Latency Inter-FPGA Networks
CADIS: Handling Cluster-skewed Non-IID Data in Federated Learning with Clustered Aggregation and Knowledge DIStilled Regularization

Other publications by Nguyen Truong

Letter to the Editor of Journal of Otolaryngology regarding “Risk of diabetes in patients with sleep apnea: comparison of surgery versus CPAP in a long-term follow-up study”
  • DOI:
    10.1186/s40463-023-00662-5
  • Published:
    2023-09-19
  • Journal:
  • Impact factor:
    2.200
  • Authors:
    Nguyen Truong; Bao Sciscent; F. Jeffrey Lorenz; David Goldrich; Neerav Goyal
  • Corresponding author:
    Neerav Goyal
Privacy preservation in federated learning: An insightful survey from the GDPR perspective
  • DOI:
    10.1016/j.cose.2021.102402
  • Published:
    2021-07-30
  • Journal:
  • Impact factor:
    5.6
  • Authors:
    Nguyen Truong; Sun, Kai; Guo, YiKe
  • Corresponding author:
    Guo, YiKe
789 - A MIXED METHODS: PROJECT INVESTIGATING PATIENTS’ PREFERENCES FOR DIGITAL APPLICATIONS IN OSTEOARTHRITIS MANAGEMENT
  • DOI:
    10.1016/j.joca.2024.02.803
  • Published:
    2024-04-01
  • Journal:
  • Impact factor:
  • Authors:
    Nguyen Truong; Tanja Stamm
  • Corresponding author:
    Tanja Stamm
A solution of obstacle collision avoidance for robotic fish based on fuzzy systems
A Method for Controlling Wheelchair Using Hand Gesture Recognition


Similar overseas grants

CIF: Small: Numerically-Stable Large-Scale Coded Distributed Computing
  • Grant number:
    2008714
  • Fiscal year:
    2020
  • Funding amount:
    $30,000
  • Project type:
    Standard Grant
Characterizing straggler delays in large-scale distributed computing clouds
  • Grant number:
    511426-2017
  • Fiscal year:
    2017
  • Funding amount:
    $30,000
  • Project type:
    University Undergraduate Student Research Awards
II-NEW: Distributed Computing Laboratory for Large Scale System Modeling and Analysis
  • Grant number:
    1205413
  • Fiscal year:
    2012
  • Funding amount:
    $30,000
  • Project type:
    Standard Grant
CAREER: Data-aware Distributed Computing for Enabling Large-scale Collaborative Science
  • Grant number:
    1131889
  • Fiscal year:
    2011
  • Funding amount:
    $30,000
  • Project type:
    Continuing Grant
CAREER: Data-aware Distributed Computing for Enabling Large-scale Collaborative Science
  • Grant number:
    0846052
  • Fiscal year:
    2009
  • Funding amount:
    $30,000
  • Project type:
    Continuing Grant
Study on large-scale scalable P2P grid infrastructure for large-capacity distributed computing
  • Grant number:
    17200002
  • Fiscal year:
    2005
  • Funding amount:
    $30,000
  • Project type:
    Grant-in-Aid for Scientific Research (A)
Programming systems for large scale distributed computing
  • Grant number:
    121667-1999
  • Fiscal year:
    2002
  • Funding amount:
    $30,000
  • Project type:
    Discovery Grants Program - Individual
A large-scale ensemble-based ab-inito protein folding system using distributed computing
  • Grant number:
    216936-1999
  • Fiscal year:
    2001
  • Funding amount:
    $30,000
  • Project type:
    Discovery Grants Program - Individual
Programming systems for large scale distributed computing
  • Grant number:
    121667-1999
  • Fiscal year:
    2001
  • Funding amount:
    $30,000
  • Project type:
    Discovery Grants Program - Individual
Programming systems for large scale distributed computing
  • Grant number:
    121667-1999
  • Fiscal year:
    2000
  • Funding amount:
    $30,000
  • Project type:
    Discovery Grants Program - Individual