Machine learning driven system level heterogeneous memory management for high-performance computing

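Basic Information

Grant number:
19K11993
Principal investigator:
GEROFI BALAZS
Amount:
$27,500
Host institution:
Institute of Physical and Chemical Research
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (C)
Fiscal year:
2019
Funding country:
Japan
Project period:
2019-04-01 to 2023-03-31
Project status:
Completed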

Project Summary

Results have been achieved in two parallel efforts of the project. We found that system-software-level heterogeneous memory management solutions that use machine learning, in particular non-supervised learning-based methods such as reinforcement learning, require rapid estimation of execution runtime as a function of the data layout across memory devices in order to explore different data placement strategies, which renders architecture-level simulators impractical for this purpose. We proposed a differential tracing-based approach that uses memory access traces obtained by high-frequency sampling-based methods (e.g., Intel's PEBS) on real hardware using different memory devices. We developed a runtime estimator based on such traces that provides an execution time estimate orders of magnitude faster than full-system simulators. On a number of HPC mini applications we showed that the estimator predicts runtime with an average error of 4.4% compared to measurements on real hardware.

For the deep learning data shuffling subtopic, we investigated the viability of partitioning the dataset among DL workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2048 GPUs of ABCI and 4096 compute nodes of Fugaku, we demonstrated that in practice the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provided an implementation in PyTorch that enables users to control the proposed data exchange scheme.
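The partial distributed exchange described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the project's PyTorch implementation: the function names `partition` and `partial_exchange`, the `fraction` parameter, and the pooling strategy used to redistribute samples are all hypothetical choices made for illustration.

```python
import random


def partition(num_samples, num_workers, seed=0):
    """Randomly split sample indices into one shard per worker."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[w::num_workers] for w in range(num_workers)]


def partial_exchange(shards, fraction, rng):
    """Simulate one epoch of partial exchange (hypothetical scheme).

    Each worker contributes `fraction` of its shard to a shared pool,
    the pool is shuffled, and each worker takes back as many samples
    as it contributed, so shard sizes stay constant.
    fraction=0.0 is purely local shuffling; fraction=1.0 amounts to
    a full global shuffle.
    """
    pool, kept, counts = [], [], []
    for shard in shards:
        local = shard[:]              # do not mutate the caller's shard
        rng.shuffle(local)
        k = int(fraction * len(local))
        pool.extend(local[:k])        # samples this worker sends out
        kept.append(local[k:])        # samples this worker keeps
        counts.append(k)
    rng.shuffle(pool)

    new_shards, start = [], 0
    for keep, k in zip(kept, counts):
        shard = keep + pool[start:start + k]
        rng.shuffle(shard)            # local shuffle before training
        new_shards.append(shard)
        start += k
    return new_shards


if __name__ == "__main__":
    rng = random.Random(42)
    shards = partition(num_samples=16, num_workers=4)
    for epoch in range(3):
        shards = partial_exchange(shards, fraction=0.25, rng=rng)
        print(epoch, shards)
```

In this sketch `fraction` plays the role of the tuning knob mentioned in the summary: smaller values reduce the amount of data moved between workers per epoch, at the cost of less mixing across shards. In a real distributed setting the pool would be replaced by collective or point-to-point communication between workers.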

Project Outcomes

Journal papers (8)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning

DOI:
10.1109/ipdps53621.2022.00109
Publication date:
2022-05
Journal:
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Impact factor:
0
Authors:
Thao Nguyen; François Trahay; Jens Domke; Aleksandr Drozd; Emil Vatai; Jianwei Liao; M. Wahib
Corresponding author:
Thao Nguyen; François Trahay; Jens Domke; Aleksandr Drozd; Emil Vatai; Jianwei Liao; M. Wahib

Directions for Operating Systems Research

DOI:
Publication date:
2021
Journal:
Impact factor:
0
Authors:
Fajardo-Diaz Juan L.; Morelos-Gomez Aaron; Cruz-Silva Rodolfo; Matsumoto Akito; Ueno Yutaka; Takeuchi Norihiro; Kitamura Kotaro; Miyakawa Hiroki; Tejima Syogo; Takeuchi Kenji; Tsuzuki Koichi; Endo Morinobu; 田中紘生，木原尚，安倍賢一; Balazs Gerofi
Corresponding author:
Balazs Gerofi

2020 SIAM Conference on Parallel Processing for Scientific Computing

DOI:
Publication date:
2020
Journal:
Impact factor:
0
Authors:
宮地英生; 川原慎太郎; Balazs Gerofi
Corresponding author:
Balazs Gerofi

Argonne National Laboratory (USA)

DOI:
Publication date:
Journal:
Impact factor:
0
Authors:
Corresponding author:

Towards Intelligent Management of Heterogeneous Memory: A Reinforcement Learning Approach

DOI:
Publication date:
2019
Journal:
Impact factor:
0
Authors:
宮地英生; 川原慎太郎; 廣渡祥太，木原尚，安倍賢一; Balazs Gerofi
Corresponding author:
Balazs Gerofi

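Similar NSFC Grants

Evolutionary Algorithm Theory for NP-hard Problems: Approximation Performance and Stochastic Running Time Analysis

Grant number:
61906062
Approval year:
2019
Funding amount:
CNY 240,000
Project category:
Young Scientists Fund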