Multi-Agent Deep Reinforcement Learning-Based Resource Allocation in HPC/AI Converged Cluster

Abstract: As the complexity of deep learning (DL) networks and the volume of training data grow enormously, methods that scale with computation are becoming the future of artificial intelligence (AI) development. In this regard, the interplay between machine learning (ML) and high-performance computing (HPC) is an innovative paradigm for improving the efficiency of AI research and development. However, building and operating an HPC/AI converged system requires broad knowledge to leverage the latest computing, networking, and storage technologies. Moreover, an HPC-based AI computing environment needs an appropriate resource allocation and monitoring strategy to utilize the system resources efficiently. To address these needs, we introduce a technique for building and operating a high-performance AI computing environment with the latest technologies. Specifically, an HPC/AI converged system, called the GIST AI-X computing cluster, is configured inside the Gwangju Institute of Science and Technology (GIST). It is built by leveraging the latest Nvidia DGX servers, high-performance storage and networking devices, and various open-source tools. It can therefore serve as a reference for building a small or medium-sized HPC/AI converged system at research and educational institutes. In addition, we propose a resource allocation method for DL jobs that uses multi-agent deep reinforcement learning (mDRL) to utilize the computing resources efficiently. Through extensive simulations and experiments, we validate that the proposed mDRL algorithm helps the HPC/AI converged cluster improve both system utilization and power consumption. By deploying the proposed resource allocation method on the system, total job completion time is reduced by around 20% and inefficient power consumption is reduced by around 40%.

Authors: Jargalsaikhan Narantuya, Jun-Sik Shin, Sun Park, JongWon Kim

Affiliation: Department of Cloud AI Graduate School

Source: Computers, Materials & Continua (SCIE, EI), 2022, Issue 9, pp. 4375-4395 (21 pages)

Keywords: Deep learning; HPC/AI converged cluster; reinforcement learning

Classification code: TP3 [Automation and Computer Technology: Computer Science and Technology]
