Abstract
As deep learning (DL) networks and training data grow enormously in complexity, methods that scale with computation are becoming the future of artificial intelligence (AI) development. The interplay between machine learning (ML) and high-performance computing (HPC) is therefore a promising paradigm for improving the efficiency of AI research and development. However, building and operating an HPC/AI converged system requires broad knowledge of the latest computing, networking, and storage technologies. Moreover, an HPC-based AI computing environment needs an appropriate resource allocation and monitoring strategy to use the system resources efficiently. To address these needs, we introduce a technique for building and operating a high-performance AI computing environment with the latest technologies. Specifically, we configure an HPC/AI converged system inside the Gwangju Institute of Science and Technology (GIST), called the GIST AI-X computing cluster, which is built from the latest Nvidia DGX servers, high-performance storage and networking devices, and various open-source tools. It can thus serve as a reference for research and educational institutes building a small or medium-sized HPC/AI converged system. In addition, we propose a resource allocation method for DL jobs that uses multi-agent deep reinforcement learning (mDRL) to utilize the computing resources efficiently. Through extensive simulations and experiments, we validate that the proposed mDRL algorithm helps the HPC/AI converged cluster improve both system utilization and power consumption. When the proposed resource allocation method is deployed on the system, total job completion time is reduced by around 20% and inefficient power consumption is reduced by around 40%.