Abstract
Fault tolerance has long been a prominent and difficult problem in high-performance computing (HPC). Checkpointing is a common fault-tolerance technique that dumps the state of a running process to files and later restores the process from them. Containers offer strong resource isolation, which makes them a better runtime environment and carrier for checkpointing: they help prevent a migrated task from failing because its environment and resources changed when it moved to a different node. Combining containers with checkpointing can therefore better support the research and implementation of task migration. This paper focuses on the design and optimization of a checkpointing scheme for Singularity containers based on CRIU (Checkpoint/Restore In Userspace). Based on the characteristics of checkpointing in HPC container applications, it presents effective solutions for three problems: using CRIU securely, optimizing migration performance, and preserving network state. Building on these solutions, the paper extends Singularity with checkpointing functionality and implements a prototype tool, Migrator, to evaluate container migration performance. The authors hope this work will provide effective support for subsequent implementations of HPC task migration.
Authors
陈轶阳
王小宁
闫晓婷
李冠龙
赵一宁
卢莎莎
肖海力
CHEN Yiyang; WANG Xiaoning; YAN Xiaoting; LI Guanlong; ZHAO Yining; LU Shasha; XIAO Haili (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journals (北大核心)
2024, Issue 9, pp. 40-50 (11 pages)
Funding
National Key Research and Development Program of China, Youth Project (2021YFB0300800).
Keywords
Container
Checkpoint
High performance computing
Live migration
Fault tolerance