Abstract
Checkpointing is an important fault-tolerance technique in high-performance parallel computing systems, but as system scale grows, the scalability of parallel checkpointing is constrained by file access overhead. Targeting the multi-level file system architecture of large-scale parallel computing systems, this paper proposes a cache-style parallel checkpointing technique. It converts globally coordinated parallel checkpointing into local file operations and exploits the multi-processor structure to schedule write-back in an out-of-order pipelined fashion, so that checkpoint write-back times are spread out sensibly. This effectively hides the write-back overhead and ensures high performance and high scalability of file access for parallel checkpointing.
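The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `CacheStyleCheckpointer`, its methods, the file naming scheme, and the use of two plain directories to stand in for the local and global levels of a multi-level file system are all assumptions made for illustration. Each checkpoint is a purely local file write, and a small pool of background threads later copies the files to the global level, so flushes finish in no fixed order. The paper's actual write-back scheduling policy is not described in this record.

```python
import os
import queue
import shutil
import tempfile
import threading


class CacheStyleCheckpointer:
    """Hypothetical sketch: checkpoint locally, write back to global storage later."""

    def __init__(self, local_dir, global_dir, workers=2):
        self.local_dir = local_dir      # stands in for the fast, node-local level
        self.global_dir = global_dir    # stands in for the shared, global level
        self._pending = queue.Queue()   # checkpoints waiting for write-back
        # Several flusher threads: write-backs may complete out of order.
        for _ in range(workers):
            threading.Thread(target=self._flush_loop, daemon=True).start()

    def checkpoint(self, rank, step, data):
        """Fast path: a local file write only, then computation can resume."""
        name = f"ckpt_r{rank}_s{step}.bin"
        path = os.path.join(self.local_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        self._pending.put(name)         # defer the expensive global write
        return path

    def _flush_loop(self):
        """Background write-back from the local level to the global level."""
        while True:
            name = self._pending.get()
            try:
                src = os.path.join(self.local_dir, name)
                tmp = os.path.join(self.global_dir, name + ".tmp")
                shutil.copyfile(src, tmp)
                # Rename last so a partially copied file is never visible
                # under its final name.
                os.replace(tmp, os.path.join(self.global_dir, name))
            finally:
                self._pending.task_done()

    def wait_for_flush(self):
        """Block until every queued checkpoint has been written back."""
        self._pending.join()


def demo():
    """Checkpoint four simulated ranks and list what reached global storage."""
    with tempfile.TemporaryDirectory() as local_dir, \
            tempfile.TemporaryDirectory() as global_dir:
        ckpt = CacheStyleCheckpointer(local_dir, global_dir)
        for rank in range(4):
            ckpt.checkpoint(rank, step=0, data=bytes([rank]) * 16)
        ckpt.wait_for_flush()
        return sorted(os.listdir(global_dir))
```

In this sketch the caller of `checkpoint` pays only for the local write; the cost of the global write is overlapped with later computation, which is the sense in which the write-back overhead is hidden.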
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2011, No. 5, pp. 287-289, F0003 (4 pages)
Funding
Supported by the Open Fund of the State Key Laboratory of High-end Server & Storage Technology (2009HSSA04)
Keywords
Cache-style checkpointing
Parallel computing
Multi-level file system
Multi-processor
Out-of-order pipeline