Abstract
As distributed systems grow in scale and computational complexity, both the Mean Time to Repair (MTTR) of distributed computing and the communication overhead incurred by fault-tolerant computing keep rising. To address these problems, this paper combines distributed coding computing with replica redundancy and proposes a new fault-tolerant algorithm. Map nodes apply the idea of distributed coding computing: data is redundantly assigned to multiple computing nodes, which create coded intermediate results and thereby reduce the amount of data the computing nodes transmit in the shuffle phase. Reduce nodes decode the coded intermediate results they receive, which both verifies the correctness of the intermediate results and yields the final computing result. Experimental results show that, in a MapReduce-based distributed computing framework, the proposed algorithm completes fault-tolerant computation while reducing communication overhead and MTTR compared with the Triple Modular Redundancy (TMR) and two-stage TMR fault-tolerant algorithms, and that it improves the availability and reliability of the distributed system.
Authors
ZHANG Ji (张基)
XIE Zaipeng (谢在鹏)
MAO Yingchi (毛莺池)
XU Yuanyuan (徐媛媛)
ZHU Xiaorui (朱晓瑞)
LI Bowen (李博文)
School of Computer and Information, Hohai University, Nanjing 211100, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core (北大核心)
2021, Issue 4, pp. 173-179 (7 pages)
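Funding
Key Program of the National Natural Science Foundation of China (61832005)
National Key Research and Development Program of China (2016YFC0402710)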
Keywords
distributed system
distributed computing
fault-tolerant algorithm
distributed coding computing
Triple Modular Redundancy (TMR)