Abstract
To address the inability of the stream data processing system Flink to handle single points of failure efficiently, Flink+, a fault-tolerant system based on incremental state and backup, was proposed. Firstly, backup operators and data paths were established in advance. Secondly, the output data in the dataflow graph was cached, with spilling to disk when necessary. Thirdly, task state was synchronized during system snapshots. Finally, when a failure occurred, backup tasks and the cached data were used to resume computation. In experimental tests, Flink+ did not add significant fault-tolerance overhead during fault-free operation. When handling single points of failure in single-machine and distributed environments, compared with Flink, the proposed system reduced failure recovery time by 96.98% at a single-machine parallelism of 8 tasks and by 88.75% at a distributed parallelism of 16 tasks. Experimental results show that combining incremental state with backup can effectively reduce the recovery time from single points of failure in stream processing systems and enhance system robustness.
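The four steps in the abstract can be illustrated with a toy, single-process sketch. It is not the Flink+ implementation and uses no Flink API: `UpstreamBuffer`, `CountTask`, `snapshot`, `recover`, and `demo` are hypothetical names introduced here. As simplifying assumptions, the cache is kept in memory only (no disk spill), and the state is copied in full at each snapshot rather than incrementally.

```python
from collections import deque
from copy import deepcopy


class UpstreamBuffer:
    """Hypothetical cache of records emitted since the last snapshot (memory only)."""

    def __init__(self):
        self._records = deque()

    def emit(self, record):
        # Cache the record before handing it to the downstream task.
        self._records.append(record)
        return record

    def truncate(self):
        # After a snapshot, earlier records are covered by the synced state.
        self._records.clear()

    def replay(self):
        return list(self._records)


class CountTask:
    """A stateful task that keeps a running count per key."""

    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1


def snapshot(primary, backup, buffer):
    # Assumption: a full state copy stands in for incremental synchronization.
    backup.state = deepcopy(primary.state)
    buffer.truncate()


def recover(backup, buffer):
    # Promote the pre-built backup task and replay the cached records.
    for key in buffer.replay():
        backup.process(key)
    return backup


def demo():
    buffer, primary, backup = UpstreamBuffer(), CountTask(), CountTask()
    for key in ["a", "b", "a"]:
        primary.process(buffer.emit(key))
    snapshot(primary, backup, buffer)
    for key in ["b", "c"]:
        primary.process(buffer.emit(key))
    expected = deepcopy(primary.state)  # state at the moment the primary fails
    restored = recover(backup, buffer)
    return expected, restored.state
```

In this sketch, recovery only replays the records cached after the last snapshot, so the work done on failure is bounded by the snapshot interval rather than by the full job history.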
Authors
刘阳
张扬扬
周号益
LIU Yang; ZHANG Yangyang; ZHOU Haoyi (Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China; School of Computer Science and Engineering, Beihang University, Beijing 100191, China; ShenYuan Honors College, Beihang University, Beijing 100191, China; College of Software, Beihang University, Beijing 100191, China)
Source
《计算机应用》
CSCD
Peking University Core Journals (北大核心)
2022, Issue 11, pp. 3337-3345 (9 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (U20B2053, 61872022)
Open Project of the State Key Laboratory of Software Development Environment (SKLSDE-2020ZX-12).