
Helius: a lightweight big data processing system

Cited by: 1
Abstract: To address the limitations of Spark, namely immutable datasets and the substantial overhead of code execution, memory management, and data serialization/deserialization caused by its dependence on the Java Virtual Machine (JVM), a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. It uses C/C++ to optimize memory management and network transmission, and adopts a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. Experimental results show that, over 5 iterations, the running time of PageRank on Helius was only 25.12% to 53.14% of that on Spark, and the running time of TPC-H Q6 on Helius was only 57.37% of that on Spark. With one PageRank iteration, the IP data received and sent by the master node under Helius were about 40% and 15%, respectively, of those under Spark, and over a 200 s run the total memory consumed on the worker node by Helius was about 25% of that consumed by Spark. The results and analysis indicate that, compared with Spark, Helius saves memory, eliminates the need for serialization and deserialization, reduces network interaction, and simplifies fault tolerance.
Authors: DING Mengsu (丁梦苏), CHEN Shimin (陈世敏). Key Laboratory of Computer System and Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 2, pp. 305-310 (6 pages)
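Funding: Chinese Academy of Sciences "Hundred Talents Program"; National Natural Science Foundation of China, General Program (61572468); National Natural Science Foundation of China, Innovative Research Groups Program (61521092)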
Keywords: in-memory computation; big data processing; distributed computation; Directed Acyclic Graph (DAG) scheduling; fault tolerance and recovery