A Benefit Model Based Data Reuse Mechanism for Spark SQL (cited by: 2)

Abstract: Analyzing massive data sets to discover their latent value can bring great benefits. Spark is widely used for large-scale data analytics because of its good scalability and high performance, and Spark SQL is its most commonly used programming interface. Data analytics applications contain many redundant computations, which waste system resources and prolong query execution time. However, Spark SQL is not aware of redundant computations across queries and therefore cannot eliminate them. To address this issue, we present Criss, a benefit-model-based, fine-grained, automatic data reuse mechanism. Criss automatically identifies redundant computations among queries. It then uses an I/O-performance-aware benefit model to select the operator results with the greatest reuse benefit and caches them in hybrid storage consisting of memory and HDD. Cache management and data reuse in Criss operate at partition granularity rather than on the whole result of an operator, which improves both query performance and cache space utilization. We implement Criss on Spark SQL, using a modified TachyonFS for data caching. Experimental results show that Criss improves query performance over the original Spark SQL by 46% to 68%.
Authors: Shen Yijie; Zeng Dan; Xiong Jin (State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049)
Source: Journal of Computer Research and Development (计算机研究与发展; indexed in EI, CSCD, and the Peking University Core Journals list), 2020, No. 2, pp. 318-332 (15 pages)
Funding: National Key Research and Development Program of China (2016YFB1000202); National Natural Science Foundation of China (61379042)
Keywords: data analytics; big data; Spark SQL; redundant computation; data reuse; benefit model
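
The abstract names an I/O-performance-aware benefit model and partition-granularity caching but does not give the model itself. The following is only a minimal sketch of how such a selection could look, not the Criss implementation: the `Candidate` fields, the `READ_MBPS` throughput figures, the benefit formula (reuses times the time saved by reading instead of recomputing), and the greedy fastest-tier-first placement are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One partition of an operator result (all fields are hypothetical)."""
    name: str
    recompute_s: float    # estimated seconds to recompute this partition
    size_mb: float        # size of the partition if cached (must be > 0)
    expected_reuses: int  # number of future queries expected to read it


# Assumed read throughput per storage tier in MB/s (illustrative values only).
READ_MBPS = {"memory": 5000.0, "hdd": 100.0}


def benefit(c: Candidate, tier: str) -> float:
    """Seconds saved by reading the cached partition instead of recomputing it."""
    read_s = c.size_mb / READ_MBPS[tier]
    return c.expected_reuses * (c.recompute_s - read_s)


def choose(cands, capacity_mb):
    """Greedily place partitions, filling the fastest tier first.

    Within a tier, candidates are ranked by benefit per cached megabyte.
    A partition whose benefit on a tier is not positive is never placed there.
    """
    placement = {}
    remaining = list(cands)
    for tier in sorted(capacity_mb, key=lambda t: READ_MBPS[t], reverse=True):
        free = capacity_mb[tier]
        ranked = sorted(remaining,
                        key=lambda c: benefit(c, tier) / c.size_mb,
                        reverse=True)
        for c in ranked:
            if benefit(c, tier) > 0 and c.size_mb <= free:
                placement[c.name] = tier
                free -= c.size_mb
        remaining = [c for c in remaining if c.name not in placement]
    return placement


if __name__ == "__main__":
    cands = [
        Candidate("A", recompute_s=10.0, size_mb=100.0, expected_reuses=3),
        Candidate("B", recompute_s=0.5, size_mb=200.0, expected_reuses=2),
        Candidate("C", recompute_s=20.0, size_mb=400.0, expected_reuses=1),
    ]
    print(choose(cands, {"memory": 100.0, "hdd": 1000.0}))
```

In this toy setup, partition B is cheap to recompute but slow to read back from HDD, so its benefit on that tier is negative and it is left uncached. That is the kind of decision a model aware of heterogeneous I/O performance is meant to capture.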