Abstract
Humanity has entered the era of big data, and massive data processing has become a research hotspot in the field of big data technology. Spark is a typical in-memory distributed big data processing framework, but the data skew that arises in practical Spark applications can significantly degrade computing efficiency. Focusing on the data skew problem encountered in various Spark applications, this paper reviews related research progress in China and abroad, analyzes and compares the optimization methods commonly used when data skew occurs, and finally discusses directions for future research.
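As an illustration of the kind of optimization the survey refers to (not a method taken from the paper itself), the sketch below shows one widely used skew-mitigation technique for Spark aggregations: salting hot keys so their records are spread over several partial groups before a final combine. The toy data, the salt count N_SALT, and all variable names are illustrative assumptions.

import random
from pyspark.sql import SparkSession, functions as F

# Local session for the demo; in a real job the session usually comes from spark-submit.
spark = SparkSession.builder.master("local[*]").appName("skew-salting-demo").getOrCreate()

# Toy data: the key "hot" is heavily skewed relative to the other keys.
rows = [("hot", random.random()) for _ in range(50000)] + \
       [("k%d" % i, random.random()) for i in range(1000)]
df = spark.createDataFrame(rows, ["key", "value"])

N_SALT = 8  # number of salt buckets; an assumed tuning parameter

# Stage 1: attach a random salt to every record and pre-aggregate, so the hot
# key's records are split across up to N_SALT partial groups on different tasks.
partial = (df.withColumn("salt", (F.rand() * N_SALT).cast("int"))
             .groupBy("key", "salt")
             .agg(F.sum("value").alias("part_sum"), F.count("*").alias("part_cnt")))

# Stage 2: drop the salt and combine the partial aggregates into the final result.
result = (partial.groupBy("key")
                 .agg(F.sum("part_sum").alias("total"), F.sum("part_cnt").alias("cnt")))

result.orderBy(F.desc("cnt")).show(5)
spark.stop()

The two-stage aggregation trades a small amount of extra shuffle work for much more even task sizes, which is the common thread in most of the skew mitigations the literature compares.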
Authors
ZHANG Zhan-feng (张占峰), WANG Wen-li (王文礼), GENG Shan-shan (耿珊珊), JIA Zhi-ting (贾芝婷)
College of Information Technology, Hebei University of Economics and Business, Shijiazhuang, Hebei 050061, China
Source
Journal of The Hebei Academy of Sciences (河北省科学院学报), 2020, No. 1, pp. 1-7
Funding
Supported by the 2019 Hebei Province Postgraduate Innovation Funding Project (CXZZSS2019106).