Abstract
To improve the efficiency of classification mining on stream data, we study how to parallelize the concept-adapting very fast decision tree algorithm (CVFDT) by deploying it on the stream data computing platform Spark, and we design a Spark-based parallel implementation of CVFDT. First, CVFDT is parallelized across attributes, that is, the split-point calculation is parallelized. Then, during tree construction, all attribute lists of a node are converted into Spark's resilient distributed datasets (RDDs); the parallel tasks generated from each RDD are computed, and their best split points are collected and compared. The Hoeffding bound is then calculated as the node-splitting condition to determine the best split point, and the decision tree is built recursively. Experimental results show that CVFDT classifies significantly more efficiently on a Spark cluster than in a stand-alone environment, that the improved parallel CVFDT adapts well to large-scale stream data processing, and that a reasonable RDD filtering setting further improves classification efficiency.
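The split decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a thread pool stands in for the per-attribute Spark tasks, information gain is assumed as the split measure (the abstract does not name one), and CVFDT's sliding window and alternate subtrees are omitted. All function names are hypothetical. The bound used is the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), with R = log2(number of classes) for information gain.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of splitting on one discrete attribute column."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(columns, labels, delta=1e-3):
    """Score attributes in parallel; return the best attribute name if the
    Hoeffding test passes, else None. Expects at least two attributes."""
    names = list(columns)
    # Assumption: one task per attribute, mimicking one Spark task per RDD.
    with ThreadPoolExecutor() as pool:
        gains = list(pool.map(lambda a: info_gain(columns[a], labels), names))
    ranked = sorted(zip(gains, names), reverse=True)
    (best_gain, best_name), (second_gain, _) = ranked[0], ranked[1]
    n_classes = len(set(labels))
    value_range = math.log2(n_classes) if n_classes > 1 else 1.0
    eps = hoeffding_bound(value_range, delta, len(labels))
    # Split only when the best attribute beats the runner-up by more than eps.
    return best_name if best_gain - second_gain > eps else None
```

With few examples the bound is wide and the node does not split; as more examples arrive, the bound shrinks and a clearly better attribute is selected.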
Authors
ZHUANG Rong (庄荣); LI Ling-juan (李玲娟)
School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Source
Computer Technology and Development (《计算机技术与发展》), 2018, No. 6, pp. 35-38 (4 pages)
Funding
National Natural Science Foundation of China (61302158, 61571238)