Research on Parallelization of the Concept-adapting Very Fast Decision Tree (CVFDT) Classification Algorithm Based on Spark
Abstract: To improve the efficiency of classification mining on stream data, this paper studies how to parallelize the Concept-adapting Very Fast Decision Tree algorithm (CVFDT) by deploying it on the stream computing platform Spark, and designs a Spark-based parallel implementation. First, CVFDT is parallelized across attributes, that is, the computation of split points is parallelized. Then, during tree construction, all attribute lists of a node are converted into Spark's resilient distributed datasets (RDDs); the parallel tasks generated from each RDD are computed, and the best split points they produce are collected and compared. The Hoeffding bound is then computed as the node-splitting condition to determine the best split point, and the decision tree is built recursively. Experimental results show that CVFDT classifies significantly more efficiently on a Spark cluster than on a single machine, that the improved parallel CVFDT adapts well to large-scale stream data, and that a reasonable RDD filtering setting improves classification efficiency further.
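The split decision described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes discrete attributes, information gain as the split measure, and the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), with the tie-breaking threshold used in the VFDT family of algorithms. A thread pool stands in for the per-attribute Spark tasks, and the names `choose_split`, `delta` and `tie_threshold` are hypothetical.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of splitting on one discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def choose_split(attribute_lists, labels, delta=1e-3, tie_threshold=0.05):
    """Return the attribute to split on, or None if the evidence is insufficient.

    attribute_lists maps an attribute name to its column of values and
    must contain at least two attributes.
    """
    names = list(attribute_lists)
    # One task per attribute list. The thread pool is only a local stand-in
    # (an assumption of this sketch) for the parallel tasks Spark would run.
    with ThreadPoolExecutor() as pool:
        gains = list(pool.map(lambda a: info_gain(attribute_lists[a], labels), names))
    # Collect and compare the per-attribute results.
    ranked = sorted(zip(gains, names), reverse=True)
    (best_gain, best), (second_gain, _) = ranked[0], ranked[1]
    # Information gain ranges over [0, log2(number of classes)].
    n_classes = len(set(labels))
    value_range = math.log2(n_classes) if n_classes > 1 else 1.0
    eps = hoeffding_bound(value_range, delta, len(labels))
    # Split when the best attribute beats the runner-up by more than epsilon,
    # or when epsilon is small enough that the two are effectively tied.
    if best_gain - second_gain > eps or eps < tie_threshold:
        return best
    return None
```

Calling `choose_split({"a": [...], "b": [...]}, labels)` returns the name of the winning attribute once enough examples have accumulated, and `None` otherwise. In a Spark setting each attribute list would be an RDD rather than a Python list.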
Authors: ZHUANG Rong; LI Ling-juan (School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Source: Computer Technology and Development (《计算机技术与发展》), 2018, No. 6, pp. 35-38 (4 pages)
Funding: National Natural Science Foundation of China (61302158, 61571238)
Keywords: data streams; CVFDT; parallelization; Spark; resilient distributed datasets