Abstract
Traditional data mining approaches suffer from limited computing power, low efficiency, and poor scalability when handling massive, multi-dimensional, and complex data. This paper proposes a decision tree classification mining method based on Map/Reduce (the C4.5BH algorithm). The algorithm uses K-means clustering to discretize continuous attributes, and uses the Map/Reduce programming model together with an attribute table structure to compute attributes in parallel and split nodes in parallel during decision tree construction. Experiments show that, compared with the traditional C4.5 algorithm, C4.5BH achieves higher execution efficiency and good speedup on large-scale data sets.
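The two ideas named in the abstract, K-means discretization of a continuous attribute and Map/Reduce-style counting to score attributes, can be illustrated with a minimal single-process sketch. It is not the paper's C4.5BH implementation: the function names (`kmeans_1d`, `map_record`, `reduce_counts`, `gain_ratios`), the evenly spaced centroid initialization, and the toy data are all assumptions made for illustration, and the map and reduce steps are simulated in memory rather than run on a cluster.

```python
import math
from collections import Counter, defaultdict

def nearest(centroids, v):
    """Index of the centroid closest to v (the discrete bin for v)."""
    return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))

def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on one continuous attribute; returns sorted centroids."""
    lo, hi = min(values), max(values)
    # Assumption: deterministic start, centroids spread evenly over the range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(centroids, v)].append(v)
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def map_record(record, label):
    """Map step: emit ((attribute, value, class), 1) for every attribute."""
    for attr, value in record.items():
        yield (attr, value, label), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratios(counts):
    """C4.5 gain ratio of each attribute, computed from the reduced counts."""
    per_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), n in counts.items():
        per_attr[attr][value][label] += n
    ratios = {}
    for attr, by_value in per_attr.items():
        class_totals = Counter()
        for dist in by_value.values():
            class_totals.update(dist)
        total = sum(class_totals.values())
        sizes = [sum(d.values()) for d in by_value.values()]
        cond = sum(s / total * entropy(d.values())
                   for s, d in zip(sizes, by_value.values()))
        gain = entropy(class_totals.values()) - cond
        split_info = entropy(sizes)
        ratios[attr] = gain / split_info if split_info > 0 else 0.0
    return ratios

# Toy data (hypothetical): one continuous and one categorical attribute.
temps = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
windy = ["t", "f", "t", "f", "t", "f"]
labels = ["no", "no", "no", "yes", "yes", "yes"]

centroids = kmeans_1d(temps, 2)
records = [{"temp": nearest(centroids, t), "windy": w}
           for t, w in zip(temps, windy)]
pairs = [kv for r, y in zip(records, labels) for kv in map_record(r, y)]
ratios = gain_ratios(reduce_counts(pairs))
best = max(ratios, key=ratios.get)  # attribute chosen to split the node
```

Because each emitted key carries its attribute name, the counts for different attributes are independent and could be reduced on separate workers, which is the sense in which attribute evaluation parallelizes. How the paper's attribute table is laid out, and how node splitting is distributed, is not described in the abstract and is not modeled here.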
Source
Computer & Digital Engineering (《计算机与数字工程》)
2016, No. 8, pp. 1504-1510 (7 pages)
Funding
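Supported by the National Key Technology R&D Program of China (Grant No. 2015BAB07B01)
and the Special Fund for Public Welfare Industry Research of the Ministry of Water Resources (Grant No. 201501022)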