Abstract
Current decision tree algorithms seldom consider how noise in the training set affects the model, and traditional memory-resident algorithms struggle to process massive data. To address both problems, an imprecise probability C4.5 algorithm based on the Hadoop platform, named IP-C4.5, is proposed. When training the model, IP-C4.5 treats the training set used to build the tree as unreliable and selects split attributes using an information gain ratio based on imprecise probability, which reduces the influence of training-set noise on the model. On Hadoop, IP-C4.5 is implemented as a MapReduce program based on file splitting, which strengthens its ability to process massive data. Comparison experiments against C4.5 and the Complete Credal Decision Tree (CCDT) algorithm show that IP-C4.5 achieves higher accuracy when the training data are noisy, and that the advantage is most pronounced when the noise level exceeds 10%. The Hadoop-based parallel IP-C4.5 algorithm is also able to process massive data.
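The abstract does not spell out how the imprecise-probability information gain ratio is computed. The following is a minimal sketch of one common formulation from the credal decision tree literature, not the authors' implementation. It assumes the Imprecise Dirichlet Model with a hypothetical parameter `s`, takes the maximum-entropy distribution in the resulting credal set, and divides the resulting gain by the usual C4.5 split information. The function names and the row format (a list of dicts) are illustrative assumptions.

```python
import math
from collections import Counter


def max_entropy_idm(counts, s=1.0):
    """Maximum-entropy distribution under an assumed Imprecise Dirichlet Model.

    The extra mass s is spread over the least frequent classes
    (water-filling) so the result is as uniform as the counts allow.
    Expects a non-empty list of non-negative counts.
    """
    levels = sorted(float(c) for c in counts)
    total = sum(levels) + s
    remaining = s
    level = levels[0]
    filled = 1  # number of classes currently raised to the water level
    while filled < len(levels):
        need = (levels[filled] - level) * filled
        if need >= remaining:
            break
        remaining -= need
        level = levels[filled]
        filled += 1
    level += remaining / filled
    return [max(float(c), level) / total for c in counts]


def entropy(probs):
    """Shannon entropy in bits, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def imprecise_gain_ratio(rows, attr, label, s=1.0):
    """Imprecise information gain of `attr` divided by its split information."""
    classes = sorted({r[label] for r in rows})

    def upper_entropy(subset):
        c = Counter(r[label] for r in subset)
        return entropy(max_entropy_idm([c[k] for k in classes], s))

    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    n = len(rows)
    conditional = sum(len(g) / n * upper_entropy(g) for g in groups.values())
    gain = upper_entropy(rows) - conditional
    split_info = entropy([len(g) / n for g in groups.values()])
    return gain / split_info if split_info > 0 else 0.0
```

In this sketch a small branch with few examples keeps a relatively high upper entropy, so an attribute that separates the classes only on a handful of possibly noisy rows is rewarded less than it would be under the classical gain ratio. In a MapReduce setting, the per-attribute class counts would be the natural quantity to aggregate across file splits before applying such a criterion, though the abstract gives no detail on how the paper does this.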
Source
《计算机应用》
CSCD
PKU Core (Peking University core journal list)
2015, No. 4, pp. 1143-1147 (5 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (31370565)
Harbin Special Fund for Scientific and Technological Innovation Talent Research (2013RFXXJ089)