
Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop

Cited by: 11
Abstract: Concerning that current decision tree algorithms seldom consider the influence of noise in the training set on the model, and that traditional memory-resident algorithms have difficulty processing massive data, an Imprecise Probability C4.5 algorithm named IP-C4.5 was proposed based on Hadoop. When training the model, the IP-C4.5 algorithm treats the training set used to build the tree as unreliable, and uses an imprecise-probability information gain ratio as the split-attribute selection criterion to reduce the influence of noisy data on the model. To enhance the ability to deal with massive data, IP-C4.5 was implemented on Hadoop with MapReduce programming based on file splits. Comparative experiments with C4.5 and the Complete CDT (CCDT) algorithm show that when the training set is noisy, the accuracy of IP-C4.5 is higher than that of C4.5 and CCDT, and it performs especially well when the data noise degree exceeds 10%; moreover, the parallelized IP-C4.5 based on Hadoop has the ability to deal with massive data.
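The imprecise-probability criterion summarized above replaces point estimates of class probabilities with the intervals of the Imprecise Dirichlet Model (IDM), and scores a split by the maximum entropy over the resulting credal set. A minimal sketch of that idea in Python (the function names and the greedy mass-levelling strategy here are illustrative, not the paper's actual code):

```python
import math

def upper_entropy(counts, s=1.0):
    """Maximum entropy over the credal set induced by the Imprecise
    Dirichlet Model: each class probability lies in the interval
    [n_c / (N + s), (n_c + s) / (N + s)].  Entropy is maximized by
    greedily assigning the extra mass s to the least-frequent classes."""
    counts = [float(c) for c in counts]
    rem = s
    while rem > 1e-9:
        m = min(counts)
        lows = [i for i, c in enumerate(counts) if c - m <= 1e-9]
        higher = [c for c in counts if c - m > 1e-9]
        # Raise all minimal counts up to the next distinct level,
        # or spread the remaining mass evenly if no level is left.
        target = min(higher) if higher else m + rem / len(lows)
        add = min(rem, (target - m) * len(lows)) / len(lows)
        for i in lows:
            counts[i] += add
        rem -= add * len(lows)
    total = sum(counts)  # equals N + s
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def imprecise_info_gain(class_counts, partition_counts, s=1.0):
    """IDM-based information gain: upper entropy of the parent node
    minus the weighted upper entropies of the child partitions."""
    n = sum(class_counts)
    child = sum(sum(p) / n * upper_entropy(p, s) for p in partition_counts)
    return upper_entropy(class_counts, s) - child
```

Note how noise-robustness emerges: a pure-looking but small child partition such as `[10, 0]` still receives nonzero upper entropy, so splits supported by few examples are penalized relative to classical C4.5.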
Source: Journal of Computer Applications (CSCD, Peking University Core), 2015, No. 4, pp. 1143-1147 (5 pages)
Funding: National Natural Science Foundation of China (31370565); Harbin Science and Technology Innovation Talent Research Special Fund (2013RFXXJ089)
Keywords: Hadoop; C4.5; imprecise probability; noisy data; parallelization
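The file-split MapReduce scheme mentioned in the abstract amounts to computing, per split, the attribute/class frequency counts that the gain-ratio criterion needs, then merging them in the reduce phase. A Hadoop-Streaming-style sketch in plain Python (the comma-separated record layout and the function names are assumptions for illustration, not the paper's implementation):

```python
from collections import defaultdict

def mapper(lines):
    """Each mapper processes one file split: for every record it emits
    ((attribute_index, attribute_value, class_label), 1)."""
    for line in lines:
        *attrs, label = line.strip().split(',')
        for i, value in enumerate(attrs):
            yield (i, value, label), 1

def reducer(pairs):
    """Sums the per-split counts; the driver then evaluates the
    imprecise-probability gain ratio on the merged counts to choose
    the split attribute for the current tree node."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Example: two records from a single split.
split = ["sunny,hot,no", "sunny,cool,yes"]
counts = reducer(mapper(split))
```

Because each mapper touches only its own split, the counting pass scales with the number of splits rather than requiring the whole training set to fit in memory.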

