Abstract
To address three problems that arise when the C4.5 decision tree algorithm is applied to large-sample data (the training set must remain resident in memory, classification accuracy falls short of requirements, and it is unclear how to select the optimal classification rules), an improved C4.5 decision tree algorithm based on classification rule selection is proposed. The training set is randomly sampled with replacement several times to form multiple sets of classification rules, and the optimal value of each feature is sought across these rules to establish the optimal classification rules. Partition similarity is used as the criterion for selecting the optimal features of the C4.5 decision tree. On this basis, the C4.5 decision tree algorithm is improved using the selected optimal classification rules and optimal features. Experimental results show that the improved algorithm effectively solves the problem that a C4.5 decision tree depends strongly on its initial training set, significantly raises the recognition rate on large-sample data sets, and markedly reduces training time.
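The resampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not define partition similarity or the rule-selection procedure, so this example substitutes the standard C4.5 gain ratio as the feature criterion and a simple vote count across resamples; the function names (`entropy`, `gain_ratio`, `bootstrap_sample`, `best_feature_votes`) and the voting scheme are illustrative assumptions, not the paper's method.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Standard C4.5 gain ratio of one categorical feature (a column index).

    Assumption: this stands in for the paper's partition-similarity
    criterion, which the abstract does not define.
    """
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises features with many distinct values.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples at random with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def best_feature_votes(rows, labels, n_rounds=10, seed=0):
    """Count how often each feature has the highest gain ratio over resamples.

    Hypothetical aggregation: a plain vote, used only for illustration.
    """
    rng = random.Random(seed)
    votes = Counter()
    n_features = len(rows[0])
    for _ in range(n_rounds):
        r, y = bootstrap_sample(rows, labels, rng)
        best = max(range(n_features), key=lambda f: gain_ratio(r, y, f))
        votes[best] += 1
    return votes
```

Calling `best_feature_votes(rows, labels)` on a list of categorical tuples and a parallel list of labels returns a `Counter` mapping feature indices to the number of resamples in which that feature scored highest. Because each round only touches one resample, the rounds could in principle be run on subsets that fit in memory, which is the motivation the abstract gives for sampling.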
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2013, No. 12, pp. 4321-4325, 4330 (6 pages)
Funding
National High Technology Research and Development Program of China (863 Program) (2011AA010603, 2011AA010605)
Keywords
C4.5 decision tree
classification rules
attribute measures
partition similarity
feature selection