Abstract
The traditional C4.5 algorithm tends to produce redundant rules, oversized decision trees, and slow classification. To address these problems, an improved C4.5 decision tree algorithm based on cosine similarity is proposed. The information entropy and gain ratio of each attribute are computed first. If, for any attribute, the information entropies of any two of its attribute values differ by no more than a small amount, the cosine similarity of those two attribute values is computed. Attribute values whose similarity falls within the threshold range are merged, the information gain ratio of the merged attribute is recalculated, and the remaining steps follow the traditional C4.5 algorithm. General health examination data sampled from a hospital were used for simulation. The results show that the proposed algorithm effectively reduces the dimensionality of the split attributes, shrinks the decision tree, removes redundant rules, and improves classification speed.
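The merging step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how an attribute value is turned into a vector, so the sketch assumes each attribute value is represented by its class-count vector, that the entropy of an attribute value is the entropy of that vector, and that merging is greedy against the first matching representative. The names `merge_similar_values`, `eps`, and `theta`, and their default values, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a vector of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_vectors(values, labels):
    """Map each attribute value to its class-count vector (assumed representation)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    return {v: [c[k] for k in classes] for v, c in counts.items()}


def merge_similar_values(values, labels, eps=0.05, theta=0.95):
    """Relabel attribute values so that similar ones share one label.

    Two values are merged when their entropies differ by less than eps
    and their class-count vectors have cosine similarity >= theta.
    eps and theta are hypothetical thresholds; the greedy comparison
    against the first matching representative is a simplification.
    """
    vecs = class_vectors(values, labels)
    mapping = {}  # original value -> representative value
    reps = []     # representative values kept so far
    for v in sorted(vecs):
        for r in reps:
            close = abs(entropy(vecs[v]) - entropy(vecs[r])) < eps
            if close and cosine(vecs[v], vecs[r]) >= theta:
                mapping[v] = r
                break
        else:
            mapping[v] = v
            reps.append(v)
    return [mapping[v] for v in values]


def gain_ratio(values, labels):
    """C4.5 gain ratio: information gain divided by split information."""
    n = len(labels)
    vecs = class_vectors(values, labels)
    base = entropy(list(Counter(labels).values()))
    cond = sum(sum(vec) / n * entropy(vec) for vec in vecs.values())
    split = entropy([sum(vec) for vec in vecs.values()])
    return (base - cond) / split if split else 0.0
```

Calling `merge_similar_values(values, labels)` on one attribute column returns a relabeled column, and `gain_ratio` can then be evaluated on it before the usual C4.5 attribute selection. Merging values lowers the split information, so the gain ratio of the merged attribute generally differs from that of the original, which is why the abstract recalculates it.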
Source
Computer Engineering and Design (《计算机工程与设计》)
PKU Core Journal (北大核心)
2018, Issue 1, pp. 120-125 (6 pages)
Funding
Natural Science Foundation of Shandong Province (ZR2014FL019)
Shandong Province Higher Educational Science and Technology Program (J14LN31)
Keywords
data mining
C4.5 algorithm
cosine similarity
decision tree
dimensionality reduction