Abstract
This paper addresses the text classification problem in data mining and proposes Genetic-Miner (GM), a classification rule extraction algorithm based on a genetic algorithm and information entropy. The goal of GM is to discover classification rules in data sets. GM first uses information entropy to generate the initial population and then applies an optimized genetic algorithm to extract the corresponding rules. GM was compared with two well-known algorithms of the same kind, Ant-Miner and CN2, on six standard public-domain data sets. The results show that GM clearly outperforms both Ant-Miner and CN2 in predictive accuracy and in rule list simplicity, and that it substantially improves the comprehensibility of the discovered knowledge.
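The abstract does not say how entropy is used to build the initial population. As a rough illustration only, the sketch below assumes that each candidate condition is an (attribute, value) term scored by the Shannon entropy of the classes it covers, so that low-entropy terms can seed the first rules. The function names, the term representation, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) summed over the classes that actually occur
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def rank_terms_by_entropy(records, labels):
    """Rank (attribute, value) terms by the entropy of the classes they cover.

    Assumption: a lower entropy means the term covers a purer set of
    classes, so it is a more promising condition for an initial rule.
    """
    covered = {}
    for record, label in zip(records, labels):
        for attribute, value in record.items():
            covered.setdefault((attribute, value), []).append(label)
    scored = [(class_entropy(lbls), term) for term, lbls in covered.items()]
    return sorted(scored)


# Hypothetical toy data: two binary word-presence attributes per document.
records = [
    {"contains_goal": "yes", "contains_vote": "no"},
    {"contains_goal": "yes", "contains_vote": "no"},
    {"contains_goal": "no", "contains_vote": "yes"},
    {"contains_goal": "no", "contains_vote": "no"},
]
labels = ["sports", "sports", "politics", "sports"]

ranking = rank_terms_by_entropy(records, labels)
for entropy, term in ranking:
    print(term, round(entropy, 3))
```

In a genetic algorithm, terms near the top of such a ranking could be sampled more often when building the first generation of candidate rules, leaving crossover and mutation to refine them. That seeding strategy is one plausible reading of the abstract, not a description of GM itself.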
Source
Journal of Sun Yat-sen University (Natural Science Edition)
CAS
CSCD
Peking University Core Journals
2007, No. 5, pp. 18-21, 24 (5 pages)
Acta Scientiarum Naturalium Universitatis Sunyatseni
Funding
Supported by the National Natural Science Foundation of China (60573127)
Keywords
text classification rule
data mining
knowledge discovery
information entropy
genetic algorithm