Abstract
Text classification is an important technique in text data mining and has been widely applied in information management, search engines, recommendation systems, and other fields. Most existing text classification methods are based on the vector space model; these algorithms are computationally expensive and are difficult to apply to large-scale text datasets. To address this, we propose a text classification rule extraction method based on a genetic algorithm and information entropy. In this method, information entropy is used to assist in generating the initial population of the genetic algorithm. The effective integration of the genetic algorithm with information entropy greatly improves the classification efficiency of the hybrid method. Experimental results show that the proposed method is suitable for large-scale text datasets, and that the extracted rules achieve high classification accuracy and fast classification speed.
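The abstract does not give the details of how information entropy guides the initial population, so the following is only a minimal sketch of one plausible reading: each term is scored by its information gain (the reduction in class entropy when the corpus is split on the presence of the term), and candidate rules are seeded by sampling terms in proportion to that score. The representation of documents as term sets, the representation of a rule as a set of terms, and the names `entropy`, `information_gain`, and `seed_population` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy when the corpus is split on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def seed_population(docs, labels, pop_size, rule_len, rng):
    """Assumed seeding scheme: each rule is a set of distinct terms,
    drawn without replacement with probability proportional to information gain."""
    vocab = sorted(set().union(*docs))
    # Small floor so zero-gain terms remain selectable and weights never sum to zero.
    weights = [information_gain(docs, labels, t) + 1e-9 for t in vocab]
    population = []
    for _ in range(pop_size):
        candidates, w, rule = list(vocab), list(weights), []
        for _ in range(min(rule_len, len(vocab))):
            i = rng.choices(range(len(candidates)), weights=w, k=1)[0]
            rule.append(candidates.pop(i))
            w.pop(i)
        population.append(frozenset(rule))
    return population


# Toy corpus (hypothetical data): each document is a set of terms.
docs = [{"goal", "match"}, {"match", "team"}, {"stock", "market"}, {"market", "bank"}]
labels = ["sport", "sport", "finance", "finance"]
population = seed_population(docs, labels, pop_size=5, rule_len=2, rng=random.Random(0))
```

Seeding with high-gain terms is meant to start the genetic search near discriminative rules instead of from uniformly random ones; the crossover, mutation, and fitness steps of the genetic algorithm are not described in the abstract and are left out here.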
Source
《微计算机信息》
Peking University Core Journal (北大核心)
2008, No. 27, pp. 268-270 (3 pages)
Control & Automation
Keywords
Text classification
genetic algorithm
information entropy
text mining