Abstract
Traditional text classification algorithms give every feature term the same influence on the classification result, achieve low classification accuracy, and incur increased time complexity. To address these problems, this paper proposes an improved text classification method based on maximum entropy and C-means clustering. The method combines the strengths of the C-means clustering algorithm and the maximum entropy algorithm: it takes Shannon entropy as the objective function of the maximum entropy model, which simplifies the form of the classifier, and then applies C-means clustering to classify the optimal features. Simulation results show that, compared with traditional text classification methods, the proposed method quickly obtains the optimal classification feature subset and greatly improves text classification accuracy.
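The abstract does not give the paper's exact formulation, so the following is only a minimal sketch of the general idea, under stated assumptions: terms are scored by the Shannon entropy of their distribution over classes (lower entropy is treated as more discriminative), and the selected features are then grouped with plain hard C-means (k-means). The function names, the scoring rule, and the toy data are hypothetical and are not taken from the paper.

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)), skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_features(term_class_counts, k):
    """Assumed scoring rule: keep the k terms whose class distribution
    has the lowest entropy. Maps term -> list of per-class counts."""
    scores = {}
    for term, counts in term_class_counts.items():
        total = sum(counts)
        if total == 0:
            continue  # a term that never occurs carries no information
        scores[term] = shannon_entropy([c / total for c in counts])
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (scores[t], t))[:k]

def c_means(points, initial_centers, iterations=20):
    """Hard C-means (k-means) from the given initial centers; returns labels."""
    centers = [list(c) for c in initial_centers]  # do not mutate the input
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (hypothetical): per-class counts for four terms, two classes.
counts = {"goal": [9, 1], "stock": [0, 8], "the": [5, 5], "match": [7, 0]}
features = select_features(counts, 2)
# Documents as count vectors over the two selected features.
docs = [[3, 0], [2, 0], [0, 4], [0, 3]]
labels = c_means(docs, initial_centers=[docs[0], docs[2]])
```

In this sketch the uninformative term "the" scores the maximum entropy of 1 bit and is discarded, which illustrates why entropy-based weighting avoids treating all feature terms as equally influential.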
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2012, No. 4, pp. 1297-1299 (3 pages)
Application Research of Computers
Funding
Supported by the Scientific Research Project Fund of the Education Department of Guangxi (200911LX486, 201106LX745)
Keywords
text classification
maximum entropy
C-means clustering
feature selection