Abstract
This paper proposes ETFC, a new method for computing feature-word weights from term frequency and document frequency. First, a new exponential function is constructed to measure how well a feature word discriminates between categories, which strengthens the discriminating power of words that occur in only a few documents. Clustering experiments are then carried out with the k-means algorithm. The results show that ETFC achieves substantially higher clustering purity and algorithm stability than the existing TFIDF and TFC weighting schemes, which indicates that the improvement strategy is feasible.
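For orientation, the two baseline weightings and the purity measure named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code. It assumes the common textbook definitions: TFIDF as tf × log(N / df), and TFC as that same weight scaled to unit Euclidean length per document. The function names `tfidf_weights`, `tfc_weights`, and `purity` are hypothetical. The ETFC formula is not given in the abstract, so it is deliberately not reproduced here.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed definition: weight = tf * log(N / df), natural logarithm.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def tfc_weights(docs):
    """Assumed TFC definition: TFIDF weights scaled to unit length per document."""
    normalized = []
    for w in tfidf_weights(docs):
        norm = math.sqrt(sum(v * v for v in w.values()))
        # A document whose terms all have zero weight keeps zero weights.
        normalized.append({t: (v / norm if norm > 0 else 0.0) for t, v in w.items()})
    return normalized


def purity(cluster_ids, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)


# Toy corpus (illustrative data only, not from the paper).
docs = [
    ["ball", "goal", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "price"],
    ["market", "price", "price"],
]
tfidf = tfidf_weights(docs)
tfc = tfc_weights(docs)
```

The resulting weight vectors could be passed to any k-means implementation, and `purity` then compares the cluster assignments with the known class labels. Under these definitions a term that appears in only one document receives the largest IDF factor, which is the low-document-frequency case the abstract says ETFC is designed to emphasize further.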
Source
Chinese Journal of Engineering Mathematics (《工程数学学报》)
CSCD
PKU Core Journal (北大核心)
2012, No. 4, pp. 523-528 (6 pages)
Funding
Fundamental Research Funds for the Central Universities (xjj2009068)