Abstract
Term Frequency/Inverse Document Frequency (TF-IDF), one of the feature weighting schemes in the Vector Space Model (VSM), is widely used in text categorization. It focuses on term frequency and inverse document frequency but fails to capture how feature terms are distributed among and within classes. To raise the weights of representative feature terms that occur frequently in one class and are evenly distributed within it, a concentration coefficient of the feature-term distribution is introduced to improve the IDF function, a dispersion coefficient is applied as an additional weight, and the TF-IIDF-DIC weighting function is proposed. Experimental results show that K-NN text classification based on the TF-IIDF-DIC weighting algorithm improves the macro-averaged F1 value by 6.79% compared with the conventional TF-IDF algorithm.
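The limitation described in the abstract can be illustrated with a minimal sketch of the conventional TF-IDF baseline only. The toy corpus, the function name `tf_idf`, and the formula variant tf * log(N / df) are assumptions made for illustration. This record does not give the TF-IIDF-DIC formula, so the proposed weighting is not reproduced here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t)).

    This is one common textbook variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus: documents 0-1 are "sports", documents 2-3 are "finance".
docs = [
    ["goal", "match", "report"],
    ["goal", "team", "news"],
    ["stock", "market", "report"],
    ["stock", "bank", "news"],
]

weights = tf_idf(docs)
# "goal" appears only in the sports class; "report" is split across both classes.
# Both terms have df = 2, so plain TF-IDF cannot tell them apart.
w_goal = weights[0]["goal"]
w_report = weights[0]["report"]
print(w_goal == w_report)
```

In this sketch the class-specific term "goal" and the class-neutral term "report" receive identical weights in the first document, which is the kind of between-class and within-class distribution information that the abstract says TF-IDF ignores.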
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core Journals (北大核心)
2010, No. 9, pp. 197-199, 202 (4 pages)
Funding
Provincial Natural Science Foundation of Anhui Higher Education Institutions (Grant No. KJ2008B120)
Keywords
Vector Space Model (VSM)
text categorization
feature weighting
feature distribution