Abstract
Text classification is an important tool in text data mining and information retrieval, and computing term weights is a key step in text classification algorithms. To address the shortcomings of the classic term weighting method TF-IDF, this paper proposes a dynamic self-adaptive term weighting method (DATW) for multi-class text classification. DATW considers not only the frequency of a term within a text and the number of training-set texts that contain the term, but also the term's distribution coefficient (dispersion) and the gradient difference of the feature vector, so that it adapts to dynamic text classification. Experimental results show that computing term weights with DATW effectively improves text classification performance compared with TF-IDF.
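For context, the classic TF-IDF baseline that DATW is compared against can be sketched as follows. This is a minimal illustration that assumes the common tf × log(N/df) form and pre-tokenized documents; the function name `tf_idf` and the toy corpus are hypothetical. It does not implement DATW's distribution coefficient or gradient difference terms, which the abstract does not define.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed form: tf(t, d) * log(N / df(t)). Other TF-IDF variants exist.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # relative term frequency times inverse document frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, for illustration only
docs = [
    ["text", "classification", "term", "weight"],
    ["term", "weight", "gradient"],
    ["text", "retrieval"],
]
weights = tf_idf(docs)
```

In this form a term that occurs in every document receives weight zero regardless of how it is spread across classes, which is the kind of limitation that class-aware weighting schemes aim to address.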
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2011, No. 11, pp. 4092-4096 (5 pages)
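Funding
Shanghai Municipal Education Commission Research Fund for Outstanding Young Teachers (SLG10005)
University of Shanghai for Science and Technology Research Innovation Fund (GDCX-Y-102)
AMD University Cooperation Program Special Fund (BOW-02)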
Keywords
text classification
term weighting
TF-IDF
distribution coefficient
gradient difference