Research on a dynamic self-adaptive term-weighting algorithm for multi-class text classification

Cited by: 9
Abstract: Text classification is an important tool in text data mining and information retrieval, and computing term weights is a key step in any text classification algorithm. To address the shortcomings of the classical TF-IDF weighting method, this paper proposes a dynamic self-adaptive term weighting method (DATW). DATW considers not only the frequency of a term within a document and the number of training-set documents that contain the term, but also the term's distribution coefficient and the gradient difference of the feature vector, so that the weights adapt to dynamic text classification. Experimental results show that computing term weights with DATW effectively improves classification performance and outperforms TF-IDF.
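For context, the classical TF-IDF baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard formulation (relative term frequency multiplied by the log inverse document frequency), not the paper's DATW formula, which this record does not give. The function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper for illustration: weight = tf * log(N / df),
    where tf is the term's relative frequency in the document, N is the
    number of documents, and df is the number of documents containing
    the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy corpus (made up for this sketch).
docs = [
    ["gpu", "driver", "gpu", "crash"],
    ["hockey", "team", "win"],
    ["gpu", "team", "benchmark"],
]
weights = tf_idf(docs)
```

Note that this baseline uses no class labels: a term's weight depends only on how often it occurs in a document and on how many documents contain it, and a term that appears in every document receives a weight of zero.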
Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 11, pp. 4092-4096 (5 pages)
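Funding: Shanghai Municipal Education Commission Research Fund for Outstanding Young Teachers (SLG10005); University of Shanghai for Science and Technology Research Innovation Fund (GDCX-Y-102); AMD University Cooperation Program Special Fund (BOW-02)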
Keywords: text classification; term weighting; TF-IDF; distribution coefficient; gradient difference
