Research on a Term Weighting Algorithm Based on Information Entropy Theory (基于信息熵理论的特征权重算法研究)

Cited by: 22
Abstract: Text representation is an indispensable step when applying classification algorithms to text, and the choice of representation method has a critical effect on the final classification accuracy. To address the shortcomings of the classical term weighting method TFIDF (Term Frequency and Inverted Document Frequency), this paper proposes ETFIDF (Entropy based TFIDF), a term weighting algorithm based on information entropy theory. ETFIDF considers not only how often a term occurs in a document and how concentrated the term is in the training set, but also how the term is dispersed across the categories. Experimental results show that computing term weights with ETFIDF improves text categorization performance over TFIDF. The paper also gives a detailed theoretical analysis and experimental study of the relationship between ETFIDF and feature selection. The results show that taking the relationship between terms and categories into account at the text representation stage represents text more accurately, and that, when both accuracy and efficiency are considered, combining ETFIDF with a feature selection algorithm yields better classification results.
Author: Guo Hongyu (郭红钰)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2013, No. 10, pp. 140-146 (7 pages)
Keywords: information entropy; term weighting; feature selection; text categorization
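The abstract describes ETFIDF as combining three signals: term frequency in a document, concentration of the term in the training set, and dispersion of the term across categories. The abstract does not give the paper's formula, so the following is only a minimal sketch of the general idea under stated assumptions. The function names `class_entropy` and `entropy_tfidf`, and the weight `tf * log(N / df) * (1 - H(t) / log2(K))`, are hypothetical choices made for illustration and should not be read as the author's ETFIDF definition.

```python
import math
from collections import Counter, defaultdict


def class_entropy(class_counts):
    """Shannon entropy (base 2) of a term's distribution over classes."""
    total = sum(class_counts.values())
    entropy = 0.0
    for count in class_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def entropy_tfidf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists
    labels : list of class labels, one per document

    ASSUMPTION (not taken from the paper):
        weight = tf * log(N / df) * (1 - H(t) / log2(K))
    where H(t) is the entropy of the term's document counts over the
    K classes, so terms spread evenly across classes are down-weighted.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = Counter()                    # documents containing each term
    term_class_docs = defaultdict(Counter)  # term -> class -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_class_docs[term][label] += 1

    # Maximum possible entropy; guard against a single-class corpus.
    max_entropy = math.log2(n_classes) if n_classes > 1 else 1.0

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        doc_weights = {}
        for term, freq in term_freq.items():
            idf = math.log(n_docs / doc_freq[term])
            concentration = 1.0 - class_entropy(term_class_docs[term]) / max_entropy
            doc_weights[term] = freq * idf * concentration
        weights.append(doc_weights)
    return weights


if __name__ == "__main__":
    docs = [
        ["goal", "match", "team"],
        ["goal", "team", "win"],
        ["stock", "market", "team"],
        ["stock", "price", "win"],
    ]
    labels = ["sport", "sport", "finance", "finance"]
    for doc_weights in entropy_tfidf(docs, labels):
        print(doc_weights)
```

In this toy corpus, a term such as `win`, which occurs in one document of each class, has maximal class entropy and receives weight 0 under this sketch, while `goal`, which occurs only in the `sport` class, keeps its full TF-IDF weight. Whether the paper scales the entropy term in this way is not stated in the abstract.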