Abstract
Text representation is an essential step when applying classification algorithms to text, and the choice of representation method plays a crucial role in the final classification accuracy. To address the shortcomings of the classical term weighting scheme TFIDF (Term Frequency and Inverse Document Frequency), this paper proposes an entropy-based term weighting algorithm, ETFIDF (Entropy-based TFIDF), grounded in information entropy theory. ETFIDF considers not only the frequency of a term within a document and the term's concentration in the training set, but also how the term is dispersed across the individual classes. Experimental results show that computing term weights with ETFIDF effectively improves text categorization performance. In addition, the relationship between ETFIDF and feature selection is analyzed in detail, both theoretically and experimentally. The results show that taking the relationship between terms and classes into account in the text representation stage yields a more accurate representation of the text, and that, when both accuracy and efficiency are considered, combining ETFIDF with a feature selection algorithm achieves better classification results.
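The abstract does not reproduce the ETFIDF formula, so the sketch below only illustrates the general idea it describes: a TFIDF-style weight damped by a normalized-entropy measure of how evenly a term is spread across the classes. The function name etfidf_weights, the specific combination tf * idf * (1 - dispersion), and the normalization by log of the class count are assumptions made for illustration, not the paper's definition.

```python
import math
from collections import defaultdict

def etfidf_weights(docs, labels):
    """Illustrative entropy-adjusted TF-IDF weighting (hypothetical form,
    not the exact formula from the paper).

    docs   : list of token lists, one per document
    labels : list of class labels aligned with docs
    Returns one dict per document mapping term -> weight."""
    n_docs = len(docs)
    classes = sorted(set(labels))

    # Document frequency of each term (for the IDF factor).
    df = defaultdict(int)
    # Per-class occurrence counts (for the entropy/dispersion factor).
    class_tf = defaultdict(lambda: defaultdict(int))
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
        for t in tokens:
            class_tf[t][label] += 1

    def dispersion(term):
        # Normalized entropy of the term's distribution over classes:
        # near 0 when the term is concentrated in one class,
        # near 1 when it is spread evenly over all classes.
        counts = [class_tf[term][c] for c in classes]
        total = sum(counts)
        h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
        return h / math.log(len(classes)) if len(classes) > 1 else 0.0

    weights = []
    for tokens in docs:
        tf = defaultdict(int)
        for t in tokens:
            tf[t] += 1
        w = {}
        for t, f in tf.items():
            idf = math.log(n_docs / df[t])
            # Down-weight terms spread evenly across classes, since they
            # carry little class-discriminating information (assumed form).
            w[t] = f * idf * (1.0 - dispersion(t))
        weights.append(w)
    return weights
```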
Source
《计算机工程与应用》
CSCD
2013, No. 10, pp. 140-146 (7 pages)
Computer Engineering and Applications
Keywords
information entropy
term weighting
feature selection
text categorization