Abstract
In Chinese text categorization, the choice of feature words strongly affects classification accuracy. To address this problem, the paper proposes a text feature selection approach based on dynamic risk ("venture") decision-making. The approach constructs a utility function to evaluate the utility of each feature word for the classification result, then applies the risk decision method to compute the expected loss of each feature word, and finally selects the feature words with the smallest expected losses in order to reduce dimensionality. The approach is applied to Chinese spam filtering and Web page classification. Experimental results on several benchmark datasets show that it selects the feature words that most influence the classification result and that it clearly improves the evaluation metrics of text categorization.
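The pipeline described in the abstract (score each feature word with a utility function, convert utilities into an expected loss, keep the words with the smallest expected loss) can be illustrated with a minimal Python sketch. The abstract does not give the paper's utility function, so everything specific here is an assumption made for illustration: the utility u(w, c) = |P(w | c) - P(w | not c)|, the regret-style loss (best utility under class c minus the word's utility), the use of class priors as state probabilities, and the function names `expected_losses` and `select_features`.

```python
from collections import Counter


def expected_losses(docs, labels):
    """Return {word: expected loss}. docs is a list of token lists,
    labels is the parallel list of class labels."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    if len(class_counts) < 2:
        raise ValueError("need at least two classes")

    # Document frequency of each word, overall and per class.
    df = Counter()
    df_class = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        for w in set(tokens):
            df[w] += 1
            df_class[c][w] += 1

    # ASSUMED utility (not taken from the paper):
    # u(w, c) = |P(w | c) - P(w | not c)|
    def utility(w, c):
        p_in = df_class[c][w] / class_counts[c]
        p_out = (df[w] - df_class[c][w]) / (n_docs - class_counts[c])
        return abs(p_in - p_out)

    u = {w: {c: utility(w, c) for c in class_counts} for w in df}

    # ASSUMED loss: regret against the best word under each class,
    # weighted by the class prior P(c) to form an expectation.
    best = {c: max(u[w][c] for w in df) for c in class_counts}
    prior = {c: class_counts[c] / n_docs for c in class_counts}
    return {
        w: sum(prior[c] * (best[c] - u[w][c]) for c in class_counts)
        for w in df
    }


def select_features(docs, labels, k):
    """Keep the k words with the smallest expected loss (ties broken alphabetically)."""
    losses = expected_losses(docs, labels)
    return sorted(losses, key=lambda w: (losses[w], w))[:k]
```

Calling `select_features(docs, labels, k)` on a tokenized, labeled corpus returns the `k` retained feature words; under the assumed utility, words concentrated in one class receive a lower expected loss than words spread evenly across classes.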
Source
《计算机科学与探索》
CSCD
2013, Issue 10, pp. 933-941 (9 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China (Nos. 60975035, 61273291)
Research Fund for Returned Overseas Scholars of Shanxi Province (No. 2012008)
Keywords
text categorization
feature selection
venture decision