Abstract
Text categorization is an important research direction in text mining, and feature selection based on machine learning is both a key step and a difficult point in text categorization. This paper compares seven feature selection methods commonly used in text categorization: document frequency, information gain, mutual information, the χ² statistic, expected cross entropy, weight of evidence for text, and odds ratio. Each method was evaluated experimentally on a Chinese text corpus from People's Daily Online (人民网) using the Rocchio algorithm. The results indicate that odds ratio outperforms the other feature selection methods. 1 table, 5 references.
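As an illustration of how such evaluation functions score a term, the following minimal sketch computes three of the seven measures (document frequency, the χ² statistic, and odds ratio) from a 2×2 table of document counts for one term and one class. It uses the standard textbook definitions, which may differ in detail from the formulas in the paper; the function name `term_scores`, the parameter names, and the additive smoothing constant `eps` are assumptions introduced here for illustration.

```python
import math


def term_scores(a, b, c, d, eps=0.5):
    """Score one term against one class from a 2x2 document-count table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    eps: additive smoothing for the odds ratio (an assumption, not from the paper)
    """
    n = a + b + c + d

    # Document frequency: how many documents contain the term at all.
    doc_freq = a + b

    # Chi-square statistic for the term/class contingency table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Odds ratio: log odds of the term inside the class versus outside it.
    # Smoothing keeps both probabilities strictly between 0 and 1.
    p_pos = (a + eps) / (a + c + 2 * eps)
    p_neg = (b + eps) / (b + d + 2 * eps)
    odds_ratio = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))

    return {"df": doc_freq, "chi2": chi2, "odds_ratio": odds_ratio}


if __name__ == "__main__":
    # Hypothetical counts: the term appears mostly inside the class.
    print(term_scores(40, 10, 10, 40))
```

In a feature selection step, terms would be ranked by one of these scores and only the top-ranked terms kept as features for the classifier.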
Source
《湖南环境生物职业技术学院学报》
CAS
2008, No. 3, pp. 24-26 (3 pages)
Journal of Hunan Environment Biological Polytechnic
Funding
President's Fund Project of Hunan Environment Biological Polytechnic (No. T05-13)
Project of the Education Department of Hunan Province (No. 07D036)
Keywords
text categorization
feature selection
evaluation function