Abstract
The chi-square statistic (CHI) is an important feature selection method in text categorization. This paper analyzes the characteristics of CHI feature selection and, to address its deficiencies, proposes a new feature selection algorithm based on CHI with a minimum term frequency. The method removes some low-frequency terms by setting a minimum term-frequency threshold, which reduces the interference of low-frequency terms in CHI feature selection. The paper also improves the classical TF-IDF feature weighting method by incorporating the improved CHI feature selection function into the weight calculation, which yields a more reasonable text representation. Experiments on balanced and imbalanced corpora show that the new method effectively improves text categorization performance.
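The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the standard 2x2 chi-square statistic, a threshold applied to each term's total count across the corpus, a term score equal to its maximum CHI over classes, and a weight of the form tf × idf × CHI. The names `chi_square`, `select_features`, `weight`, `min_tf`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic for one (term, class) pair.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, min_tf=2, top_k=100):
    """Drop terms below min_tf, then rank the remaining terms by CHI.

    Assumption: min_tf applies to the term's total count over the corpus,
    and a term's score is its maximum CHI over all classes.
    """
    total_tf = Counter(tok for doc in docs for tok in doc)
    doc_sets = [set(doc) for doc in docs]
    scores = {}
    for term, tf in total_tf.items():
        if tf < min_tf:  # low-frequency term: excluded before CHI is computed
            continue
        best = 0.0
        for cls in set(labels):
            a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cls)
            b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cls)
            c = sum(1 for s, y in zip(doc_sets, labels) if term not in s and y == cls)
            d = len(docs) - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return {t: scores[t] for t in ranked[:top_k]}


def weight(doc, docs, chi_scores):
    """Assumed weighting: tf * idf * CHI for each selected term."""
    n = len(docs)
    tf = Counter(doc)
    weights = {}
    for term, chi in chi_scores.items():
        df = sum(1 for other in docs if term in other)
        idf = math.log(n / df) if df else 0.0
        weights[term] = tf[term] * idf * chi
    return weights


# Toy corpus, invented purely for illustration.
docs = [["ball", "goal", "goal"], ["ball", "team"], ["vote", "law"], ["vote", "rare"]]
labels = ["sport", "sport", "politics", "politics"]
features = select_features(docs, labels, min_tf=2, top_k=10)
doc_weights = weight(docs[0], docs, features)
```

In this toy corpus, "team", "law", and "rare" each occur only once, so the threshold removes them before CHI is computed. That matters because a term confined to a single document can otherwise receive a high CHI score despite carrying little evidence about its class.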
Source
Journal of Southwest University (Natural Science Edition) (《西南大学学报(自然科学版)》)
CAS
CSCD
Peking University Core Journals
2015, No. 6, pp. 137-142 (6 pages)
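Funding
National Natural Science Foundation of China (61462008)
Science and Technology Project of the Chongqing Municipal Education Commission (KJ120622)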
Keywords
text categorization
vector space model
feature selection
Chi-square statistic
low-frequency term
term weighting