Abstract
Imbalanced data sets are common in text classification. Approaching the problem through feature selection, this paper analyzes how the distribution of feature terms within and between classes, and the differences among documents in imbalanced data sets, affect CHI (chi-square) feature selection. It introduces three factors into the traditional chi-square statistical model: an intra-class term-frequency probability factor, an inter-class document-probability concentration factor, and an intra-class uniformity factor. On this basis it proposes an improved CHI feature selection method. Experimental results show that, compared with the method before improvement, the proposed method achieves better classification performance on imbalanced data sets.
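For orientation, the traditional chi-square (CHI) statistic that the paper takes as its baseline can be sketched as below. This is a minimal illustration of the standard textbook formula only. The abstract does not give formulas for the three correction factors, so they are not reproduced here. The function names `contingency` and `chi_square` and the toy corpus are hypothetical and are not taken from the paper.

```python
def contingency(docs, term, cls):
    """Count documents for one (term, class) pair.

    docs: iterable of (set_of_tokens, label) pairs (hypothetical input format).
    Returns (a, b, c, d):
      a = documents in cls that contain term
      b = documents outside cls that contain term
      c = documents in cls that do not contain term
      d = documents outside cls that do not contain term
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_class = label == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/class contingency table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A zero marginal (e.g. the term occurs in no document): score as 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Toy imbalanced corpus: 4 "sports" documents, 1 "finance" document.
toy_docs = [
    ({"match", "goal"}, "sports"),
    ({"match", "team"}, "sports"),
    ({"team", "goal"}, "sports"),
    ({"match", "coach"}, "sports"),
    ({"stock", "market"}, "finance"),
]
score = chi_square(*contingency(toy_docs, "stock", "finance"))
```

Because the statistic squares the term `a*d - b*c`, it scores a term that is strongly absent from a class as highly as one that is strongly present, and it counts only document presence, not term frequency. These are commonly cited limitations of the classical statistic on imbalanced data.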
Author
骆魁永
LUO Kuiyong (School of Information Engineering, Xinyang Agriculture and Forestry University, Xinyang 464000, China)
Source
《商丘师范学院学报》
CAS
2021, No. 6, pp. 9-13 (5 pages)
Journal of Shangqiu Normal University
Funding
University-level Youth Fund project (20200115)