Abstract
The traditional category distinguished words (CDW) feature selection algorithm scores features with two indicators, inter-class dispersion and intra-class importance, but ignores the fact that the two indicators often contribute to the feature scoring function with different weights, which degrades feature selection to some extent. Building on CDW, an improved algorithm with a balance factor (ICDW) is proposed: tuning the balance factor adjusts the contribution weights of the two indicators in the scoring function, giving more effective feature selection and, in turn, better text classification. In text categorization experiments with a Naïve Bayes classifier, ICDW outperforms CDW as well as the commonly used feature selection methods expected cross entropy (ECE), information gain (IG) and chi-square (CHI), with considerable gains in classification accuracy, precision, recall and F1.
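The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea of weighting two indicators with a balance factor. It assumes a convex combination of the two indicator values, which may differ from the paper's actual scoring function. The names `balanced_score` and `select_features`, the parameter `balance`, and the toy indicator values are all hypothetical, and the indicator values are assumed to be precomputed.

```python
def balanced_score(dispersion, importance, balance):
    """Combine two indicator values with a balance factor in [0, 1].

    Assumption: a convex combination. The paper's real scoring
    function is not stated in the abstract and may differ.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    return balance * dispersion + (1.0 - balance) * importance


def select_features(indicators, balance, k):
    """Return the k terms with the highest balanced score.

    indicators maps each term to a pair
    (inter-class dispersion, intra-class importance).
    """
    ranked = sorted(
        indicators,
        key=lambda term: balanced_score(*indicators[term], balance),
        reverse=True,
    )
    return ranked[:k]


# Made-up indicator values, for illustration only.
toy = {
    "goal": (0.9, 0.2),
    "match": (0.5, 0.5),
    "the": (0.1, 0.8),
}

# balance = 1.0 ranks by dispersion only; balance = 0.0 by importance only.
top_by_dispersion = select_features(toy, 1.0, 1)
top_by_importance = select_features(toy, 0.0, 1)
```

In this sketch the balance factor would be tuned on held-out data, for instance by trying several values and keeping the one that gives the best classifier performance.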
Authors
李富星
蒙祖强
LI Fu-xing; MENG Zu-qiang (School of Computer and Electronic Information, Guangxi University, Nanning 530004, China)
Source
《计算机与现代化》
2019, No. 3, pp. 73-77 (5 pages)
Computer and Modernization
Funding
Guangxi Natural Science Foundation of China (2015GXNSFAA139292)
Keywords
text categorization
feature selection
balance factor
category distinguished words