Short Text Classification Based on Improved CHI and TF-IDF
Abstract: To improve the classification of short texts when little data is available, and to reduce the dimensionality of the feature space, this paper addresses the shortcomings of the traditional CHI statistical method and the TF-IDF weight calculation method by proposing a new factor of word class and frequency for feature selection, with the aim of improving classification accuracy. The traditional CHI statistic is sensitive to low-frequency words, and the TF-IDF weight calculation ignores how feature terms are distributed between and within classes. The paper introduces the factor of word class and frequency into both methods and uses the two in combination, which reduces the interference caused by low-frequency words while accounting for special cases in the within-class and between-class distribution of feature words. Using the XGBoost classification algorithm, the proposed method is applied in classification experiments on topic texts that are both short and few in number. The experimental results show that, compared with the traditional CHI and TF-IDF methods, feature selection with the factor of word class and frequency improves classification accuracy on both balanced and unbalanced corpora and greatly reduces memory usage.
Authors: DAI Ji-peng; SHAO Feng-jing; SUN Ren-cheng (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
Source: Computer and Modernization, 2021, No. 6, pp. 6-11 (6 pages)
Funding: National Natural Science Foundation of China, Youth Fund project (41706198).
Keywords: text classification; feature selection; XGBoost; chi-square statistics; TF-IDF
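
For orientation, the two traditional baselines named in the abstract can be sketched as follows. This is a minimal illustration of the textbook chi-square (CHI) statistic and a common TF-IDF variant only. It does not reproduce the paper's factor of word class and frequency, which the abstract does not define. The function names, the toy corpus, and the choice of natural logarithm without smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Textbook chi-square score for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or class covers all or no documents
    return n * (a * d - b * c) ** 2 / denom


def chi_for(term, cls, docs, labels):
    """Build the 2x2 contingency counts from a labelled corpus and score them."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_class = label == cls
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (cnt / len(doc)) * math.log(n / df[t]) for t, cnt in tf.items()}
        )
    return weights


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["match", "goal", "team"],
    ["team", "win"],
    ["stock", "market"],
    ["market", "fall", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

team_score = chi_for("team", "sports", docs, labels)
goal_score = chi_for("goal", "sports", docs, labels)
weights = tf_idf(docs)
```

Note that this TF-IDF variant uses no class labels at all, which is the limitation the abstract points to when it says the weighting ignores the distribution of terms between and within classes.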