Study of text categorization on imbalanced data (不均衡数据集上文本分类方法研究) Cited by: 11
Abstract: Class imbalance is a common problem in practical text classification. Addressing both feature selection and classifier performance, this paper proposes a combined text classification method for imbalanced datasets. For feature selection, the traditional CHI statistic is improved by taking into account the positive and negative correlation between a term and a category, together with the term's strength in discriminating between categories. At the data level, resampling is applied to the imbalanced training corpus to reduce the effect of the imbalance on classification performance. Experimental results show that the method achieves good classification performance on imbalanced text datasets.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2013, No. 20, pp. 118-121 (4 pages)
Funding: National Natural Science Foundation of China (No. 61173129)
Keywords: feature selection; CHI statistic; text categorization; imbalanced data; resampling
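
The abstract names two ingredients, a CHI-based feature score that distinguishes positive from negative term-category correlation, and resampling of the training corpus. It does not give the paper's actual formula or resampling scheme, so the following is only a minimal sketch under stated assumptions: `chi_square` is the standard CHI statistic from a 2x2 contingency table, `signed_chi_square` is one simple way to expose the correlation direction, and `random_oversample` is plain random oversampling chosen for illustration. All function names are hypothetical and none of this should be read as the authors' method.

```python
import random
from collections import Counter


def chi_square(a, b, c, d):
    """Standard CHI statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def signed_chi_square(a, b, c, d):
    """Assumed variant: attach the sign of (a*d - b*c) to the CHI score.

    Positive means the term co-occurs with the category more than chance,
    negative means it mostly appears outside the category.
    """
    score = chi_square(a, b, c, d)
    return score if a * d - b * c >= 0 else -score


def random_oversample(docs, labels, seed=0):
    """Assumed resampling step: duplicate minority-class documents at random
    until every class has as many documents as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [doc for doc, lab in zip(docs, labels) if lab == label]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(label)
    return out_docs, out_labels
```

With the signed score, a term that appears only inside a category and a term that appears only outside it receive scores of equal magnitude and opposite sign, whereas the standard statistic cannot tell them apart. The paper's class-discrimination strength term is not reproduced here because the abstract does not define it.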