
New feature selection approach for imbalanced text classification (cited by: 8)
Abstract: When text classification is performed on imbalanced data sets, traditional feature selection methods tend to bias the classifier toward large categories and neglect small ones. To address this problem, this paper proposes a new feature selection method, IPR (integrated probability ratio). The method takes into account how each feature is distributed across the positive and negative classes and combines four indicators of feature-category relevance to score feature terms. It handles the poor fit of traditional feature selection methods to imbalanced data sets better, improving the recognition rate of small categories without reducing classification performance on large categories. Experimental results show that the method is effective and feasible.
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2011, No. 12, pp. 4532-4534 (3 pages)
Funding: Supported by the Graduate Innovation Fund of the Central Universities (CDJXS11180013)
Keywords: imbalanced data sets; text classification; feature selection; positive class; negative class
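
The abstract says that feature terms are scored from their distribution across the positive and negative classes, but it does not give the IPR formula or name the four indicators it combines. The sketch below is therefore not IPR. It is a minimal illustration of the general idea, assuming a single, generic indicator: the absolute log ratio of smoothed class-conditional document frequencies. The function names, the `smoothing` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def probability_ratio_scores(docs, labels, smoothing=1.0):
    """Score terms by |log(P(t|pos) / P(t|neg))| over document frequencies.

    Assumption: this is a generic probability-ratio score, not the IPR
    formula from the paper, which the abstract does not specify.
    docs   -- list of token lists
    labels -- list of 1 (positive, e.g. the small class) or 0 (negative)
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos

    # Count, per class, how many documents contain each term.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y == 1 else neg_df).update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        # Add-one style smoothing keeps the ratio finite when a term
        # never occurs in one of the classes.
        p_pos = (pos_df[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[term] + smoothing) / (n_neg + 2 * smoothing)
        scores[term] = abs(math.log(p_pos / p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: one positive document against four negative ones.
toy_docs = [
    ["goal", "match"],
    ["stock", "market"],
    ["stock", "price"],
    ["market", "price"],
    ["stock", "match"],
]
toy_labels = [1, 0, 0, 0, 0]
toy_scores = probability_ratio_scores(toy_docs, toy_labels)
top_terms = select_top_k(toy_scores, 2)
```

Because each probability is normalized by the size of its own class, a term concentrated in the small class can still rank highly even though its raw count is low. This is the kind of behaviour the abstract attributes to considering both class distributions.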
