
Setting of Class Weights in Random Forest for Small-Sample Data
(随机森林针对小样本数据类权重设置)
Cited by: 19
Abstract: Random forest has been proved to be an efficient algorithm for classification and feature selection in bioinformatics. Although the effect of parameter settings on the results is limited, an appropriate group of parameters can yield excellent performance. This paper focuses on the setting of class weights in random forest to deal with the classification and feature selection problems of imbalanced small-sample data, such as those arising in cancer research, and investigates the optimal class weight. In order to compare feature-selection performance under different class weights, a Support Vector Machine (SVM) is also applied. The results show that the optimal class weight is variable and cannot be reduced to a single standard; however, several guidelines are summarized to help researchers choose weights with which both classification and feature selection achieve better performance.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2009, No. 26, pp. 131-134 (4 pages).
Funding: National Natural Science Foundation of China (No. 60234020); Beijing Natural Science Foundation (No. 4092021); Science and Technology Program of the Beijing Municipal Education Commission (No. JC002011200903).
Keywords: random forest; class weight; small sample; Support Vector Machine (SVM); feature selection
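The following is a minimal, illustrative sketch (in Python with scikit-learn, not the authors' original code) of the kind of experiment the abstract describes: trying several class weights in a random forest on a small, imbalanced sample, then checking the resulting feature ranking with a class-weighted SVM. The synthetic data, the weight grid, and the top-20 feature cutoff are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Small, imbalanced toy data set standing in for a small-sample cancer study
# (illustrative assumption, not the data used in the paper).
X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)

# Candidate weights for the minority class (class 1); the paper finds that the
# optimal weight varies between data sets, so several values are compared.
for w in [1, 2, 5, 10]:
    rf = RandomForestClassifier(n_estimators=500,
                                class_weight={0: 1, 1: w},
                                random_state=0)
    rf_acc = cross_val_score(rf, X, y, cv=5).mean()

    # Feature selection: rank features by the forest's importance scores and
    # evaluate the top-ranked subset with a class-weighted SVM, mirroring the
    # comparison of feature-selection quality under different class weights.
    rf.fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:20]
    svm = SVC(class_weight={0: 1, 1: w})
    svm_acc = cross_val_score(svm, X[:, top], y, cv=5).mean()

    print(f"minority weight {w}: RF CV acc = {rf_acc:.3f}, "
          f"SVM on top-20 RF features = {svm_acc:.3f}")
```

In practice the weight grid and the number of selected features would be tuned for each data set, since, as the paper concludes, no single class weight is optimal across data sets.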

References (14)

  • 1. Breiman L. Random forests [J]. Machine Learning, 2001, 45: 5-32.
  • 2. Stolfo S J, Fan D W, Lee W, et al. Credit card fraud detection using meta-learning: Issues and initial results [C]//AAAI-97 Workshop on AI Methods in Fraud and Risk Management, 1997.
  • 3. Pednault E P D, Rosen B K, Apte C. Handling imbalanced data sets in insurance risk modeling, Technical Report RC-21731 [R]. IBM Research Report, 2000-03.
  • 4. Batista G E A P A, Bazzan A L C. Balancing training data for automated annotation of keywords: A case study [C]//Proc of the Second Brazilian Workshop on Bioinformatics, SBC, 2003.
  • 5. Kubat M, Matwin S. Addressing the curse of imbalanced training sets: One-sided selection [C]//Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, 1997: 179-186.
  • 6. Breiman L, Friedman J. Classification and Regression Trees [M]. [S.l.]: Wadsworth, 1984.
  • 7. 张启蕊, 张凌, 董守斌, 谭景华. 训练集类别分布对文本分类的影响 (The effect of training-set class distribution on text classification) [J]. 清华大学学报(自然科学版), 2005, 45(S1): 1802-1805. (Cited by: 26)
  • 8. Liu X Y, Wu J. Exploratory under-sampling for class-imbalance learning [C]//Proceedings of the 6th IEEE International Conference on Data Mining (ICDM'06), Hong Kong, China, 2006.
  • 9. Chawla N V, Bowyer K W. SMOTE: Synthetic minority over-sampling technique [J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
  • 10. Chen C, Liaw A, Breiman L. Using random forest to learn imbalanced data, Technical Report 666 [R]. Statistics Department, University of California at Berkeley, 2003.

Secondary References (2)

  • 1. Hull D A. Improving text retrieval for the routing problem using latent semantic indexing [C]//Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
  • 2. Sebastiani F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002.

