A New Algorithm for Imbalanced Data Based on Twice Random Forest (original title: 基于二次随机森林的不平衡数据分类算法)

Cited by: 3
Abstract: Classification of imbalanced data sets is a hot topic in machine learning. Traditional classifiers are trained to maximize overall accuracy, which tends to lower recognition of the minority class. This paper first reviews the research difficulties and recent progress in imbalanced data classification and discusses the evaluation metrics used for classification algorithms. It then proposes a new imbalanced data classification algorithm based on applying random forest twice, named TRF. First, a random forest is trained on the training samples to locate the fuzzy boundary; the majority-class samples that it misclassifies as minority class are removed, which changes the structure of the original training set and yields a new training set. A random forest is then trained again on this new training set to obtain the final classification model. Experiments on UCI data sets show that TRF improves minority-class recall and F-measure on imbalanced data.
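The two-pass procedure in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the first forest is scored on its own training samples, so the sketch assumes k-fold held-out predictions. The classifier is supplied through a hypothetical factory `make_model`, so any random forest implementation with `fit` and `predict` can be plugged in. The names `two_pass_fit` and `minority_recall_f1` are illustrative, and the F-measure is assumed to be F1.

```python
import random


def two_pass_fit(make_model, X, y, majority_label, k=5, seed=0):
    """Two-pass training sketch: filter misjudged majority samples, then retrain.

    make_model is a hypothetical zero-argument factory returning an unfitted
    classifier with fit(X, y) and predict(X), for example a random forest.
    Returns (second-pass model, indices of the training samples that were kept).
    """
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]

    # Pass 1 (assumption: k-fold held-out predictions for every sample).
    predicted = [None] * n
    for fold in folds:
        if not fold:
            continue
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = make_model()
        model.fit([X[i] for i in train], [y[i] for i in train])
        for i, label in zip(fold, model.predict([X[i] for i in fold])):
            predicted[i] = label

    # Remove majority-class samples that pass 1 misclassified; keep the rest.
    kept = [i for i in range(n)
            if not (y[i] == majority_label and predicted[i] != y[i])]

    # Pass 2: train again on the filtered training set.
    final = make_model()
    final.fit([X[i] for i in kept], [y[i] for i in kept])
    return final, kept


def minority_recall_f1(y_true, y_pred, minority_label):
    """Recall and F1 of the minority class (F1 assumed for 'F-measure')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == minority_label and p == minority_label for t, p in pairs)
    fn = sum(t == minority_label and p != minority_label for t, p in pairs)
    fp = sum(t != minority_label and p == minority_label for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, f1
```

With scikit-learn installed, `make_model` could be `lambda: RandomForestClassifier(n_estimators=100)`. Out-of-bag predictions from a single forest would be another plausible way to score the first pass.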
Authors: 刘学 (Liu Xue), 张素伟 (Zhang Suwei)
Source: 《软件》 (Software), 2016, No. 7, pp. 75-79 (5 pages)
Keywords: pattern recognition; imbalanced data; random forest; fuzzy boundary