
Improved Random Forest Imbalance Data Classification Algorithm Combining Cascaded Up-sampling and Down-sampling
Cited by: 9
Abstract: Data imbalance seriously degrades the performance of traditional classification algorithms, and imbalanced data classification is a hot and difficult problem in machine learning. To improve the detection rate of minority-class samples in imbalanced data sets, this paper proposes an improved random forest algorithm. The core of the algorithm is to apply hybrid sampling to the data set of each random forest subtree after Bootstrap sampling. First, inverse-weight up-sampling based on a Gaussian mixture model is applied; then cascaded up-sampling based on the SMOTE-borderline1 algorithm is carried out; next, random down-sampling is applied, yielding a balanced training subset for each subtree. Finally, decision trees are used as base learners to implement the improved random forest algorithm for imbalanced data classification. With G-mean and AUC as evaluation metrics, the proposed algorithm is compared with 10 different algorithms on 15 public data sets, and it ranks first in both average ranking and average value on the two metrics. Furthermore, it is compared with 6 state-of-the-art algorithms on 9 of these data sets: in 32 comparisons, the proposed algorithm outperforms the other algorithms 28 times. The experimental results show that the proposed algorithm helps improve the detection rate of the minority class and has better classification performance.
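The per-subtree hybrid sampling described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract gives no details of the Gaussian-mixture inverse-weight stage, so that stage is omitted here, and only a simplified Borderline-SMOTE1-style up-sampling followed by random down-sampling is shown. The function names, the `up_ratio` parameter, the fallback when no borderline points exist, and the redraw of bootstraps that miss a class are all assumptions made for illustration. A small `g_mean` helper shows the G-mean metric (geometric mean of the true positive rate and the true negative rate).

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def borderline_smote1(minority, majority, n_new, k=5, rng=random):
    """Simplified Borderline-SMOTE1: interpolate only from 'danger' minority points."""
    everyone = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    danger = []
    for p in minority:
        # k nearest neighbours of p among all other samples (both classes)
        near = sorted((q for q in everyone if q[0] is not p),
                      key=lambda q: dist2(p, q[0]))[:k]
        n_maj = sum(1 for _, label in near if label == 0)
        # 'danger': at least half, but not all, neighbours are majority
        if near and len(near) / 2 <= n_maj < len(near):
            danger.append(p)
    seeds = danger or minority  # assumption: fall back to all minority points
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(seeds)
        # nearest minority neighbours of the seed point
        mates = sorted((q for q in minority if q is not p),
                       key=lambda q: dist2(p, q))[:k] or [p]
        q = rng.choice(mates)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic

def balanced_bootstrap(X, y, up_ratio=0.5, k=5, seed=0):
    """Build one subtree's training subset (labels: 1 = minority, 0 = majority).

    up_ratio is a hypothetical knob: the fraction of the class gap closed by
    up-sampling; the rest is closed by random down-sampling of the majority.
    Assumes y contains both classes.
    """
    rng = random.Random(seed)
    while True:  # assumption: redraw bootstraps that miss a class entirely
        idx = rng.choices(range(len(X)), k=len(X))  # Bootstrap sample
        minority = [X[i] for i in idx if y[i] == 1]
        majority = [X[i] for i in idx if y[i] == 0]
        if minority and majority:
            break
    gap = len(majority) - len(minority)
    n_new = max(0, int(up_ratio * gap))
    synthetic = borderline_smote1(minority, majority, n_new, k, rng)
    # Random down-sampling of the majority class to the new minority size
    target = min(len(majority), len(minority) + n_new)
    kept_major = rng.sample(majority, target)
    Xb = minority + synthetic + kept_major
    yb = [1] * (len(minority) + n_new) + [0] * target
    return Xb, yb

def g_mean(y_true, y_pred):
    """Geometric mean of TPR and TNR; assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return math.sqrt((tp / pos) * (tn / neg))
```

In a full forest, one such subset would be drawn per tree (for example with a different `seed` each time), a decision tree would be trained on it, and the trees' votes would be combined. Because every tree sees a class-balanced subset, the minority class is not swamped during split selection, which is the effect the abstract attributes to the hybrid sampling.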
Authors: ZHENG Jian-hua (郑建华); LI Xiao-min (李小敏); LIU Shuang-yin (刘双印); LI Di (李迪). Affiliations: College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China; Guangdong Engineering & Technology Research Center for Smart Agriculture, Guangzhou 510225, China; College of Mechanical and Electrical Engineering, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China; School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510640, China
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2021, No. 7, pp. 145-154 (10 pages)
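Funding: National Key R&D Program of China (2018YFB1700500); National Natural Science Foundation of China (61471133, 61871475); Science and Technology Planning Project of Guangdong Province (2017A070712019, 2017B010126001, 2020A1414050062); Project of the Department of Education of Guangdong Province (2016KZDXM001, 2017GCZX001, 2020KZDZX1121); Science and Technology Planning Project of Guangzhou (201704030098).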
Keywords: Cascaded up-sampling; Random forest; Imbalance data; Classification algorithm
