Abstract
When handling highly imbalanced data, the cost-sensitive random forest algorithm suffers from two problems: bootstrap sampling leaves the minority class samples insufficiently learned, and the large proportion of majority class samples tends to weaken the cost-sensitive mechanism. To address this, a clustering-based weak-balance cost-sensitive random forest algorithm is proposed. The majority class samples are first clustered, and each cluster is then undersampled repeatedly under a weak balance criterion. The selected majority class samples are merged with the minority class samples of the original training set to generate multiple new imbalanced datasets, which are used to train cost-sensitive decision trees. The proposed algorithm not only allows the minority class samples to be fully learned, but also limits the influence of the majority class on the cost-sensitive mechanism by reducing the number of majority class samples. Experiments show that the proposed algorithm achieves better performance on highly imbalanced datasets.
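The dataset-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not define the weak balance criterion, so the sketch assumes it means keeping about `ratio` times as many majority samples as minority samples (with `ratio` > 1, so each subset stays mildly imbalanced). The function names `kmeans` and `weak_balance_subsets`, the parameters `k`, `ratio`, `n_subsets`, the use of k-means as the clustering method, and the proportional per-cluster quota are all assumptions made for illustration. Training the cost-sensitive decision trees is not shown.

```python
import random
from math import dist


def kmeans(points, k, iters=10, rng=None):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster happens to be empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def weak_balance_subsets(majority, minority, k=3, ratio=2.0, n_subsets=5, seed=0):
    """Build several mildly imbalanced training sets (labels: 0 majority, 1 minority).

    Assumption: "weak balance" is modelled as a target of ratio * len(minority)
    majority samples per subset, spread over the clusters by cluster size.
    """
    rng = random.Random(seed)
    clusters = kmeans(majority, k, rng=rng)
    target = min(len(majority), round(ratio * len(minority)))
    subsets = []
    for _ in range(n_subsets):
        picked = []
        for cl in clusters:
            # Each cluster contributes in proportion to its size, at least one sample.
            quota = max(1, round(target * len(cl) / len(majority)))
            picked.extend(rng.sample(cl, min(quota, len(cl))))
        # Every subset keeps all minority samples from the original training set.
        subsets.append([(x, 0) for x in picked] + [(x, 1) for x in minority])
    return subsets
```

Each returned subset would then be passed to one cost-sensitive decision tree learner, so that every tree sees all minority samples but only a cluster-representative fraction of the majority class.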
Authors
PING Rui (平瑞)
ZHOU Shuisheng (周水生)
LI Dong (李冬)
School of Mathematics and Statistics, Xidian University, Xi'an 710126
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Indexed in: EI, CSCD, Peking University Core Journals
2020, No. 3, pp. 249-257 (9 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 61772020).
Keywords
Imbalanced Data
Cluster Sampling
Cost Sensitive Learning
Random Forest