Imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization

Cited by: 6
Abstract: Embedding cost-sensitive learning and resampling in an ensemble algorithm is an effective hybrid strategy for imbalanced data classification. Existing hybrid methods, however, seldom take the intra-class and inter-class distributions of samples into account when computing misclassification costs and when undersampling. To address this problem, an imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization is proposed, named DPBCPUSBoost (Boosting algorithm based on Ball Cluster Partitioning and UnderSampling with Density Peak optimization). First, density peak information is used to define the sampling weights of the majority-class samples; every majority-class ball cluster that has a "neighbor cluster" is divided into an "easy-to-misclassify region" and a "hard-to-misclassify region", and the sampling weights of the samples in the "easy-to-misclassify region" are increased. Second, in the first iteration the majority-class samples are undersampled according to these sampling weights, and in each subsequent iteration they are undersampled according to the sample distribution weights; the undersampled majority-class samples are combined with all minority-class samples into a temporary training set, on which a weak classifier is trained. Finally, the density peak information of the samples is combined with their class distribution to define a different misclassification cost for every sample, and a cost adjustment function increases the weights of samples with high misclassification costs. Experiments on 10 KEEL datasets compare DPBCPUSBoost with existing imbalanced data classification algorithms, including Adaptive Boosting (AdaBoost), Cost-sensitive AdaBoost (AdaCost), Random UnderSampling Boosting (RUSBoost), and UnderSampling and Cost-sensitive Boosting (USCBoost). On each of four metrics, Accuracy, F1-Score, Geometric Mean (G-mean), and Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), DPBCPUSBoost achieves the best result on more datasets than any of the compared algorithms. These results verify the effectiveness of the misclassification costs and sampling weights defined in DPBCPUSBoost.
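The undersampling step described in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian-kernel local density in the style of density peak clustering and uses that density directly as a placeholder sampling weight; the abstract does not give the paper's actual weight definition, the ball cluster partitioning, or the cost adjustment function, so none of those are reproduced here. The function names `local_density` and `undersample_majority`, the cutoff `dc`, and the label convention (0 for majority, 1 for minority) are hypothetical.

```python
import numpy as np


def local_density(X, dc):
    """Gaussian-kernel local density of each row of X (assumed form)."""
    # Pairwise squared Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Sum the kernel over all points and subtract the self-term exp(0) = 1.
    return np.exp(-d2 / dc ** 2).sum(axis=1) - 1.0


def undersample_majority(X, y, weights, n_keep, rng):
    """Build a temporary training set.

    Keeps every minority sample (y == 1) and draws n_keep majority
    samples (y == 0) without replacement, with probability proportional
    to the given per-sample weights.
    """
    maj = np.flatnonzero(y == 0)
    mino = np.flatnonzero(y == 1)
    p = weights[maj] / weights[maj].sum()
    chosen = rng.choice(maj, size=n_keep, replace=False, p=p)
    idx = np.concatenate([chosen, mino])
    return X[idx], y[idx]


# Toy data: 40 majority samples and 8 minority samples in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(3.0, 0.5, size=(8, 2))])
y = np.array([0] * 40 + [1] * 8)

# Placeholder weights: local density among majority samples only.
# This is an assumption for illustration, not the paper's definition.
weights = np.zeros(len(y))
weights[y == 0] = local_density(X[y == 0], dc=1.0) + 1e-12

X_tmp, y_tmp = undersample_majority(X, y, weights, n_keep=8, rng=rng)
print(X_tmp.shape, int((y_tmp == 0).sum()), int((y_tmp == 1).sum()))
```

In a boosting loop of the kind the abstract outlines, a weak classifier would be fitted on `X_tmp, y_tmp` in each round, with `weights` replaced by the current sample distribution weights after the first round.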
Authors: LIU Xuewen (刘学文); WANG Jikui (王继奎); YANG Zhengguo (杨正国); LI Qiang (李强); YI Jihai (易纪海); LI Bing (李冰); NIE Feiping (聂飞平). Affiliations: School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou, Gansu 730020, China; Key Laboratory of E-Business Technology and Application of Gansu Province (Lanzhou University of Finance and Economics), Lanzhou, Gansu 730020, China; Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2022, No. 5, pp. 1455-1463 (9 pages)
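Funding: National Natural Science Foundation of China (61772427); Innovation Capability Improvement Project for Higher Education Institutions of Gansu Province (2021B-145, 2021B-147); Natural Science Foundation of Gansu Province (17JR5RA177); Key Research and Development Program of Gansu Province (21YF5FA087).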
Keywords: imbalanced data classification; density peak; ball clustering; cost-sensitive; undersampling

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部