Boundary Mixed Resampling Based on Joint Entropy for Imbalanced Data

Abstract: Single resampling methods used to balance data tend to generate redundant samples or to delete samples that carry important information. To overcome these limitations, this paper proposes a boundary mixed resampling algorithm for imbalanced data based on joint entropy. The algorithm first introduces a boundary factor to separate the boundary set from the non-boundary set. It then constructs a joint entropy indicator system to assess the importance of minority class samples in the boundary set, and applies different oversampling methods and sampling quantities to the resulting subgroups of minority class samples according to their importance. Finally, it uses the NearMiss-2 algorithm to select and delete majority class samples in the non-boundary set, bringing the data into relative balance. Comparative experiments on nine UCI datasets show that the algorithm improves on all three metrics, F1-Score, G-mean, and AUC, which validates its effectiveness and indicates good classification performance on imbalanced data.
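
The abstract names three standard building blocks without giving formulas: joint entropy, NearMiss-2 undersampling, and the G-mean metric. The following is a minimal Python sketch of their textbook definitions only, not the paper's implementation. The function names (`joint_entropy`, `nearmiss2_indices`, `g_mean`), the default `k=3`, and the use of Euclidean distance are assumptions made for illustration. The paper's boundary factor and its joint entropy indicator system are not specified in the abstract and are therefore not reproduced here.

```python
import numpy as np
from collections import Counter


def joint_entropy(x, y):
    """Shannon joint entropy H(X, Y) in bits for two discrete sequences."""
    counts = Counter(zip(x, y))
    n = sum(counts.values())
    probs = np.array([c / n for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)))


def nearmiss2_indices(X_maj, X_min, n_keep, k=3):
    """Indices of the n_keep majority samples with the smallest mean
    distance to their k farthest minority samples (NearMiss-2 rule).
    Euclidean distance and k=3 are illustrative assumptions."""
    X_maj = np.asarray(X_maj, dtype=float)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    k = min(k, X_min.shape[0])
    # Sort each row ascending and average the k largest distances.
    farthest = np.sort(dists, axis=1)[:, -k:]
    scores = farthest.mean(axis=1)
    return np.argsort(scores, kind="stable")[:n_keep]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    sensitivity = np.mean(y_pred[pos] == positive)
    specificity = np.mean(y_pred[~pos] != positive)
    return float(np.sqrt(sensitivity * specificity))
```

In the pipeline described by the abstract, a selection rule like `nearmiss2_indices` would be applied only to majority class samples in the non-boundary set, so that samples near the class boundary are not removed. How that non-boundary set is obtained depends on the boundary factor defined in the full paper.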

Authors: ZHOU Chuanhua (周传华); REN Taijiao (任太娇); LUO Lan (罗岚); ZHOU Hao (周昊) (School of Management Science and Engineering, Anhui University of Technology, Ma'anshan 243032, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)

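Affiliations: School of Management Science and Engineering, Anhui University of Technology; School of Computer Science and Technology, University of Science and Technology of China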

Source: Computer and Modernization (《计算机与现代化》), 2024, No. 9, pp. 95-100, 113 (7 pages in total)

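Funding: National Natural Science Foundation of China (71772002, 61702006); Anhui Higher Education Key Laboratory of Multidisciplinary Management and Control of Complex Systems (CS2020-04).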

Keywords: imbalanced data classification; boundary factor; joint entropy; mixed sampling

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]
