Abstract
To overcome the limitations of single resampling methods in balancing data, which tend to generate redundant samples and to delete important sample information by mistake, this paper proposes a boundary mixed resampling algorithm for imbalanced data based on joint entropy. The algorithm first introduces a boundary factor to distinguish the boundary set from the non-boundary set. It then constructs a joint entropy index system to assess the importance of the minority-class samples in the boundary set and, according to that importance, applies different oversampling methods and sampling quantities to the subdivided minority-class samples. Finally, the NearMiss-2 algorithm is used to select and remove majority-class samples in the non-boundary set, bringing the data into relative balance. Comparative experiments on nine UCI datasets show that the proposed algorithm improves F1-Score, G-mean, and AUC, which validates its effectiveness and indicates good classification performance on imbalanced data.
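The final undersampling step and the G-mean metric named in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of NearMiss-2 (keep the majority-class points whose mean distance to their k farthest minority-class points is smallest) and the usual definition of G-mean (geometric mean of sensitivity and specificity). The function names `near_miss_2` and `g_mean`, the parameters `n_keep` and `k`, and the toy data are hypothetical and are not taken from the paper. The boundary factor and the joint entropy index are not defined in the abstract, so they are not sketched here.

```python
import math


def near_miss_2(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k farthest minority points is smallest (NearMiss-2 rule).

    This is a plain-Python illustration, not the authors' implementation.
    """
    if not minority:
        raise ValueError("minority must not be empty")
    k = min(k, len(minority))
    scored = []
    for idx, point in enumerate(majority):
        # Distances to every minority point, largest first.
        dists = sorted((math.dist(point, m) for m in minority), reverse=True)
        scored.append((sum(dists[:k]) / k, idx))
    scored.sort()  # smallest mean distance first, ties broken by index
    kept = sorted(idx for _, idx in scored[:n_keep])
    return [majority[i] for i in kept]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): two minority points, four majority points.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 1.0), (5.0, 5.0), (0.5, -1.0), (10.0, 10.0)]
kept = near_miss_2(majority, minority, n_keep=2, k=2)
score = g_mean([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the two majority points nearest the minority cluster are retained and the two distant ones are dropped. In the paper's setting the rule is applied only to majority-class samples in the non-boundary set.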
Authors
周传华
任太娇
罗岚
周昊
ZHOU Chuanhua; REN Taijiao; LUO Lan; ZHOU Hao (School of Management Science and Engineering, Anhui University of Technology, Ma'anshan 243032, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Source
《计算机与现代化》 (Computer and Modernization)
2024, No. 9, pp. 95-100, 113 (7 pages)
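Funding
National Natural Science Foundation of China (71772002, 61702006)
Key Laboratory of Multidisciplinary Management and Control of Complex Systems of Anhui Higher Education Institutes (CS2020-04).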
Keywords
imbalanced data classification
boundary factor
joint entropy
mixed sampling