Abstract
When handling imbalanced data, traditional undersampling methods consider only the absolute positions of majority-class samples and ignore their relative positions, so the resulting balanced dataset suffers from blurred class boundaries. This paper proposes an undersampling algorithm for imbalanced data based on improved K-means clustering (UD-PK). The algorithm first uses an improved PSO algorithm to iteratively search for a global optimum, which serves as the initial value required by K-means clustering, and then clusters the data with K-means. Next, the number of majority-class samples to draw from each cluster is set according to the ratio of majority-class to minority-class samples in that cluster, and majority-class samples are selected by their distance to the cluster center to construct the balanced dataset. Comparative experiments on UCI datasets show that the algorithm improves minority-class accuracy considerably over several classical algorithms.
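The cluster-then-select stage described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the improved PSO step is not reproduced (the caller supplies `init_centers` in its place), the per-cluster quota is assumed to equal the number of minority samples in that cluster, and "selecting by distance" is assumed to mean keeping the majority samples nearest the cluster center. The function names `kmeans` and `undersample` and the demo data are hypothetical.

```python
import math


def kmeans(points, centers, iters=50):
    """Plain Lloyd's K-means starting from caller-supplied initial centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


def undersample(points, classes, init_centers, minority=1):
    """Return sorted indices of the samples kept in the balanced dataset."""
    labels, centers = kmeans(points, init_centers)
    # Every minority-class sample is kept.
    keep = [i for i, c in enumerate(classes) if c == minority]
    for k, center in enumerate(centers):
        in_cluster = [i for i, lab in enumerate(labels) if lab == k]
        majority_idx = [i for i in in_cluster if classes[i] != minority]
        n_minority = len(in_cluster) - len(majority_idx)
        # ASSUMPTION: quota = minority count in this cluster (not the paper's formula).
        quota = min(len(majority_idx), n_minority)
        # ASSUMPTION: prefer majority samples closest to the cluster center.
        majority_idx.sort(key=lambda i: math.dist(points[i], center))
        keep.extend(majority_idx[:quota])
    return sorted(keep)


# Hypothetical toy data: two well-separated groups, class 1 is the minority.
demo_points = [(0, 0), (0.2, 0), (1, 0), (0, 1), (0.5, 0.5),
               (10, 10), (10.2, 10), (11, 10), (10.5, 10.5)]
demo_classes = [0, 0, 0, 0, 1, 0, 0, 0, 1]
demo_kept = undersample(demo_points, demo_classes, [(0, 0), (10, 10)])
print(demo_kept)
```

In the paper's pipeline, the initial centers would come from the improved PSO search rather than being fixed by hand; any other quota rule based on the per-cluster class ratio can be substituted on the line marked as an assumption.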
Authors
于艳丽
江开忠
王珂
盛静文
YU Yan-li; JIANG Kai-zhong; WANG Ke; SHENG Jing-wen (School of Mathematics, Physics & Statistics, Shanghai University of Engineering Science; School of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China)
Source
《软件导刊》
2020, Issue 6, pp. 205-209 (5 pages)
Software Guide