Abstract
Uneven distribution between classes is the main reason that classifiers perform poorly on imbalanced data sets. To overcome this imbalance, this paper proposes a classification algorithm for imbalanced data based on the class labels of neighboring samples. For each sample to be classified, the algorithm first finds its k nearest neighbors and then assigns the sample to the most common class among those k neighbors. Because the class decision considers only the labels of a small number of neighboring samples rather than those of all training samples, the algorithm is less affected by class imbalance when classifying minority-class samples. Simulation experiments on a customer churn data set show that the proposed algorithm handles imbalanced data classification well.
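The decision rule described in the abstract can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `knn_predict`, the use of Euclidean distance, and the toy "stay"/"churn" data are all assumptions introduced here, since the abstract does not specify the distance measure or the data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Assign `query` to the most common class among its k nearest neighbors.

    Hypothetical helper: Euclidean distance is an assumption; the abstract
    does not say which distance measure the paper uses.
    """
    # Pair each training sample with its label and sort by distance to the query.
    ranked = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    # Count the labels of the k closest samples only.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up imbalanced toy data: 8 majority ("stay") and 2 minority ("churn") samples.
train_X = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.9), (1.0, 1.0),
    (0.8, 0.3), (0.3, 0.7), (1.2, 0.4), (0.6, 1.1),
    (5.0, 5.0), (5.2, 4.8),
]
train_y = ["stay"] * 8 + ["churn"] * 2

# A query close to the minority samples, decided from 3 neighbors only.
minority_query = (4.9, 5.1)
local_label = knn_predict(train_X, train_y, minority_query, k=3)

# The same query decided from all 10 training samples.
global_label = knn_predict(train_X, train_y, minority_query, k=len(train_X))
```

In this toy setting, a small k lets the two nearby minority samples outvote the single nearest majority sample, whereas voting over the whole training set always returns the majority class. That contrast is the intuition the abstract gives for restricting the decision to a few neighbors.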
Source
Bulletin of Science and Technology (《科技通报》)
Peking University Core Journal (北大核心)
2013, Issue 10, pp. 58-60 (3 pages)
Keywords
imbalanced data sets
neighboring samples
data classification
minority class