Abstract
Traditional sampling methods offer limited accuracy and robustness on imbalanced data: under-sampling tends to discard important sample information, while oversampling tends to introduce redundant information. Taking the imbalanced Pima-Indians dataset from the UCI public datasets as an example, this paper considers the relationship among the between-class distance, the within-class distance, and the degree of imbalance of the positive and negative samples, and proposes a new oversampling method based on sample characteristics. First, the original dataset is divided into distance bands. Then an improved adaptive variable-neighborhood SMOTE algorithm based on sample characteristics is proposed to synthesize new samples from the minority-class samples in each distance band, and the method is extended to five other imbalanced UCI datasets. Finally, experiments with an SVM classifier show that, on all six imbalanced datasets, the new oversampling SVM algorithm clearly improves the classification accuracy of minority (and majority) class samples compared with existing sampling methods, and is more robust.
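For orientation, the baseline SMOTE interpolation step that the method builds on can be sketched as follows. This is only a minimal illustration of standard SMOTE, not the authors' algorithm: the abstract does not specify how the distance bands are computed or how the neighborhood size adapts, so a fixed `k` is assumed here, and the function names `k_nearest` and `smote` are hypothetical.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    others.sort(key=lambda c: math.dist(point, c))
    return others[:k]


def smote(minority, n_new, k=5, seed=0):
    """Baseline SMOTE (assumed fixed k): each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (made-up values, not taken from Pima-Indians).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2)
```

Under the approach described in the abstract, a step like this would presumably be applied separately to the minority samples of each distance band, with the neighborhood size chosen per band rather than fixed.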
Authors
HUANG Hai-song (黄海松); WEI Jian-an (魏建安); KANG Pei-dong (康佩栋)
Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550025, China
Source
Control and Decision (《控制与决策》)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
2018, Issue 9, pp. 1549-1558 (10 pages)
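Funding
Guizhou Key Industrial Research Project (黔科合GZ字[2015]3009)
Natural Science Foundation of Guizhou Province (黔科合J字[2015]2043)
Major Special Project of Guizhou Province (黔科合JZ字[2014]2001)
Project of the Education Department of Guizhou Province (黔教合协同创新字[2015]02)
Graduate Innovation Fund of Guizhou University (研理工2017037)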