Abstract
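Building on an analysis of re-sampling techniques, we design and implement a hybrid re-sampling algorithm with adaptive nearest-neighbor selection. The method combines the strengths of over-sampling and under-sampling, and addresses two weaknesses of the SMOTE over-sampling algorithm: its blindness when generating synthetic samples, and its restriction to numeric attributes. The new algorithm adapts its neighbor-selection strategy to the actual distribution within the sample set and applies a different replication method to each attribute type when generating new minority-class instances, which controls and improves the quality of the synthetic samples. An improved neighborhood cleaning method then applies a moderate degree of under-sampling to the augmented data set, removing redundant majority-class instances and noisy data on the class boundary. This reduces the size of the majority class and brings the classes into relative balance, so that imbalanced-data classification problems can be handled effectively and classifier performance improved.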
Based on an analysis of re-sampling techniques, a novel hybrid re-sampling technique based on Automated Adaptive Selection of the Number of Nearest Neighbors (ADSNNHRS) is proposed. The method combines the advantages of an improved Synthetic Minority Over-sampling Technique (SMOTE) with the neighborhood cleaning rule (NCL) data cleaning method. In over-sampling, the original SMOTE method adds synthetic minority-class examples blindly, by randomly interpolating between pairs of closest neighbors, and cannot handle data sets with nominal features. Both problems are addressed by automated adaptive selection of nearest neighbors and by adjusting the neighbor-selection strategy, so that the quality of the new samples can be well controlled. In under-sampling, an improved neighborhood cleaning rule removes borderline majority-class examples as well as noisy or redundant data. The main motivation behind these methods is not only to balance the training data but also to remove noisy examples lying on the wrong side of the decision border. Removing noisy examples may help in finding better-defined class clusters, allowing simpler models with better generalization capabilities, and thereby promising effective processing of imbalanced data sets (IDS) and considerably enhanced classifier performance.
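The two stages described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's ADSNNHRS algorithm: the abstract does not specify the adaptive neighbor-selection rule or the improvements to NCL, so the sketch uses a fixed `k`, SMOTE-NC-style majority voting for nominal attributes, and the basic neighborhood cleaning rule. The function names (`distance`, `nearest`, `oversample`, `clean`) and the toy data are hypothetical, and at least two minority examples are assumed.

```python
import random
from collections import Counter

def distance(a, b, nominal):
    """Squared Euclidean on numeric attributes plus 1 per nominal mismatch."""
    return sum((x != y) if i in nominal else (x - y) ** 2
               for i, (x, y) in enumerate(zip(a, b)))

def nearest(idx, rows, k, nominal):
    """Indices of the k rows closest to rows[idx], excluding idx itself."""
    others = [j for j in range(len(rows)) if j != idx]
    return sorted(others, key=lambda j: distance(rows[idx], rows[j], nominal))[:k]

def oversample(minority, n_new, k=3, nominal=frozenset(), seed=0):
    """SMOTE-style synthesis: interpolate numeric values, vote on nominal ones."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        nbrs = nearest(i, minority, k, nominal)
        mate = minority[rng.choice(nbrs)]
        gap = rng.random()  # position along the segment between the two examples
        row = []
        for a in range(len(minority[i])):
            if a in nominal:
                # Assumption: nominal value = most frequent value among neighbors.
                votes = Counter(minority[j][a] for j in nbrs)
                row.append(votes.most_common(1)[0][0])
            else:
                row.append(minority[i][a] + gap * (mate[a] - minority[i][a]))
        synthetic.append(tuple(row))
    return synthetic

def clean(rows, labels, minority_label, k=3, nominal=frozenset()):
    """NCL-style cleaning: only majority examples are ever removed."""
    drop = set()
    for i in range(len(rows)):
        nbrs = nearest(i, rows, k, nominal)
        vote = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
        if vote == labels[i]:
            continue
        if labels[i] != minority_label:
            drop.add(i)  # majority example contradicted by its neighbors
        else:
            # minority example misclassified: remove its majority neighbors
            drop.update(j for j in nbrs if labels[j] != minority_label)
    keep = [i for i in range(len(rows)) if i not in drop]
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: attribute 0 is numeric, attribute 1 is nominal; label 1 is the minority.
minority = [(1.0, "a"), (1.2, "a"), (0.9, "b")]
majority = [(5.0, "b"), (5.2, "b"), (4.8, "a"), (5.1, "b"), (1.1, "a"), (4.9, "b")]
new = oversample(minority, n_new=3, k=2, nominal={1})
rows = minority + new + majority
labels = [1] * (len(minority) + len(new)) + [0] * len(majority)
rows, labels = clean(rows, labels, minority_label=1, k=3, nominal={1})
```

In this toy run the majority example at `(1.1, "a")` sits inside the minority cluster, so the cleaning step removes it while leaving every minority example in place. Over-sampling first and cleaning second follows the order given in the abstract.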
Source
《武汉理工大学学报》
CAS
CSCD
Peking University Core Journals (北大核心)
2010, No. 20, pp. 55-60 (6 pages)
Journal of Wuhan University of Technology
Funding
National High Technology Research and Development Program of China (863 Program) (2009AA12Z117)
Xiangfan University Planning Project (2009YA012)
Keywords
imbalanced data sets
re-sampling
machine learning
classification