Abstract
The classification performance of K-Nearest Neighbor (KNN) depends on the quality of the training set, so designing an efficient training set optimization algorithm is of practical importance. Traditional evolutionary training set optimization algorithms suffer from two major drawbacks: low efficiency and the mistaken removal of non-noise samples and features. To address these issues, this paper proposes a genetic training set optimization algorithm. The algorithm employs an efficient genetic algorithm based on the maximum Hamming distance: each crossover preserves the parents and generates two new children with the maximum Hamming distance, which improves efficiency while maintaining population diversity. The algorithm combines a local noise-sample deletion strategy with a feature selection strategy. First, a decision tree is used to determine the range in which noise samples exist; then the genetic algorithm precisely removes the noise samples within this range together with the globally noisy features, which reduces the mistaken-removal rate and improves efficiency. In addition, a validation-set selection strategy based on the nearest-neighbor (1NN) rule further improves the accuracy of the genetic algorithm's instance and feature selection. On 15 standard datasets, compared with the co-evolutionary instance and feature selection algorithm IFS-CoCo, the weighted co-evolutionary instance and feature selection algorithm CIW-NN, the evolutionary feature selection algorithm EIS-RFS, the evolutionary instance selection algorithm PS-NN, and the traditional KNN, the proposed algorithm improves classification accuracy by 2.18%, 2.06%, 5.61%, 4.06%, and 4.00% on average, respectively. The experimental results show that the proposed method outperforms current evolutionary training set optimization algorithms in both classification accuracy and optimization efficiency.
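The abstract states that each crossover preserves the parents and produces two children separated by the maximum Hamming distance. A minimal sketch of one plausible reading of this operator follows; it assumes binary chromosomes and assigns the two children complementary bits at every locus (so their mutual Hamming distance equals the chromosome length), inheriting the parents' bits where the parents disagree. The function names are illustrative, not from the paper.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def max_hamming_crossover(p1, p2):
    """Sketch: produce two children with maximal mutual Hamming distance.

    Where the parents differ, each child inherits one parent's bit;
    where the parents agree, one child keeps the shared bit and the
    other receives its complement (which child is flipped is random).
    The parents themselves are left untouched, matching the abstract's
    'preserve the parents' description.
    """
    c1, c2 = [], []
    for b1, b2 in zip(p1, p2):
        if b1 != b2:
            # Parents disagree: children inherit opposite parental bits.
            c1.append(b1)
            c2.append(b2)
        else:
            # Parents agree: force the children apart at this locus.
            if random.random() < 0.5:
                c1.append(b1)
                c2.append(1 - b1)
            else:
                c1.append(1 - b1)
                c2.append(b1)
    return c1, c2

# Example: the two children always differ at every locus.
p1, p2 = [0, 1, 1, 0, 1], [1, 1, 0, 0, 0]
c1, c2 = max_hamming_crossover(p1, p2)
```

In a training-set-optimization setting, each bit would encode whether a given instance (within the decision-tree-identified noise range) or feature is retained; driving the children maximally apart is what keeps the population diverse.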
Authors
DONG Ming-gang; HUANG Yu-yang; JING Chao (College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal
2020, No. 8, pp. 178-184 (7 pages)
Funding
National Natural Science Foundation of China (61563012, 61802085, 61203109)
Natural Science Foundation of Guangxi (2014GXNSFAA118371, 2015GXNSFBA139260)
Foundation of Guangxi Key Laboratory of Embedded Technology and Intelligent System (2018A-04)
Keywords
Genetic algorithm
K-nearest neighbor
Instance selection
Feature selection
Noise sample
Decision tree