Abstract
To address the tendency of existing oversampling methods to introduce noise points and to synthesize overlapping samples, this paper proposes an oversampling method for imbalanced data based on natural nearest neighbors. The method first determines the natural nearest neighbors of each minority-class sample; the number of neighbors for each sample is computed adaptively by the algorithm and reflects how dense or sparse the sample distribution is. The minority-class samples are then clustered according to their natural-neighbor relations, and new samples are generated from core points in dense regions and non-core points in sparse regions of the same cluster. Comparative experiments on a two-dimensional synthetic dataset and on UCI datasets verify the feasibility and effectiveness of the method and show that it improves classification accuracy on imbalanced data.
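The first and last steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly used natural-neighbor search (grow the neighborhood size r until every point has been chosen as someone's neighbor, or the number of unchosen points stops changing) and then interpolates SMOTE-style between a minority sample and one of its mutual neighbors. The clustering step and the pairing of core with non-core points are not reproduced here, and the names `natural_neighbors` and `oversample` are hypothetical.

```python
import numpy as np

def natural_neighbors(X, max_r=None):
    """Return (r, neighbors); neighbors[i] is the set of mutual r-nearest neighbors of i."""
    n = len(X)
    if n < 2:
        raise ValueError("need at least two samples")
    max_r = max_r or n - 1
    # Pairwise Euclidean distances; a point is never its own neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1, kind="stable")  # order[i, k] = (k+1)-th nearest to i
    reverse_count = np.zeros(n, dtype=int)        # how often each point was chosen
    prev_orphans = -1
    for r in range(1, max_r + 1):
        np.add.at(reverse_count, order[:, r - 1], 1)
        orphans = int(np.sum(reverse_count == 0))
        # Stop when nobody is unchosen, or the unchosen count has stabilized.
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    knn = [set(order[i, :r].tolist()) for i in range(n)]
    neighbors = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return r, neighbors

def oversample(X_min, n_new, seed=0):
    """Assumed generation rule: interpolate between a sample and a natural neighbor."""
    rng = np.random.default_rng(seed)
    _, neighbors = natural_neighbors(X_min)
    # Samples without any natural neighbor are skipped as possible noise.
    seeds = [i for i, nb in enumerate(neighbors) if nb]
    if not seeds:
        raise ValueError("no sample has a natural neighbor")
    new = []
    for _ in range(n_new):
        i = rng.choice(seeds)
        j = rng.choice(sorted(neighbors[i]))
        gap = rng.random()  # position along the segment from X_min[i] to X_min[j]
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)
```

Because new points are placed only on segments between mutual neighbors, two well-separated groups of minority samples should not receive synthetic points in the gap between them, which is the overlap problem the abstract refers to.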
Authors
孟东霞
李玉鑑
MENG Dongxia; LI Yujian (School of Financial Technology, Hebei Finance University, Baoding, Hebei 071051, China; School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China)
Source
《计算机工程与应用》
CSCD
PKU Core Journal
2021, Issue 2, pp. 91-96 (6 pages)
Computer Engineering and Applications
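Funding
Fund of the Hebei Provincial University Smart Finance Application Technology R&D Center (XGZJ2020008)
National Natural Science Foundation of China (61876010)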
Keywords
imbalanced dataset
oversampling
natural nearest neighbor
clustering