
An Undersampling Method Based on the k-Nearest Neighbor Center Offset Factor
Abstract: To address the poor classification performance obtained on imbalanced datasets in practical applications, this paper proposes a data-processing method that undersamples the majority class based on the k-nearest-neighbor center offset factor. The k-nearest-neighbor center is the center point of the region covered by a sample's k nearest neighbors. Its position shifts as k increases, and the degree of fluctuation in that shift is expressed by the center offset factor. The value of the center offset factor reflects the local density around a sample: a small value indicates that the sample and its neighbors lie in a dense region, or that the neighbors are densely distributed on one side of the sample, so the sample may be redundant. To remove as many highly redundant majority-class samples as possible without changing the original data distribution, the method first removes noise points from the majority class and computes the center offset factor of each remaining majority-class sample; it then sorts the majority-class samples by center offset factor in ascending order; finally, it deletes some majority-class samples by comparing each sample's center offset factor with those of its k nearest neighbors, so that the dataset tends toward balance. In the experiments, a support vector machine is used to classify 14 datasets balanced by several undersampling methods. The results show that the proposed method performs better on most of the datasets and effectively improves the classification accuracy of the minority class.
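The abstract does not give the formula for the center offset factor or the exact deletion rule, so the following is only a minimal sketch of the described pipeline under stated assumptions, not the authors' implementation. It assumes that the k-nearest-neighbor center is the mean of a sample's j nearest neighbors (j = 1..k), that the factor is the standard deviation of the step lengths between successive centers, and that a sample is dropped when its factor does not exceed the mean factor of its neighbors. The function names `center_offset_factor` and `undersample_majority` and the parameter `n_keep` are hypothetical, and the noise-removal step is omitted.

```python
import numpy as np

def center_offset_factor(X, k):
    """ASSUMED definition: std of the step lengths between successive
    j-nearest-neighbor centers (j = 1..k). Requires 2 <= k < len(X)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; a sample is not its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]                # (n, k) neighbor indices
    # centers[i, j-1] is the mean of the j nearest neighbors of sample i.
    centers = np.cumsum(X[nn], axis=1) / np.arange(1, k + 1)[None, :, None]
    # Length of each shift of the center as one more neighbor is added.
    steps = np.linalg.norm(np.diff(centers, axis=1), axis=2)  # (n, k-1)
    return steps.std(axis=1), nn

def undersample_majority(X_maj, n_keep, k=5):
    """Return indices of majority samples to keep (at least n_keep)."""
    factor, nn = center_offset_factor(X_maj, k)
    keep = np.ones(len(factor), dtype=bool)
    # Visit samples from the lowest factor to the highest.
    for i in np.argsort(factor, kind="stable"):
        if keep.sum() <= n_keep:
            break
        # ASSUMED rule: a sample whose factor is no larger than the mean
        # factor of its k nearest neighbors is treated as redundant.
        if factor[i] <= factor[nn[i]].mean():
            keep[i] = False
    return np.flatnonzero(keep)
```

In use, `n_keep` would typically be set to the minority-class size, and the kept indices would select the majority rows passed to the classifier. Because the assumed rule only removes samples that pass the neighbor comparison, the result may contain more than `n_keep` samples.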
Authors: Meng Dongxia (孟东霞); Xie Linyan (谢林燕). Affiliations: Intelligence Finance Application Technology R&D Center of Hebei Colleges; School of Financial Technology, Hebei Finance University, Baoding, Hebei 071051, China; Hebei Branch of National Computer Network Emergency Technology Coordination Center, Shijiazhuang 050021, China.
Source: Statistics & Decision (《统计与决策》), Peking University Core Journal, 2023, No. 12, pp. 40-44 (5 pages).
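Funding: Project of the Intelligence Finance Application Technology R&D Center of Hebei Colleges (IFDC2022030C); Hebei Provincial Science and Technology Program (20310701D); Central Government Guided Local Science and Technology Development Fund (216Z0701G).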
Keywords: imbalanced dataset; undersampling; k-nearest neighbor; center offset factor
