
Oversampling method for imbalanced data based on sample potential and noise evolution
Abstract Oversampling is an effective strategy for imbalanced data classification. Most existing methods use the K-Nearest Neighbor (KNN) technique to select seed samples for oversampling, but they become markedly unstable when the KNN parameter value changes. The Radial-Basis Oversampling (RBO) method addresses this issue, but it tends to introduce a large amount of noise after oversampling. To address this, an oversampling method for imbalanced data based on sample potential and noise evolution is proposed, which iteratively evolves the oversampled dataset. First, RBO synthesizes minority-class samples by calculating sample potential, which reduces the imbalance of the original data. Second, Natural Neighbor (NaN) is used as an error detection technique to identify suspected noise samples in the oversampled dataset. Finally, an improved Differential Evolution (DE) method iteratively evolves the detected suspected noise samples. Compared with traditional oversampling methods, the proposed method better exploits important boundary information in the dataset, giving classifiers more support and improving their classification performance. Extensive comparative experiments were conducted on 22 benchmark datasets against seven classical sampling methods, each combined with three different classifiers. The results show that the proposed method achieves higher F1 and G-mean values and handles noise better than sampling methods with post-filters, so it deals with imbalanced data classification more effectively. In addition, statistical analysis shows that it attains a higher Friedman ranking.
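The abstract says that minority-class samples are synthesized by calculating sample potential but gives no formula. The sketch below illustrates one common way to define such a potential with Gaussian radial basis functions: each majority sample adds to the potential at a point and each minority sample subtracts from it. The function name `sample_potential`, the Euclidean distance, and the spread parameter `gamma` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def sample_potential(x, majority, minority, gamma=1.0):
    """Gaussian RBF potential at x (illustrative assumption, not the paper's exact formula).

    Majority samples contribute positively, minority samples negatively,
    so a positive value suggests x lies in a majority-dominated region.
    """
    x = np.asarray(x, dtype=float)
    majority = np.asarray(majority, dtype=float)
    minority = np.asarray(minority, dtype=float)
    # Euclidean distance from x to every sample of each class (assumed metric).
    d_major = np.linalg.norm(majority - x, axis=1)
    d_minor = np.linalg.norm(minority - x, axis=1)
    # Sum of Gaussian kernels; gamma controls how quickly influence decays.
    major_part = np.exp(-((d_major / gamma) ** 2)).sum()
    minor_part = np.exp(-((d_minor / gamma) ** 2)).sum()
    return float(major_part - minor_part)


# Toy data: three majority points near the origin, two minority points near (2, 2).
majority = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
minority = np.array([[2.0, 2.0], [2.1, 1.9]])

near_majority = sample_potential([0.1, 0.1], majority, minority)
near_minority = sample_potential([2.0, 2.0], majority, minority)
print(near_majority > 0, near_minority < 0)
```

On this toy data the potential is positive close to the majority cluster and negative close to the minority cluster. An oversampler built on this idea could use that sign and magnitude to decide where synthetic minority samples are safe to place.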
Authors LENG Qiangkui; SUN Xuezi; MENG Xiangfu (School of Electronics and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China)
Source Journal of Computer Applications (CSCD; Peking University Core), 2024, No. 8, pp. 2466-2475 (10 pages)
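Funding National Natural Science Foundation of China (61602056, 61772249); Natural Science Foundation of Liaoning Province (2019-ZD-0493); Scientific Research Project of the Education Department of Liaoning Province (LQ2019012); Doctoral Research Startup Fund of Liaoning Technical University (21-1043).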
Keywords K-Nearest Neighbor (KNN); Radial-Basis Oversampling (RBO); sample potential; natural neighbor; Differential Evolution (DE); imbalanced data classification