
Research on classification algorithm of imbalanced datasets based on improved SMOTE (cited by 27)
Abstract: When a random forest is combined with the SMOTE algorithm on imbalanced datasets, two problems arise: the synthetic samples drift toward the margins of the dataset, and the computational cost is high. To address this, two improved SMOTE-based algorithms are proposed, TSMOTE (Triangle SMOTE) and MDSMOTE (Max Distance SMOTE). Their core idea is to confine the generation of new samples to a restricted region so that the distribution of the sample set tends toward its centre, and to construct synthetic samples from fewer positive-class points, thereby limiting the sampling region and reducing algorithmic complexity. Extensive experiments on six imbalanced datasets show that, compared with the traditional SMOTE approach, the improved algorithms substantially reduce running time and achieve higher G-mean, F-value and AUC scores.
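The abstract describes the core idea only at a high level. As a rough illustration of region-restricted oversampling (an assumption-based sketch, not the paper's exact TSMOTE or MDSMOTE procedure; the function name centroid_restricted_smote and all variable names are hypothetical), the following Python snippet interpolates between minority-class samples and the minority-class centroid, so synthetic points fall toward the centre of the class rather than on its margins:

```python
import numpy as np

def centroid_restricted_smote(X_min, n_new, seed=None):
    """Illustrative sketch only (not the paper's exact TSMOTE/MDSMOTE):
    create synthetic minority samples on the segment between a randomly
    chosen minority sample and the minority-class centroid, so the
    augmented distribution is pulled toward the class centre."""
    rng = np.random.default_rng(seed)
    centroid = X_min.mean(axis=0)                  # centre of the minority class
    idx = rng.integers(0, len(X_min), size=n_new)  # pick anchor samples at random
    lam = rng.random((n_new, 1))                   # interpolation weights in [0, 1]
    return X_min[idx] + lam * (centroid - X_min[idx])

# Toy usage: augment a 20-point minority class with 80 synthetic points.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_syn = centroid_restricted_smote(X_min, n_new=80, seed=1)
print(X_syn.shape)  # (80, 2)
```

Anchoring every synthetic point to the class centroid requires no k-nearest-neighbour search, which is consistent with (though not identical to) the abstract's goal of limiting the sampling region and cutting computation time; the actual triangle-based (TSMOTE) and max-distance-based (MDSMOTE) constructions are defined in the full paper.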
Authors: ZHAO Qinghua; ZHANG Yihao; MA Jianfen; DUAN Qianqian (MicroNano System Research Center, College of Information Engineering & Key Lab of Advanced Transducers and Intelligent Control System (Ministry of Education), Taiyuan University of Technology, Taiyuan 030600, China)
Affiliation: Taiyuan University of Technology
Source: Computer Engineering and Applications (CSCD, Peking University core journal), 2018, Issue 18, pp. 168-173 (6 pages)
Funding: National Natural Science Foundation of China (No. 51505324); Shanxi International Science and Technology Cooperation Program (No. 2013-036).
Keywords: random forest; SMOTE algorithm; imbalanced dataset
Related literature

References (4)

Secondary references (60)

  • 1 Li Rui, Qiu Yuhui. Research on ant colony clustering algorithm based on discrete points. Computer Science, 2005, 32(6): 111-113. (cited by 4)
  • 2 Tian Zheng, Li Xiaobin, Ju Yanwei. Perturbation analysis of spectral clustering. Science in China (Series E), 2007, 37(4): 527-543. (cited by 33)
  • 3 Phua C, Alahakoon D, Lee V. Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 50-59.
  • 4 Zheng Zhaohui, Srihari R. Optimally combining positive and negative features for text categorization [EB/OL]. [2003-08-24]. http://www.site.uottwa.ca/-nat/Workshop2003/zheng.pdf.
  • 5 Ertekin S, Huang Jian, Bottou L, et al. Learning on the border: active learning in imbalanced data classification [EB/OL]. [2007-11-08]. http://www.personal.psu.edu/juh177/pubs/CIKM2007.pdf.
  • 6 Kubat M, Matwin S. Addressing the curse of imbalanced training sets: one-sided selection. Proc of the 14th International Conference on Machine Learning. Nashville, USA, 1997: 179-186.
  • 7 Barandela R, Valdovinos R M, Sanchez J S, et al. The imbalanced training sample problem: under or over sampling. Proc of the Joint IAPR International Workshops on Structural, Syntactic and Statistical Pattern Recognition. Lisbon, Portugal, 2004: 806-814.
  • 8 Chawla N V, Hall L O, Bowyer K W, et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
  • 9 Han Hui, Wang Wenyuan, Mao Binghua. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Proc of the International Conference on Intelligent Computing. Hefei, China, 2005: 878-887.
  • 10 Jo T, Japkowicz N. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 40-49.

Shared references (131)

Co-cited literature (177)

Citing literature (27)

Secondary citing literature (219)
