Abstract
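To address the problems that arise when random forest is combined with SMOTE on imbalanced datasets, namely a marginalized sample distribution and high computational complexity, this paper proposes two SMOTE-based improvements: TSMOTE (triangle SMOTE) and MDSMOTE (Max Distance SMOTE). The core idea is to confine the generation of new samples to a bounded region so that the sample distribution tends toward the center, and to synthesize samples from fewer positive-class points, thereby restricting the sample region and lowering algorithmic complexity. Extensive experiments on six imbalanced datasets show that, compared with the traditional algorithm, the improved algorithms take substantially less time and achieve higher G-mean, F-value, and AUC values.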
Combining random forest with the SMOTE algorithm on imbalanced datasets suffers from two shortcomings: a marginalized distribution of the dataset and high computational complexity. This paper proposes the TSMOTE (triangle SMOTE) and MDSMOTE (Max Distance SMOTE) algorithms. The core idea of the improved algorithms is to restrict the generation of new samples to a certain area, so that the distribution of the sample set tends to be centralized, which reduces the time complexity relative to the traditional SMOTE algorithm. Extensive experiments on six imbalanced datasets show that, compared with the traditional SMOTE method, the improved algorithms reduce time consumption and achieve higher G-mean, F-value, and AUC values.
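The abstract reports G-mean and F-value without defining them. The sketch below uses the definitions commonly applied to imbalanced classification, treating the minority class as the positive class; the paper's exact formulas may differ, and the function names (`confusion_counts`, `g_mean`, `f_value`) and the `beta` parameter are illustrative assumptions rather than anything taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN with `positive` as the (assumed) minority label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (positive recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_value(y_true, y_pred, positive=1, beta=1.0):
    """F-measure of the positive class; beta=1 gives the usual F1 (assumed)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Toy labels: 2 positive (minority) samples, 8 negative samples.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    print(g_mean(y_true, y_pred), f_value(y_true, y_pred))
```

Both metrics depend on how well the minority class is recovered, which is why they are preferred over plain accuracy when comparing oversampling schemes.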
Authors
赵清华
张艺豪
马建芬
段倩倩
ZHAO Qinghua; ZHANG Yihao; MA Jianfen; DUAN Qianqian (MicroNano System Research Center, College of Information Engineering & Key Lab of Advanced Transducers and Intelligent Control System (Ministry of Education), Taiyuan University of Technology, Taiyuan 030600, China)
Source
《计算机工程与应用》
CSCD
Peking University Core Journal
2018, Issue 18, pp. 168-173 (6 pages)
Computer Engineering and Applications
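Funding
National Natural Science Foundation of China (No. 51505324)
Shanxi Province International Science and Technology Cooperation Program (No. 2013-036).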