An Improved SMOTE Algorithm Based on Probability Density Estimation


Abstract: Class imbalance is one of the main concerns in machine learning and data mining. Many methods have been proposed to address it, among which instance sampling is the simplest, one of the most effective, and the most commonly used. SMOTE (synthetic minority oversampling technique), the most popular sampling algorithm, is easily affected by noisy instances and generalizes poorly. To address these shortcomings, this paper proposes an improved SMOTE algorithm based on probability density estimation. First, the instances of each class are assumed to follow a Gaussian mixture distribution, and a Gaussian mixture model is used to estimate the probability density of each instance; noisy instances are then filtered out by comparing the rankings of each instance's intra-class and inter-class probability densities. Second, the probability densities are recomputed on the filtered minority-class instances, which are divided according to their characteristics into three groups: boundary, safe, and outlier instances. Finally, a different SMOTE sampling strategy is applied to each of the three groups. In addition, to further improve generalization, the neighborhood calculation rule of SMOTE is modified. Experiments on several benchmark binary-class imbalanced data sets indicate that the proposed algorithm is effective and feasible, and that it significantly outperforms several existing sampling algorithms.
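The first and last steps described in the abstract (density-based noise filtering followed by SMOTE interpolation) can be illustrated with a minimal sketch. Several parts are assumptions rather than the paper's method: each class is modeled by a single diagonal-covariance Gaussian (a one-component stand-in for a Gaussian mixture model), the noise rule in `filter_noise` and its `frac` parameter are hypothetical because the abstract does not state the exact rank comparison, and `smote` is the standard interpolation rule rather than the modified neighborhood rule. The boundary/safe/outlier grouping is not sketched because the abstract does not give its criteria.

```python
import math
import random

def fit_diag_gaussian(points):
    """Fit one diagonal-covariance Gaussian (assumed stand-in for a GMM)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    # Floor the variance so a degenerate dimension cannot divide by zero.
    var = [max(sum((p[j] - mean[j]) ** 2 for p in points) / n, 1e-6)
           for j in range(d)]
    return mean, var

def log_density(x, model):
    """Log probability density of point x under a fitted diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def ranks(values):
    """Integer rank of each value, 0 = smallest."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for r, i in enumerate(order):
        out[i] = r
    return out

def filter_noise(own, other, frac=0.2):
    """Hypothetical rule (not from the paper): drop a sample whose intra-class
    density rank is among the lowest `frac` of its class while its inter-class
    density rank is among the highest `frac`."""
    n = len(own)
    cut = max(1, int(frac * n))
    own_model, other_model = fit_diag_gaussian(own), fit_diag_gaussian(other)
    intra = ranks([log_density(x, own_model) for x in own])
    inter = ranks([log_density(x, other_model) for x in own])
    return [x for x, a, b in zip(own, intra, inter)
            if not (a < cut and b >= n - cut)]

def smote(minority, n_new, k=3, rng=None):
    """Standard SMOTE: interpolate between a sample and one of its k nearest
    minority neighbors at a random position on the connecting segment."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Toy data: the last minority point sits inside the majority cluster.
majority = [[0.0, 0.0], [0.5, 0.2], [-0.4, 0.3], [0.2, -0.5],
            [-0.3, -0.2], [0.4, 0.4], [-0.1, 0.6], [0.1, -0.1]]
minority = [[5.0, 5.0], [5.4, 4.8], [4.7, 5.3], [5.2, 5.5],
            [4.9, 4.6], [0.1, 0.1]]

filtered = filter_noise(minority, majority)
synthetic = smote(filtered, n_new=4, k=3, rng=random.Random(0))
```

On this toy data the minority point placed inside the majority cluster is the only one removed, and every synthetic point lies on a segment between two remaining minority points. A real implementation would fit a multi-component mixture and use the paper's own rank comparison.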

Authors: Li Tao; Zheng Shang; Zou Haitao; Yu Hualong (School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China)

Affiliation: School of Computer Science, Jiangsu University of Science and Technology

Source: Journal of Nanjing Normal University (Natural Science Edition), indexed in CAS, CSCD, and the Peking University Core Journals list; 2019, No. 1, pp. 65-72 (8 pages)

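Funding: National Natural Science Foundation of China (61305058, 61572242); Natural Science Foundation of Jiangsu Province (BK20130471); China Postdoctoral Science Foundation Special Funding Program (2015T80481); China Postdoctoral Science Foundation (2013M540404); Jiangsu Province Postdoctoral Foundation (1401037B)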

Keywords: class imbalance; probability density; instance sampling; SMOTE; Gaussian mixture distribution

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

