
基于概率密度估计的SMOTE改进算法研究

An Improved SMOTE Algorithm Based on Probability Density Estimation
Abstract Class imbalance is one of the main problems studied in machine learning and data mining. Many methods have been proposed to address it, among which instance sampling is the simplest and most effective, as well as the most commonly used. SMOTE (synthetic minority oversampling technique), the most popular sampling algorithm, is easily affected by noisy instances and generalizes poorly. To address these shortcomings, this paper proposes an improved SMOTE algorithm based on probability density estimation. First, the instances of each class are assumed to follow a Gaussian mixture distribution, and a Gaussian mixture model is used to estimate the probability density of each instance; noisy instances are then filtered out by comparing the rankings of each instance's intra-class and inter-class probability densities. Second, the probability densities are recomputed on the filtered minority-class instances, which are then divided into three groups according to their characteristics: boundary instances, safe instances, and outlier instances. Finally, a different SMOTE sampling strategy is applied to each of the three groups. In addition, to further improve generalization, the neighborhood calculation rule of SMOTE is modified. The algorithm was evaluated on several benchmark binary-class imbalanced data sets. The experimental results indicate that it is effective and feasible, and that it significantly outperforms several existing sampling algorithms.
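For orientation, the baseline SMOTE interpolation that the abstract takes as its starting point can be sketched as below. This is a minimal illustration of standard SMOTE only, not the authors' improved algorithm: the abstract does not specify the GMM-based noise filter, the rule for the boundary/safe/outlier grouping, or the modified neighborhood rule, so none of these is reproduced here. The function names `k_nearest` and `smote`, the random choice of the base instance, and the toy data are assumptions made for illustration.

```python
import math
import random


def k_nearest(points, index, k):
    """Return indices of the k points closest (Euclidean) to points[index]."""
    base = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(base, points[i]))
    return others[:k]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority instances by baseline SMOTE interpolation.

    Hypothetical helper: each synthetic point lies on the line segment between
    a randomly chosen minority instance and one of its k nearest minority
    neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))           # base instance
        j = rng.choice(k_nearest(minority, i, k))  # one of its neighbors
        gap = rng.random()                         # position along the segment
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


# Toy minority class (made-up values, for illustration only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is interpolated between two existing minority instances, a noisy instance used as a base or neighbor propagates its noise into the new points, which is the weakness the filtering step described in the abstract is meant to address.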
Authors Li Tao (李涛), Zheng Shang (郑尚), Zou Haitao (邹海涛), Yu Hualong (于化龙) (School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212003, China)
Source Journal of Nanjing Normal University (Natural Science Edition) (《南京师大学报(自然科学版)》), 2019, No. 1, pp. 65-72 (8 pages); indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心)
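Funding National Natural Science Foundation of China (61305058, 61572242); Natural Science Foundation of Jiangsu Province (BK20130471); China Postdoctoral Science Foundation Special Funding Program (2015T80481); China Postdoctoral Science Foundation (2013M540404); Jiangsu Province Postdoctoral Fund (1401037B)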
Keywords class imbalance; probability density; instance sampling; SMOTE; Gaussian mixture distribution