Research on classification for imbalanced dataset based on improved SMOTE

Cited by: 19
Abstract: To address the shortcomings of SMOTE (Synthetic Minority Over-sampling Technique) when it synthesizes new minority-class samples, an improved SMOTE algorithm (SSMOTE) is proposed. The key idea is to introduce the concept of support and roulette wheel selection into SMOTE, and to make full use of the distribution of each sample's nearest neighbors from the other class, so that both the quality and the quantity of the synthesized minority-class samples can be finely controlled. SSMOTE is combined with the KNN (K-Nearest Neighbor) algorithm to handle classification on imbalanced datasets. Extensive experiments on UCI datasets, comparing SSMOTE with related algorithms from the pertinent literature, show that SSMOTE synthesizes new minority-class samples well overall and effectively improves the classification performance of KNN on imbalanced datasets.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2013, No. 2, pp. 184-187, 245 (5 pages)
Funding: National Natural Science Foundation of China (No. 31170393); Natural Science Project of the Shaanxi Provincial Department of Education (No. 2010JK620)
Keywords: imbalanced datasets; classification; support; roulette wheel selection; Synthetic Minority Over-sampling Technique (SMOTE)
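
The abstract names the ingredients of SSMOTE (support, roulette wheel selection, SMOTE-style synthesis) but this record does not give the formulas. The block below is a minimal Python sketch of how those pieces could fit together, not the paper's method. The definition of support used here (the share of a minority sample's k nearest neighbors that are also minority-class), the function names `support_scores`, `roulette_select` and `ssmote_sketch`, and the parameter `k` are all assumptions made for illustration.

```python
import numpy as np

def support_scores(X_min, X_maj, k=5):
    """ASSUMED definition of 'support': fraction of each minority sample's
    k nearest neighbors (over all other samples) that are minority-class."""
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < len(X_min)  # minority rows come first
    k = min(k, len(X_all) - 1)
    scores = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                            # exclude the sample itself
        nn = np.argsort(d)[:k]
        scores[i] = is_min[nn].mean()
    return scores

def roulette_select(weights, n, rng):
    """Roulette wheel selection: draw n indices with probability
    proportional to their weight."""
    w = np.asarray(weights, dtype=float)
    if w.sum() <= 0:                             # fall back to uniform
        w = np.ones_like(w)
    cum = np.cumsum(w / w.sum())
    idx = np.searchsorted(cum, rng.random(n), side="right")
    return np.minimum(idx, len(w) - 1)           # guard against round-off

def ssmote_sketch(X_min, X_maj, n_new, k=5, seed=0):
    """Pick seed samples by support-weighted roulette wheel, then
    interpolate toward a random minority neighbor as in plain SMOTE."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    scores = support_scores(X_min, X_maj, k)
    seeds = roulette_select(scores, n_new, rng)
    k_min = min(k, len(X_min) - 1)               # needs >= 2 minority samples
    new = np.empty((n_new, X_min.shape[1]))
    for j, i in enumerate(seeds):
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        mate = X_min[rng.choice(np.argsort(d)[:k_min])]
        new[j] = X_min[i] + rng.random() * (mate - X_min[i])
    return new

X_min = [[0, 0], [1, 0], [0, 1], [1, 1]]
X_maj = [[5, 5], [6, 5], [5, 6], [6, 6], [5.5, 5.5], [10, 10]]
synthetic = ssmote_sketch(X_min, X_maj, n_new=8, k=3)
```

Under this assumed weighting, minority samples surrounded by the majority class receive low support and are rarely chosen as seeds, which is one plausible way to limit the number of samples synthesized from noisy regions. The synthetic rows can then be appended to the minority class before training a KNN classifier.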