
Novel Hybrid Re-sampling Algorithm Based on Imbalanced Data Sets
Abstract: Building on an analysis of re-sampling techniques, a hybrid re-sampling algorithm with automated adaptive selection of the number of nearest neighbors (ADSNNHRS) is designed and implemented. The method combines the strengths of over-sampling and under-sampling, and addresses two weaknesses of the SMOTE over-sampling algorithm: synthetic examples are generated blindly by randomly interpolating pairs of close neighbors, and only numeric attributes can be handled. The new algorithm adapts its neighbor-selection strategy to the actual internal distribution of the instance set and applies different generation rules to attributes of different types when creating new minority-class examples, so the quality of the synthetic samples can be controlled and improved. The over-sampled data set is then moderately under-sampled with an improved neighborhood cleaning rule (NCL), which removes redundant majority-class instances and noisy borderline examples and reduces the data size toward a relative balance. Removing examples that lie on the wrong side of the decision border helps to form better-defined class clusters, allowing simpler models with better generalization; the combined procedure therefore handles imbalanced classification effectively and considerably enhances classifier performance.
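The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the mixed-attribute generation rule, and the simplified majority-vote form of neighborhood cleaning are assumptions made for the sketch.

```python
import random
from collections import Counter

def synthesize(sample, neighbor, numeric_idx, nominal_idx):
    """SMOTE-style synthesis for mixed attributes (illustrative):
    numeric values are interpolated between the pair, nominal values
    are copied from one of the two parents at random."""
    new = list(sample)
    gap = random.random()  # random position on the line segment
    for i in numeric_idx:
        new[i] = sample[i] + gap * (neighbor[i] - sample[i])
    for i in nominal_idx:
        new[i] = random.choice([sample[i], neighbor[i]])
    return new

def ncl_clean(X, y, minority_label, k=3):
    """Simplified neighborhood cleaning (assumed form): keep every
    minority example, and drop any majority example whose k nearest
    neighbors (squared Euclidean distance) vote against its label."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority_label:
            keep.append(i)
            continue
        dists = sorted((sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
                       for j, xj in enumerate(X) if j != i)
        votes = Counter(y[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return keep
```

For example, a majority example surrounded by minority neighbors is treated as borderline noise and removed by `ncl_clean`, while well-clustered majority examples survive.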
Source: Journal of Wuhan University of Technology (CAS, CSCD, Peking University Core Journal), 2010, No. 20, pp. 55-60 (6 pages).
Funding: National High-Tech R&D Program of China (863 Program) project (2009AA12Z117); Xiangfan University planning project (2009YA012).
Keywords: imbalanced data sets; re-sampling; machine learning; classification

References (12)

1. Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: Synthetic Minority Over-sampling Technique [J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
2. Chawla N V, Lazarevic A, Hall L O, et al. SMOTEBoost: Improving Prediction of the Minority Class in Boosting [C]// Lecture Notes in Computer Science, 2003: 107-119.
3. Han H, Wang W, Mao B. Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning [J]. Lecture Notes in Computer Science, 2005, 3644(1): 878-887.
4. Yang Zhiming, Qiao Liyan, Peng Xiyuan. Research on Imbalanced Data Mining Method Based on Improved SMOTE [J]. Acta Electronica Sinica, 2007, 35(B12): 22-26.
5. Hart P E. The Condensed Nearest Neighbor Rule [J]. IEEE Transactions on Information Theory, 1968, 14(3): 515-516.
6. Laurikkala J. Improving Identification of Difficult Small Classes by Balancing Class Distribution [C]// Artificial Intelligence in Medicine, 2001: 63-66.
7. Kubat M, Matwin S. Addressing the Curse of Imbalanced Training Sets: One-sided Selection [C]// Proceedings of the Fourteenth International Conference on Machine Learning, 1997: 179-186.
8. Tomek I. Two Modifications of CNN [J]. IEEE Transactions on Systems, Man and Cybernetics, 1976, 6(6): 769-772.
9. Estabrooks A. A Combination Scheme for Inductive Learning from Imbalanced Data Sets [D]. Dalhousie University, 2000.
10. Stanfill C, Waltz D. Toward Memory-based Reasoning [J]. Communications of the ACM, 1986, 29(12): 1213-1228.

