A Classification Method for Imbalance Data Set Based on Hybrid Strategy
Abstract: This paper presents an effective classification method for imbalanced data sets (IDS). The core idea is to address the problem from two sides, sample preprocessing and classifier improvement, so as to provide a comprehensive solution. The algorithm has three parts. First, the imbalanced data are re-sampled using variable self-organizing map (VSOM) clustering, which overcomes the drawbacks of traditional re-sampling methods: strong randomness, subjective human interference, and information loss. Second, the newly sampled data are pruned according to the K-nearest neighbor (K-NN) rule, which resolves the data overlap that occurs in practice and improves the generalization of the SVM. Third, a conformal transformation is applied to the SVM kernel function to align the class boundary and adapt it to the class imbalance. Comparative experiments involving three algorithms demonstrate the effectiveness of the method on imbalanced data sets. The algorithm has also been applied successfully to answer extraction in the authors' question answering system, which obtained outstanding results in the international TREC 2006 QA track.
Source: Acta Electronica Sinica (《电子学报》), 2007, No. 11, pp. 2161-2165 (5 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
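Funding: Key Program of the National Natural Science Foundation of China (No. 60435020); Key Project of the National High Technology Research and Development Program of China (863 Program) (No. 2006AA01Z197)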
Keywords: imbalanced data sets (IDS); classification; support vector machine (SVM); variable self-organizing maps (VSOM); K-nearest neighbor (K-NN)
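
The abstract names two steps that a short sketch may help make concrete: pruning by the K-NN rule and a conformal transformation of the SVM kernel. The code below is an illustration only, not the authors' implementation. The abstract does not state the exact pruning criterion or scaling function, so both are assumptions here: pruning drops a sample when fewer than half of its k nearest neighbors share its label (the classical edited-nearest-neighbor rule), and the conformal kernel has the general form K~(x, x') = D(x) K(x, x') D(x') with a Gaussian bump D centered on a hypothetical set of `anchors` (for example, points near the class boundary). All function and parameter names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def knn_prune(samples, labels, k=3):
    """Keep a sample only if a strict majority of its k nearest
    neighbors carry the same label (assumed pruning criterion)."""
    kept_samples, kept_labels = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: sq_dist(x, samples[j]))[:k]
        agree = sum(1 for j in nearest if labels[j] == y)
        if 2 * agree > len(nearest):
            kept_samples.append(x)
            kept_labels.append(y)
    return kept_samples, kept_labels


def rbf(a, b, gamma=1.0):
    """Base RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sq_dist(a, b))


def scale(x, anchors, tau=1.0):
    """Positive scaling factor D(x): a sum of Gaussian bumps centered on
    the anchors (assumed form; the paper's D is not given in the abstract)."""
    return sum(math.exp(-sq_dist(x, a) / (2 * tau ** 2)) for a in anchors)


def conformal_kernel(a, b, anchors, gamma=1.0, tau=1.0):
    """Conformally transformed kernel D(a) * K(a, b) * D(b)."""
    return scale(a, anchors, tau) * rbf(a, b, gamma) * scale(b, anchors, tau)
```

Because D is strictly positive, the transformed kernel stays symmetric and positive semi-definite, so it can be passed to any SVM solver that accepts a custom kernel. It magnifies the induced metric near the anchors, which is what allows the boundary to be adjusted toward or away from a chosen class.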