Abstract
This paper presents a novel and effective classification method for imbalanced data sets (IDS). Its core idea is to approach the problem from two directions, sample preprocessing and classifier improvement, so as to provide a comprehensive solution to IDS classification. First, the imbalanced data are re-sampled by dynamic (variable) self-organizing map (SOM) clustering, which overcomes the drawbacks of traditional re-sampling methods: strong randomness, subjective human interference, and information loss. Next, following the K-nearest-neighbor (K-NN) rule, the newly sampled data are pruned to resolve the data overlap that occurs in practice, which improves the generalization of the SVM. Finally, to adapt to class imbalance, a conformal transformation is applied to the SVM kernel function, which aligns the class boundary. Comparative experiments on three algorithms demonstrate the effectiveness of the method on imbalanced data sets. The algorithm has also been applied successfully to answer extraction in the authors' question answering system, and it was objectively validated in the international TREC 2006 QA track, where the system obtained outstanding results.
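The K-NN pruning step can be illustrated with a minimal sketch. The abstract does not give the exact pruning rule, the value of K, or which class is pruned, so everything here is an assumption: the function name `knn_prune`, the Euclidean distance, the majority-vote criterion, and the toy data are hypothetical and only show the general idea of removing samples that sit inside the other class's region.

```python
import math
from collections import Counter


def knn_prune(points, labels, k=3):
    """Keep only samples whose label matches the majority label of their
    k nearest neighbours (Euclidean distance). Assumed rule, not the paper's."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distance from sample i to every other sample, nearest first.
        neighbours = sorted(
            ((math.dist(p, q), labels[j]) for j, q in enumerate(points) if j != i),
            key=lambda pair: pair[0],
        )[:k]
        votes = Counter(label for _, label in neighbours)
        majority_label = votes.most_common(1)[0][0]
        if majority_label == y:
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels


# Toy data (hypothetical): two clusters plus one class-1 sample
# lying inside the class-0 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

pruned_points, pruned_labels = knn_prune(points, labels, k=3)
print(len(pruned_points), pruned_labels)
```

With two classes, an odd `k` avoids tied votes. On imbalanced data one would likely restrict pruning to selected samples so that genuine minority examples are not discarded; the abstract leaves that choice unspecified.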
Source
《电子学报》
EI
CAS
CSCD
Peking University Core Journal
2007, No. 11, pp. 2161-2165 (5 pages)
Acta Electronica Sinica
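Funding
Key Program of the National Natural Science Foundation of China (No. 60435020)
Key Project of the National High Technology Research and Development Program of China (863 Program) (No. 2006AA01Z197)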
Keywords
imbalanced data sets (IDS)
classification
support vector machine (SVM)
variable self-organizing maps (VSOM)
K-nearest neighbor (K-NN)