
Classification for Imbalanced Dataset of Improved Weighted KNN Algorithm

Cited by: 26
Abstract: When the K-Nearest Neighbor (KNN) algorithm classifies imbalanced datasets, its decision is always biased toward the majority class. Based on an analysis of these shortcomings, a novel weighted KNN approach (GAK-KNN) is presented. The key of GAK-KNN lies in a new weight assignment model, which fully accounts for the adverse effects of the uneven distribution of training samples both between classes and within classes. The specific steps are as follows: use a K-means algorithm based on a Genetic Algorithm (GA) to cluster the training sample set, compute the weight of each training sample according to the clustering results and the weight assignment model, and classify test samples with the improved KNN algorithm. GAK-KNN significantly improves the identification rate of minority-class samples as well as the overall classification performance. Theoretical analysis and comprehensive experimental results on UCI datasets confirm that GAK-KNN outperforms the traditional KNN algorithm and other improved algorithms.
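The voting step described in the abstract can be illustrated with a minimal sketch. Note the assumptions: the paper's actual weight model is derived from GA-based K-means clustering and also captures within-class unevenness, which is not reproduced here; this sketch implements only the class-imbalance part, weighting each training sample inversely to its class frequency before a weighted nearest-neighbor vote. The function names and the weighting formula are illustrative, not taken from the paper.

```python
import numpy as np
from collections import Counter

def sample_weights(y):
    # Illustrative weight model: weight each training sample inversely to
    # its class frequency, so minority-class votes count for more.
    # (The paper's model additionally uses GA-based K-means clustering.)
    counts = Counter(y)
    n, c = len(y), len(counts)
    return np.array([n / (c * counts[label]) for label in y])

def weighted_knn_predict(X_train, y_train, w, x, k=5):
    # Classify x by accumulating the weights of its k nearest neighbors
    # per class and returning the class with the largest weighted vote.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w[i]
    return max(votes, key=votes.get)
```

With six majority-class samples and two minority-class samples, a query point whose three nearest neighbors are two majority and one minority sample is assigned to the minority class under this weighting, whereas an unweighted majority vote would pick the majority class.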
Source: Computer Engineering (《计算机工程》), CAS, CSCD, 2012, Issue 20, pp. 160-163, 168 (5 pages)
Funding: National Natural Science Foundation of China (31170393); Natural Science Foundation of Shaanxi Province (2012JM8023); Natural Science Special Fund of the Shaanxi Provincial Education Department (12JK0726)
Keywords: imbalanced dataset; classification; K-Nearest Neighbor (KNN) algorithm; weight assignment model; Genetic Algorithm (GA); K-means algorithm


