An Imbalanced Data Set Resampling Algorithm by Introducing Bias Selection Variable


Abstract: Imbalanced data classification is one of the harder problems in pattern classification, mainly because the number of samples differs greatly between classes. To improve classification performance on imbalanced data sets, this paper proposes a resampling algorithm for imbalanced data sets that introduces a bias selection variable. The variable defines the probability with which a majority-class sample is selected during sampling. Introducing the bias selection variable reduces the degree of imbalance of the data set and thereby improves the generalization performance of classification algorithms on imbalanced data. Classification experiments on artificially generated data sets verify the effectiveness of the proposed resampling algorithm.
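
The abstract states only that the bias selection variable is the probability with which a majority-class sample is sampled; it does not give the full procedure. The following is a minimal sketch under that reading, assuming each majority-class sample is kept independently with probability `p_select` and every minority-class sample is kept. The function names, the parameter `p_select`, and the synthetic data are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter


def resample_with_bias(samples, labels, majority_label, p_select, seed=0):
    """Keep all minority samples; keep each majority sample with probability p_select.

    Assumption: the bias selection variable acts as an independent
    Bernoulli(p_select) keep/drop decision per majority-class sample.
    """
    if not 0.0 <= p_select <= 1.0:
        raise ValueError("p_select must lie in [0, 1]")
    rng = random.Random(seed)
    kept_samples, kept_labels = [], []
    for x, y in zip(samples, labels):
        # Minority samples always pass; majority samples pass with probability p_select.
        if y != majority_label or rng.random() < p_select:
            kept_samples.append(x)
            kept_labels.append(y)
    return kept_samples, kept_labels


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical synthetic data: 900 majority points (label 0), 100 minority points (label 1).
data_rng = random.Random(42)
samples = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(900)]
samples += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(100)]
labels = [0] * 900 + [1] * 100

# Choosing p_select = minority count / majority count aims at a roughly balanced set.
p_select = 100 / 900
new_samples, new_labels = resample_with_bias(samples, labels, 0, p_select)
print("before:", Counter(labels), "after:", Counter(new_labels))
```

Setting `p_select` to 1 leaves the data set unchanged, and smaller values discard more majority-class samples, so the variable trades off the remaining imbalance against the amount of majority-class information retained.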

Author: Xu Jin (徐尽)

Affiliation: School of Computer Science and Technology, Xuchang University

Source: Bulletin of Science and Technology (《科技通报》, Peking University Core Journal), 2013, No. 8, pp. 139-141 (3 pages)

Keywords: pattern classification; bias selection variable; degree of imbalance; generalization performance

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]


