Balancing Method for Skewed Training Set in Data Mining

Cited by: 3


Abstract: Classification is one of the central tasks in data mining. The training sets used to train classifiers may be skewed, and traditional classification algorithms typically achieve low accuracy on minority-class cases when trained on such sets. Existing balancing algorithms for skewed training sets handle only the two-class case. To balance skewed training sets with more than two target classes, the SSGP algorithm is proposed, based on the idea that cases of the same class differ little from one another. SSGP adds new minority-class cases in the neighborhood of existing cases of the same class, and makes all minority classes grow at the same rate. SSGP is proved not to add noise cases to the data set. To improve efficiency, it uses the modulus of a case in place of a large number of dissimilarity computations. Experiments show that a single run of SSGP improves the classification accuracy of several minority classes simultaneously.
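The abstract gives only the outline of SSGP, so the following Python is a rough sketch of the general idea rather than the authors' algorithm. It assumes numeric feature vectors, treats "modulus" as the Euclidean norm, and uses adjacency in norm-sorted order as a cheap stand-in for "nearby" cases (equal norms do not guarantee proximity, so this is only a coarse filter). The function name `balance` and its parameters `rounds` and `seed` are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt


def norm(x):
    """Euclidean modulus (length) of a feature vector."""
    return sqrt(sum(v * v for v in x))


def balance(samples, labels, rounds, seed=0):
    """Oversample every minority class by interpolation (illustrative only).

    Each round adds at most one synthetic case to every class that is still
    smaller than the largest class, so all minority classes grow in step.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(list(x))
    target = max(len(v) for v in by_class.values())
    # Sort each class once by modulus; neighbours in this order stand in for
    # "nearby" cases, avoiding pairwise dissimilarity computations.
    originals = {y: sorted(v, key=norm) for y, v in by_class.items()}
    counts = {y: len(v) for y, v in by_class.items()}
    new_samples, new_labels = [], []
    for _ in range(rounds):
        for y, cases in originals.items():
            if counts[y] >= target or len(cases) < 2:
                continue  # class already balanced, or too small to interpolate
            i = rng.randrange(len(cases) - 1)
            a, b = cases[i], cases[i + 1]
            t = rng.random()
            # New case lies on the segment between two same-class cases.
            new_samples.append([p + t * (q - p) for p, q in zip(a, b)])
            new_labels.append(y)
            counts[y] += 1
    return list(samples) + new_samples, list(labels) + new_labels
```

Calling `balance(X, y, rounds=k)` returns the original cases followed by the synthetic ones; with enough rounds every class reaches the size of the largest class. Interpolating only between original cases of one class keeps each synthetic case on a segment joining two real cases of that class, which is one simple way to stay close to existing same-class data.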

Authors: Li Xiongfei (李雄飞), Li Jun (李军), Qu Chengwei (屈成伟), Liu Lijuan (刘丽娟), Sun Tao (孙涛)

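Affiliations: Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (Jilin University); Department of Applied Mathematics, Changchun University of Science and Technology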

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, No. 2, pp. 346-353 (8 pages)

Funding: National Key Technology R&D Program of China (2006BAK01A33); Jilin Province Science and Technology Development Program (20070321, 20090704)

Keywords: classification; skewed training data; balancing algorithm; minority class case; modulus

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]


