
Balancing Method for Skewed Training Set in Data Mining (数据挖掘中平衡偏斜训练集的方法研究) Cited by: 3
Abstract: Classification is one of the central tasks in data mining. The training sets used to build classifiers are often skewed, and traditional classification algorithms trained on skewed sets usually achieve low predictive accuracy on minority classes. Existing balancing algorithms only handle data sets with two target classes. To balance skewed training sets with multiple target classes, this paper presents the SSGP algorithm, based on the idea that cases of the same class differ little from one another. SSGP forms new minority class cases by interpolating between minority class cases that lie close together, and ensures that the number of cases in every minority class grows at the same rate. SSGP is proved not to add noise cases to the data set. To improve efficiency, SSGP uses the modulus of each case in place of a large number of pairwise dissimilarity computations. Experimental results show that a single run of SSGP improves the predictive accuracy of several minority classes simultaneously.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD and the PKU Core list, 2012, No. 2, pp. 346-353 (8 pages)
Funding: National Key Technology R&D Program of China (2006BAK01A33); Science and Technology Development Program of Jilin Province (20070321, 20090704)
Keywords: classification; skewed training set; balancing algorithm; minority class case; modulus
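The abstract names three ingredients: interpolating between nearby cases of the same class, growing all minority classes at the same rate, and using the modulus of a case instead of many pairwise dissimilarity computations. The Python sketch below only illustrates those ideas; it is not the SSGP algorithm from the paper, whose details are not given here. The function name `balance_by_interpolation`, the choice to sort each class by vector norm and interpolate between norm-adjacent cases, and the round-robin schedule are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def norm(x):
    """Euclidean modulus (length) of a numeric vector."""
    return math.sqrt(sum(v * v for v in x))


def balance_by_interpolation(samples, labels, seed=0):
    """Oversample each minority class up to the size of the largest class.

    Hypothetical helper, NOT the paper's SSGP. Assumptions:
    - each class is sorted once by modulus, and new cases are interpolated
      between two cases that are adjacent in that order (a cheap stand-in
      for a nearest-neighbour search; equal norms do not guarantee that
      two cases are actually close);
    - minority classes receive one new case per round, in turn, so they
      all grow at the same rate until each reaches the target size.
    Classes with fewer than two cases are left unchanged.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(tuple(x))

    target = max(len(cases) for cases in by_class.values())
    ordered = {y: sorted(cases, key=norm) for y, cases in by_class.items()}
    counts = {y: len(cases) for y, cases in by_class.items()}

    new_samples, new_labels = [], []
    pending = [y for y in ordered if counts[y] < target and len(ordered[y]) >= 2]
    while pending:
        for y in list(pending):  # one new case per class per round
            pool = ordered[y]
            i = rng.randrange(len(pool) - 1)
            a, b = pool[i], pool[i + 1]
            t = rng.random()
            # The new case lies on the segment between two same-class cases.
            new_samples.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
            new_labels.append(y)
            counts[y] += 1
            if counts[y] >= target:
                pending.remove(y)

    return list(samples) + new_samples, list(labels) + new_labels
```

Sorting by modulus costs O(n log n) per class, whereas comparing every pair of cases costs O(n²); that gap is the kind of saving the abstract alludes to, although the paper's actual use of the modulus may differ from this sketch.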


