
Impact Analysis of Classification Performance for Cross-Validation with Imbalanced Data Splitting

Cited by: 3
Abstract  Cross-validation is widely used to estimate a model's generalization error; in particular, 2-fold cross-validation is commonly used for comparing classification models. This paper simulates how different 2-fold splits affect estimated model performance for a logistic regression classifier whose features (independent variables) take only the values 0 and 1. The results show that the 2-fold cross-validation estimates of accuracy, recall, F-value, and precision have the smallest bias when the class distributions of the two folds are the same or similar, and that the bias grows as the difference between the folds' class distributions increases. When the class distributions of the two folds differ substantially, the estimated model performance deteriorates markedly. Therefore, when splitting data for cross-validation, the class distribution of each fold should be kept as close as possible to that of the full data set.
Source  Journal of Taiyuan Normal University (Natural Science Edition), 2013, No. 1, pp. 53-58 (6 pages).
Keywords  2-fold cross-validation; logistic regression model; imbalanced class splitting; model performance
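
To make the comparison described in the abstract concrete, the following is a minimal sketch, not the paper's original simulation code. It assumes NumPy and scikit-learn, fits a logistic regression model on synthetic 0/1 features, and reports 2-fold cross-validation estimates of accuracy, precision, recall, and F1 as the share of the positive class assigned to one fold is moved away from a balanced (stratified) 50/50 split. The sample size, coefficient vector, and splitting scheme below are illustrative assumptions, not the paper's design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)

# Synthetic data with binary (0/1) features, loosely mirroring the paper's setting;
# the sample size and coefficients are illustrative assumptions, not the paper's setup.
n, p = 400, 5
X = rng.integers(0, 2, size=(n, p))
beta = np.array([1.5, -1.0, 0.8, 0.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta - 0.7)))).astype(int)


def two_fold_scores(X, y, pos_share_fold1):
    """2-fold CV in which `pos_share_fold1` of the positive class goes to fold 1.

    pos_share_fold1 = 0.5 corresponds to a stratified (class-balanced) split;
    values further from 0.5 make the class distributions of the two folds diverge.
    """
    pos = rng.permutation(np.flatnonzero(y == 1))
    neg = rng.permutation(np.flatnonzero(y == 0))
    k_pos = int(round(pos_share_fold1 * len(pos)))
    k_neg = len(neg) // 2  # negatives are always split evenly
    fold1 = np.concatenate([pos[:k_pos], neg[:k_neg]])
    fold2 = np.concatenate([pos[k_pos:], neg[k_neg:]])

    scores = []
    for train, test in [(fold1, fold2), (fold2, fold1)]:
        clf = LogisticRegression().fit(X[train], y[train])
        y_hat = clf.predict(X[test])
        scores.append([
            accuracy_score(y[test], y_hat),
            precision_score(y[test], y_hat, zero_division=0),
            recall_score(y[test], y_hat, zero_division=0),
            f1_score(y[test], y_hat, zero_division=0),
        ])
    # Average over the two folds: the usual 2-fold CV estimate of each metric.
    return np.mean(scores, axis=0)


for share in (0.5, 0.6, 0.7, 0.8, 0.9):
    acc, prec, rec, f1 = two_fold_scores(X, y, share)
    print(f"positive share in fold 1 = {share:.1f}: "
          f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```

Running the script prints the four metric estimates for splits ranging from stratified (0.5) to strongly imbalanced (0.9), which makes it possible to observe how the estimates change as the folds' class distributions diverge, the effect the paper studies.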


