Abstract
Cross-validation is widely used to estimate the generalization error of a model, and 2-fold cross-validation in particular is widely used for comparing classification models. This paper reports a simulation study of how different ways of partitioning the data in 2-fold cross-validation affect the estimated performance of a Logistic regression classifier when every feature (independent variable) takes only the values 0 and 1. The results show that the 2-fold cross-validation estimates of accuracy, recall, F-measure, and precision have the smallest bias when the two folds have identical or similar class distributions, and that the bias grows as the class distributions of the two folds diverge. When the class distributions of the two folds differ substantially, the performance estimates degrade markedly. Therefore, when partitioning data for cross-validation, the class distribution of each fold should be kept as close as possible to that of the overall data.
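The kind of partitioning the abstract describes can be illustrated with a minimal sketch. It is not the authors' simulation code: the function names `two_fold_split` and `positive_rate`, the parameter `pos_share_fold1`, and the toy label vector are all assumptions made for illustration. The sketch only builds two equal-sized folds whose class distributions can be made identical (a stratified split) or deliberately different; fitting and scoring a Logistic regression model on each fold is omitted.

```python
import random


def two_fold_split(labels, pos_share_fold1=0.5, seed=0):
    """Split example indices into two equal-sized folds (hypothetical helper).

    pos_share_fold1 is the fraction of all positive examples placed in fold 1.
    A value of 0.5 gives a stratified split, so both folds match the overall
    class ratio; values further from 0.5 make the two folds diverge.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    half = len(labels) // 2
    n_pos1 = round(pos_share_fold1 * len(pos))
    # Clamp so that fold 1 can still be filled to `half` examples
    # with the negatives that are available.
    n_pos1 = max(half - len(neg), min(n_pos1, half, len(pos)))
    n_neg1 = half - n_pos1

    fold1 = pos[:n_pos1] + neg[:n_neg1]
    fold2 = pos[n_pos1:] + neg[n_neg1:]
    return fold1, fold2


def positive_rate(labels, idx):
    """Proportion of positive labels among the given indices."""
    return sum(labels[i] for i in idx) / len(idx)


# Toy data (an assumption): 40 positives and 60 negatives.
labels = [1] * 40 + [0] * 60

for share in (0.5, 0.7, 0.9):
    f1, f2 = two_fold_split(labels, share)
    print(share, positive_rate(labels, f1), positive_rate(labels, f2))
```

With `pos_share_fold1=0.5` both folds keep the overall 40% positive rate, which corresponds to the setting the abstract recommends. Larger values push the two folds apart, which is the condition under which the abstract reports increasingly biased performance estimates.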
Source
《太原师范学院学报(自然科学版)》
2013, No. 1, pp. 53-58 (6 pages)
Journal of Taiyuan Normal University (Natural Science Edition)