In regression, despite both being aimed at estimating the Mean Squared Prediction Error (MSPE), Akaike’s Final Prediction Error (FPE) and the Generalized Cross Validation (GCV) selection criteria are usually derived from two quite different perspectives. Here, settling on the most commonly accepted definition of the MSPE as the expectation of the squared prediction error loss, we provide theoretical expressions for it, valid for any linear model (LM) fitter, be it under random or non-random designs. Specializing these MSPE expressions for each of them, we derive closed formulas of the MSPE for some of the most popular LM fitters: Ordinary Least Squares (OLS), with or without a full column rank design matrix; Ordinary and Generalized Ridge regression, the latter embedding smoothing splines fitting. For each of these LM fitters, we then deduce a computable estimate of the MSPE which turns out to coincide with Akaike’s FPE. Using a slight variation, we similarly get a class of MSPE estimates coinciding with the classical GCV formula for those same LM fitters.
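For readers who want to compute the two criteria, a minimal sketch follows, assuming the classical formulas FPE = (RSS/n)(n + p)/(n − p) and GCV = (RSS/n)/(1 − p/n)², with p the trace of the hat matrix (for OLS, the rank of the design matrix); the paper's general LM-fitter expressions are not reproduced here.

```python
import numpy as np

def fpe_gcv_ols(X, y):
    """FPE and GCV estimates of the MSPE for an OLS fit.

    Uses the classical formulas FPE = (RSS/n) * (n + p) / (n - p) and
    GCV = (RSS/n) / (1 - p/n)**2, where p is the trace of the hat
    matrix -- for OLS, the rank of X (full column rank not required).
    """
    n = X.shape[0]
    beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)  # min-norm fit if rank-deficient
    rss = np.sum((y - X @ beta) ** 2)
    p = rank                                 # effective degrees of freedom
    fpe = (rss / n) * (n + p) / (n - p)
    gcv = (rss / n) / (1.0 - p / n) ** 2
    return fpe, gcv

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
print(fpe_gcv_ols(X, y))
```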
To improve traits such as yield and disease resistance, breeding programs need stronger selection tools. A cross-validation experiment is designed to partition soybean genotype and phenotype data into groups, allocated according to the numbers of individuals and markers in the data, and gBLUP (genomic Best Linear Unbiased Prediction) is used for phenotype prediction. Comparing multiple soybean traits across different groupings yields a range of prediction accuracies, providing a basis for subsequent breeding analysis. When only soybean genotype data are available without phenotypes, phenotypes must be simulated according to a specified heritability and number of simulated QTL (NQTN); the different groupings are then applied again to obtain accuracies, which extends the flexibility of prediction with soybean data.
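A minimal sketch of the grouping scheme follows, using the ridge (RR-BLUP) form as a stand-in for gBLUP; the shrinkage constant lam, the fold count k, and the phenotype-simulation details (heritability h2, NQTN) are illustrative assumptions, not values from the study.

```python
import numpy as np

def simulate_phenotype(M, nqtn=10, h2=0.5, seed=0):
    """Simulate phenotypes from genotypes for a given heritability h2 and
    number of simulated QTL (NQTN), echoing the scheme in the text."""
    rng = np.random.default_rng(seed)
    qtl = rng.choice(M.shape[1], nqtn, replace=False)
    g = M[:, qtl] @ rng.normal(size=nqtn)                       # genetic values
    e = rng.normal(scale=np.sqrt(g.var() * (1 - h2) / h2), size=M.shape[0])
    return g + e

def gblup_cv(M, y, k=5, lam=1.0, seed=0):
    """k-fold cross-validated prediction accuracy of a gBLUP-style model.

    M: n x m marker matrix (0/1/2), y: phenotypes. gBLUP is fitted in its
    equivalent ridge (RR-BLUP) form; lam plays the role of the variance
    ratio and is an assumed constant here, not an estimated one.
    """
    n, m = M.shape
    folds = np.random.default_rng(seed).permutation(n) % k      # balanced groups
    accs = []
    for f in range(k):
        tr, te = folds != f, folds == f
        mu = y[tr].mean()
        # ridge solution for marker effects: (X'X + lam*I)^-1 X'(y - mu)
        u = np.linalg.solve(M[tr].T @ M[tr] + lam * np.eye(m), M[tr].T @ (y[tr] - mu))
        accs.append(np.corrcoef(mu + M[te] @ u, y[te])[0, 1])   # accuracy = correlation
    return float(np.mean(accs))

# Toy usage with simulated genotypes and phenotypes
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(200, 500)).astype(float)
y = simulate_phenotype(M, nqtn=10, h2=0.5)
print("5-fold gBLUP accuracy:", gblup_cv(M, y))
```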
A method for fast l-fold cross validation is proposed for the regularized extreme learning machine (RELM). The computational time of fast l-fold cross validation increases as the fold number decreases, the opposite of naive l-fold cross validation. Fast l-fold cross validation therefore has an advantage in computational time over the naive version, especially for large fold numbers such as l > 20. To corroborate the efficacy and feasibility of fast l-fold cross validation, experiments are conducted on five benchmark regression data sets.
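Only the naive baseline can be sketched from the abstract; the paper's fast variant, which reuses shared computations across folds, is not specified here. The hidden-layer size and regularization constant below are assumed values.

```python
import numpy as np

def relm_fit_predict(Xtr, ytr, Xte, n_hidden=50, C=1.0, seed=0):
    """Regularized ELM: random hidden layer, ridge-regularized output weights.
    n_hidden and C are assumed hyperparameters, not values from the paper."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (Xtr.shape[1], n_hidden))
    b = rng.uniform(-1, 1, n_hidden)
    Htr, Hte = np.tanh(Xtr @ W + b), np.tanh(Xte @ W + b)
    beta = np.linalg.solve(Htr.T @ Htr + C * np.eye(n_hidden), Htr.T @ ytr)
    return Hte @ beta

def naive_lfold_mse(X, y, l=10, seed=0):
    """Naive l-fold CV: refit the RELM from scratch l times. The fast
    variant in the paper avoids this refitting and gets cheaper as l grows."""
    folds = np.random.default_rng(seed).permutation(X.shape[0]) % l
    errs = []
    for f in range(l):
        tr, te = folds != f, folds == f
        pred = relm_fit_predict(X[tr], y[tr], X[te])
        errs.append(np.mean((y[te] - pred) ** 2))
    return float(np.mean(errs))

# Toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
print("10-fold MSE:", naive_lfold_mse(X, y, l=10))
```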
Background: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model. Methods: Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis. Results: The efficient leave-one-out cross validation strategy is 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model increase with the number of observations. Conclusions: Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
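The flavor of such strategies can be illustrated with the standard leave-one-out shortcut for linear smoothers, e₍₋ᵢ₎ = eᵢ/(1 − hᵢᵢ), which yields all n validation residuals from a single fit; the ridge form and the shrinkage value lam below are assumptions, not the paper's exact model.

```python
import numpy as np

def loocv_mse_one_fit(X, y, lam=1.0):
    """Exact leave-one-out CV error from a single analysis, via the linear-
    smoother identity e_(-i) = e_i / (1 - h_ii) with y_hat = H y.
    H here is the ridge hat matrix with an assumed shrinkage lam; for
    m >> n one would work with the equivalent n x n kernel form instead."""
    n, m = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(m), X.T)
    e = y - H @ y                     # ordinary residuals from one fit
    loo = e / (1.0 - np.diag(H))      # exact LOO residuals, no refitting
    return float(np.mean(loo ** 2))

# Toy usage: n validation residuals at the cost of one analysis
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
y = X @ rng.normal(size=20) + rng.normal(size=100)
print("LOOCV MSE:", loocv_mse_one_fit(X, y))
```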
In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. In order to assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance level. We assess whether 2-step approaches using global or parameterwise shrinkage (PWSF) can improve selected models, and we compare the results to models derived with the LASSO procedure. Besides MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the BE models selected were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF had better performance. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.
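A minimal sketch of two of the compared strategies, backward elimination at a fixed significance level and the LASSO, is given below on simulated data; the post-selection shrinkage step (PWSF) is not reproduced, and alpha = 0.157 is an illustrative choice, not the study's single setting.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def backward_elimination(X, y, alpha=0.157):
    """Backward elimination: repeatedly drop the predictor with the largest
    p-value above alpha. alpha = 0.157 is an illustrative level; the paper
    studies several."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        pvals = np.asarray(fit.pvalues)[1:]   # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        cols.pop(worst)
    return cols

# Simulated data loosely echoing the study's setup (15 predictors)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 15))
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(size=100)
print("BE keeps:", backward_elimination(X, y))
lasso = LassoCV(cv=5).fit(X, y)               # penalty chosen by cross-validation
print("LASSO keeps:", np.flatnonzero(lasso.coef_).tolist())
```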
In this paper, the support vector machine (SVM) is applied to classification. The paper first discusses the basic principle of the SVM; SVM classifiers with a polynomial kernel and a Gaussian radial basis function (RBF) kernel are then chosen to identify pupils who have difficulties in writing. The 10-fold cross-validation method for training and validation is introduced. The aim of this paper is to compare the performance of support vector machines with RBF and polynomial kernels for classifying pupils with or without handwriting difficulties. Experimental results showed that the SVM with the RBF kernel performs better than the one with the polynomial kernel.
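The comparison can be reproduced in outline with scikit-learn; since the handwriting data are not available here, a synthetic classification task stands in for them, and C = 1.0 is an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the (non-public) handwriting data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for kernel in ("rbf", "poly"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV as in the text
    print(f"{kernel}: mean accuracy {scores.mean():.3f}")
```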
Cross iteration often exists in the computational process of simulation models, especially control models, and there is a credibility defect tracing problem in the validation of models with cross iteration. To resolve this problem, after the problem formulation, a validation theorem on cross iteration is proposed and proved under the cross-iteration circumstance. Applying the proposed theorem, a credibility calculation algorithm is provided and the solution to defect tracing is explained. Further, based on the validation theorem, a validation method for simulation models with cross iteration is proposed and illustrated step by step with a flowchart. Finally, a validation example of a six-degree-of-freedom (DOF) flight vehicle model is provided, and the validation process is performed using the proposed method. The result analysis shows that the method is effective for obtaining the credibility of the model and accomplishing the defect tracing of the validation.
For the nonparametric regression model $Y_{ni} = g(x_{ni}) + \varepsilon_{ni}$, $i = 1, \ldots, n$, with regularly spaced nonrandom design, the authors study the behavior of the nonlinear wavelet estimator of g(x). When the threshold and truncation parameters are chosen by cross-validation on the average squared error, strong consistency for the case of dyadic sample size and moment consistency for arbitrary sample size are established under some regularity conditions.
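A sketch of choosing a soft threshold by cross-validation on the average squared error follows, using PyWavelets; the even/odd data split and the threshold grid are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np
import pywt

def cv_threshold(y, wavelet="db4", grid=np.linspace(0.05, 2.0, 40)):
    """Choose a soft threshold by cross-validation on the average squared
    error: denoise the even-indexed half, score it against the odd half.
    The even/odd split is an assumed stand-in for the paper's scheme."""
    def denoise(x, t):
        coeffs = pywt.wavedec(x, wavelet)
        coeffs = [coeffs[0]] + [pywt.threshold(c, t, mode="soft") for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[: len(x)]

    even, odd = y[0::2], y[1::2]
    n = min(len(even), len(odd))
    scores = [np.mean((denoise(even[:n], t) - odd[:n]) ** 2) for t in grid]
    return float(grid[int(np.argmin(scores))])

# Toy usage on a regularly spaced nonrandom design
x = np.linspace(0, 1, 256)
y = np.sin(4 * np.pi * x) + 0.3 * np.random.default_rng(0).normal(size=256)
print("CV-chosen threshold:", cv_threshold(y))
```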