Laser-induced breakdown spectroscopy (LIBS) has become a widely used atomic spectroscopic technique for rapid coal analysis. However, the vast amount of spectral information in LIBS contains signal uncertainty, which can affect its quantification performance. In this work, we propose a hybrid variable selection method to improve the performance of LIBS quantification. Important variables are first identified using Pearson's correlation coefficient, mutual information, the least absolute shrinkage and selection operator (LASSO) and random forest, and then filtered and combined with empirical variables related to fingerprint elements of coal ash content. Subsequently, these variables are fed into a partial least squares regression (PLSR) model. Additionally, in some models, certain variables unrelated to ash content are removed manually to study the impact of variable deselection on model performance. The proposed hybrid strategy was tested on three LIBS datasets for quantitative analysis of coal ash content and compared with the corresponding data-driven baseline method. It is significantly better than variable selection based on empirical knowledge alone and in most cases outperforms the baseline method. On the three datasets, the hybrid strategy combining empirical knowledge and data-driven algorithms achieved the lowest root mean square error of prediction (RMSEP) values of 1.605, 3.478 and 1.647, respectively, significantly lower than the 1.959, 3.718 and 2.181 obtained from multiple linear regression using only 12 empirical variables. The LASSO-PLSR model with empirical support and 20 selected variables performed significantly better after variable deselection, with RMSEP values dropping from 1.635, 3.962 and 1.647 to 1.483, 3.086 and 1.567, respectively. These results demonstrate that using empirical knowledge to support data-driven variable selection can be a viable approach to improving the accuracy and reliability of LIBS quantification.
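The selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses only Pearson's correlation (one of the four ranking criteria named), the function names `pearson` and `hybrid_select` are hypothetical, the toy data are invented, and the subsequent PLSR fit is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / math.sqrt(sxx * syy)

def hybrid_select(columns, y, empirical_idx, n_data_driven):
    """Union of the top-|r| data-driven variables and the empirical ones."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs(pearson(columns[j], y)),
                    reverse=True)
    data_driven = ranked[:n_data_driven]
    return sorted(set(data_driven) | set(empirical_idx))

# Invented toy data: 4 spectral channels (columns) measured on 5 samples.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # tracks y closely
    [2.0, 1.0, 2.0, 1.0, 2.0],  # noise-like
    [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated with y
    [0.0, 1.0, 0.0, 1.0, 0.5],  # weakly related, kept as an "empirical" line
]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

# Channel 3 is assumed to be an empirically known fingerprint line.
selected = hybrid_select(columns, y, empirical_idx=[3], n_data_driven=2)
print(selected)
```

Taking the union means an empirically motivated channel is kept even when its data-driven rank is low, which is the sense in which empirical knowledge supports the data-driven selection here.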
With the rapid development of DNA technologies, high-throughput genomic data have become a powerful lever for locating desirable genetic loci associated with important traits in various crop species. However, current genetic association mapping analyses focus on identifying individual QTLs. This study aimed to identify a set of QTLs or genetic markers that can capture genetic variability for marker-assisted selection. Selecting a set of k loci that maximizes genetic variation out of high-throughput genomic data is a challenging problem. We propose an adaptive sequential replacement (ASR) method, a variant of the sequential replacement (SR) method. Through Monte Carlo simulation and comparison with four other selection methods (exhaustive, SR, forward and backward), we found that the ASR method yields consistent and repeatable results comparable to those of the exhaustive method at a much lower computational cost.
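For orientation, plain sequential replacement (the SR baseline, not the authors' adaptive variant, whose details the abstract does not give) can be sketched as below. The selection criterion used here, the residual sum of squares of a no-intercept least-squares fit, is an assumption, as are the function names and the toy data.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rss(columns, subset, y):
    """Residual sum of squares of y regressed (no intercept) on the subset."""
    residual = list(y)
    basis = []  # orthonormal basis of the span of the subset (Gram-Schmidt)
    for j in subset:
        q = list(columns[j])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm < 1e-12:
            continue  # column adds nothing new to the span
        q = [qi / norm for qi in q]
        basis.append(q)
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return dot(residual, residual)

def sequential_replacement(columns, y, start):
    """Swap one variable at a time for as long as some swap lowers the RSS."""
    current = list(start)
    best = rss(columns, current, y)
    improved = True
    while improved:
        improved = False
        for pos in range(len(current)):
            for j in range(len(columns)):
                if j in current:
                    continue
                trial = current[:pos] + [j] + current[pos + 1:]
                score = rss(columns, trial, y)
                if score < best - 1e-12:
                    current, best, improved = trial, score, True
    return sorted(current), best

# Invented toy data: 4 candidate markers on 6 individuals; y = 3*x1 + 2*x2.
columns = [
    [1, 1, 1, 1, 0, 0],
    [1, -1, 0, 0, 1, 0],
    [0, 0, 1, -1, 0, 1],
    [1, 0, -1, 0, 0, 1],
]
y = [3, -3, 2, -2, 3, 2]

subset, best_rss = sequential_replacement(columns, y, start=[0, 3])
print(subset, best_rss)
```

Each pass costs on the order of k times p model fits for k selected loci out of p candidates, far fewer than the number of subsets an exhaustive search must evaluate, though SR can stop at a local optimum.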
Coal is a crucial fossil energy source in today's society, and the detection of sulfur (S) and nitrogen (N) in coal is essential for evaluating coal quality. An efficient method is therefore needed to quantitatively analyze N and S content in coal and thereby support the clean utilization of coal. This study applied laser-induced breakdown spectroscopy (LIBS) to test coal quality and combined two variable selection algorithms, competitive adaptive reweighted sampling (CARS) and the successive projections algorithm (SPA), with partial least squares (PLS) modeling. The results were as follows. The PLS model built on the full spectrum of 27,620 variables had poor accuracy: the coefficient of determination of the test set (R^2_P) and the root mean square error of the test set (RMSEP) were 0.5172 and 0.2263 for nitrogen, and 0.5784 and 0.5811 for sulfur. CARS-PLS screened 37 and 25 variables for N and S, respectively, but the predictive ability of the model did not improve significantly. SPA-PLS screened 14 and 11 variables, respectively, through successive projections and gave the best predictions among the three methods: R^2_P and RMSEP were 0.9873 and 0.0208 for nitrogen, and 0.9451 and 0.2082 for sulfur. In general, the predictive results for the two elements improved by about 90% for RMSEP and 60% for R^2_P compared with PLS. The results show that LIBS combined with SPA-PLS has good potential for detecting N and S content in coal and is a very promising technology for industrial application.
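The size of the improvement can be checked directly from the figures quoted in this abstract. The short calculation below is only a worked arithmetic example; the helper name `relative_change` is hypothetical, and the inputs are the full-spectrum PLS and SPA-PLS values as reported.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

# (full-spectrum PLS, SPA-PLS) values quoted in the abstract.
metrics = {
    "N RMSEP": (0.2263, 0.0208),
    "S RMSEP": (0.5811, 0.2082),
    "N R2P": (0.5172, 0.9873),
    "S R2P": (0.5784, 0.9451),
}

changes = {name: relative_change(b, a) for name, (b, a) in metrics.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these figures, both nitrogen metrics change by roughly 91% (RMSEP down, R2P up), while both sulfur metrics change by roughly 63 to 64%.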
In this article, we study variable selection for the partially linear single-index model (PLSIM). Building on minimized average variance estimation, variable selection for the PLSIM is carried out by minimizing the average variance with an adaptive L1 penalty. An implementation algorithm is given. Under some regularity conditions, we establish the oracle properties of the adaptive LASSO (aLASSO) procedure for the PLSIM. Simulations are used to investigate the effectiveness of the proposed method.
Determining multiple chemical properties simultaneously is a classical problem in analytical chemistry. The main difficulty is finding the subset of variables that best represents the compounds. These variables are obtained with a spectrophotometer, which measures hundreds of correlated variables related to physicochemical properties that can be used to estimate the component of interest. The problem is to select a subset of informative and uncorrelated variables that helps minimize the prediction error. Classical algorithms select a separate subset of variables for each compound considered. In this work, we propose the use of SPEA-II (strength Pareto evolutionary algorithm II) and show that the variable selection algorithm can select a single subset to be used for multiple determinations with multiple linear regression. The case study uses wheat data obtained by near-infrared (NIR) spectrometry, where the objective is to determine a variable subset carrying information about protein content (%), test weight (kg/hl), wheat kernel texture (WKT) (%) and farinograph water absorption (%). Results from traditional multivariate calibration techniques, namely the successive projections algorithm (SPA), partial least squares (PLS) and a mono-objective genetic algorithm, are presented for comparison. For NIR spectral analysis of protein concentration in wheat, SPEA-II reduced the number of selected variables from 775 spectral variables to just 10. The prediction error decreased from 0.2 with the classical methods to 0.09 with the proposed approach, a reduction of 37%. The model using variables selected by SPEA-II had better prediction performance than the classical algorithms and full-spectrum partial least squares.
This paper discusses Bayesian variable selection methods for models from split-plot mixture designs using samples from the Metropolis-Hastings within Gibbs sampling algorithm. Bayesian variable selection is easy to implement owing to improvements in computing via MCMC sampling. We describe the Bayesian methodology by introducing the Bayesian framework and explaining Markov chain Monte Carlo (MCMC) sampling. Metropolis-Hastings within Gibbs sampling is used to draw dependent samples from the full conditional distributions, which are also explained. In mixture experiments with process variables, the response depends not only on the proportions of the mixture components but also on the effects of the process variables. In many such mixture-process variable experiments, constraints such as time or cost prohibit selecting treatments completely at random. In these situations, restrictions on the randomisation force the level combinations of one group of factors to be fixed while the combinations of the other group of factors are run. Then a new level of the first factor group is set and the combinations of the other factors are run again. We discuss the computational algorithm for Stochastic Search Variable Selection (SSVS) in linear mixed models and extend it to fit models from split-plot mixture designs by introducing Stochastic Search Variable Selection for Split-plot Design (SSVS-SPD). The motivation for this extension is that the split-plot mixture design has two different levels of experimental units, one for the whole plots and the other for the subplots.
In experimental work, researchers very often need to select the best subset model and obtain the best model estimates simultaneously. Selecting the best subset of variables improves prediction accuracy because noninformative variables are removed, and a model with high prediction accuracy can be used for future forecasting. In this paper, we investigate the differences between various variable selection methods. The aim is to compare a frequentist method (backward elimination), a penalised shrinkage method (the adaptive LASSO) and Least Angle Regression (LARS) for selecting the active variables in data produced by a blocked design experiment. The results of the comparative study support the use of the LARS method for statistical analysis of data from blocked experiments.
Although there are many papers on variable selection methods based on the mean model in finite mixtures of regression models, little work has been done on how to select significant explanatory variables when modeling the variance parameter. In this paper, we propose and study a novel class of models, a skew-normal mixture of joint location and scale models, to analyze heteroscedastic skew-normal data coming from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies, which are also illustrated with an example.
In this study, different methods of variable selection for multilinear stepwise regression (MLR) and support vector regression (SVR) were compared, based on genetic algorithms (GAs) using different types of chromosomes. The first method is a GA with a binary chromosome (GA-BC) and the other is a GA with a fixed-length character chromosome (GA-FCC). The overall prediction accuracy for the training set was tested by 7-fold cross-validation, and all the regression models were evaluated on the test set. The poor prediction for the test set shows that the forward stepwise regression (FSR) model is more prone to overfitting the training set. The results using SVR methods showed that this overfitting could be overcome. Further, the GA-BC-SVR method overfits more easily because too many variables are quickly introduced into the model. The final optimal model, obtained with the GA-FCC-SVR method, had good predictive ability (R^2 = 0.885, S = 0.469, R^2_cv = 0.700, S_cv = 0.757, R^2_ex = 0.692, S_ex = 0.675). Our investigation indicates that variable selection using GA-FCC is the most appropriate for the MLR and SVR methods.
A simple but efficient method is proposed for selecting variables in heteroscedastic regression models. It is shown that the pseudo empirical wavelet coefficients corresponding to the significant explanatory variables in the regression models are clearly larger than those corresponding to the nonsignificant ones, and a variable selection procedure is developed on this basis. The coefficients of the models are also estimated. All estimators are proved to be consistent.
Systematic customer analysis is one way for an enterprise to understand consumer behavior patterns efficiently and in depth. Further investigation of customer patterns helps the firm make efficient decisions, which in turn helps optimize the enterprise's business and maximize consumer satisfaction. To assess customers effectively, Naive Bayes (NB, also called Simple Bayes), a machine learning model, is used. However, the effectiveness of the NB model depends entirely on the consumer data used: NB presumes that the attributes are independent, a premise that does not hold for consumer data in practice, so uncertain and redundant attributes lead the model to poor prediction results. In this work, an ensemble attribute selection methodology is applied to overcome this problem and to pick a stable, uncorrelated attribute set for modeling with the NB classifier. Two ensemble variable selection strategies are applied: one is based on data perturbation (a homogeneous ensemble, in which the same feature selector is applied to different subsamples derived from the same learning set) and the other on function perturbation (a heterogeneous ensemble, in which different feature selectors are applied to the same learning set). The feature sets obtained from the two ensemble strategies are each applied to NB and the outcomes are evaluated. The experimental results show that the proposed ensemble strategies are effective in choosing a stable attribute set and in improving NB classification performance.
This paper compares variable selection methods for multiple linear regression models whose full model contains both relevant and irrelevant variables, including the case in which the predictor variables are highly correlated (0.999). Two objective functions are used in the Tabu Search: the mean square error (MSE) and the mean absolute error (MAE). The results of Tabu Search are compared with those obtained by the stepwise regression method on the basis of the hit percentage criterion. The simulations cover both cases, without and with multicollinearity. For each situation, 1,000 iterations are examined with sample sizes n = 25 and 100 at the 0.05 level of significance. Without multicollinearity, the hit percentages of the stepwise regression method and of Tabu Search using the MSE objective function are almost the same and slightly higher than that of Tabu Search using the MAE objective function. With multicollinearity, however, the hit percentages of Tabu Search using both objective functions are higher than the hit percentage of the stepwise regression method.
There are two fundamental goals in statistical learning: identifying relevant predictors and ensuring high prediction accuracy. The first goal, pursued by means of variable selection, is of particular importance when the true underlying model has a sparse representation. Discovering relevant predictors can enhance the prediction performance of the fitted model. Usually an estimate is considered desirable if it is consistent in terms of both coefficient estimation and variable selection. Hence, before we try to estimate the regression coefficients β, it is preferable to have a set of useful predictors in hand. The emphasis of this paper is to propose a method for identifying relevant predictors that ensures screening consistency in variable selection. The primary interest is in Orthogonal Matching Pursuit (OMP).
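As background for readers unfamiliar with OMP, the textbook greedy procedure can be sketched as below. This is a generic illustration under stated assumptions, not the screening method proposed in the paper: the function name `omp`, the fixed number of steps `n_select` as a stopping rule, and the toy data are all invented for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def omp(columns, y, n_select):
    """Greedy orthogonal matching pursuit.

    Returns (selected column indices in order of entry, final residual).
    """
    residual = list(y)
    basis = []     # orthonormal basis of the span of the selected columns
    selected = []
    for _ in range(n_select):
        # Pick the unused column most correlated with the current residual.
        best_j, best_score = None, 0.0
        for j, col in enumerate(columns):
            if j in selected:
                continue
            norm = math.sqrt(dot(col, col))
            if norm == 0.0:
                continue
            score = abs(dot(col, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # residual is orthogonal to every remaining column
        # Orthogonalize the new column against the basis (Gram-Schmidt).
        q = list(columns[best_j])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < 1e-12:
            break  # column is linearly dependent on those already chosen
        q = [qi / q_norm for qi in q]
        basis.append(q)
        selected.append(best_j)
        # Projecting out q leaves the least-squares residual of y on all
        # selected columns, which is what makes the pursuit "orthogonal".
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return selected, residual

# Invented toy data: 4 predictors on 6 samples; sparse truth y = 3*x1 + 2*x2.
columns = [
    [1, 1, 1, 1, 0, 0],
    [1, -1, 0, 0, 1, 0],
    [0, 0, 1, -1, 0, 1],
    [1, 0, -1, 0, 0, 1],
]
y = [3, -3, 2, -2, 3, 2]

selected, residual = omp(columns, y, n_select=2)
print(selected)
```

In this noise-free toy case the two truly active predictors are recovered and the residual vanishes; with noisy data the number of steps would need a data-driven stopping rule.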
Variable selection is one of the most fundamental problems in regression analysis. By sampling from the posterior distributions of candidate models, Bayesian variable selection via MCMC (Markov chain Monte Carlo) is effective in overcoming the computational burden of all-subset variable selection approaches. However, the convergence of the MCMC is often hard to determine, and one is often unsure whether the obtained samples are unbiased. This complication has limited the application of Bayesian variable selection in practice. Based on the idea of CFTP (coupling from the past), perfect sampling schemes have been developed to obtain independent samples from the posterior distribution for a variety of problems. Here the authors propose an efficient and effective perfect sampling algorithm for Bayesian variable selection in linear regression models, which draws independent and identically distributed samples from the posterior distribution of the model space and can efficiently handle thousands of variables. The effectiveness of the algorithm is illustrated by three simulation studies with up to thousands of variables. The method is further illustrated in an SNP (single nucleotide polymorphism) association study among RA (rheumatoid arthritis) patients.
Variable selection is widely applied in visible-near infrared (Vis-NIR) spectroscopy analysis of the internal quality of fruits. Different spectral variable selection methods were compared for online quantitative analysis of soluble solids content (SSC) in navel oranges. Moving window partial least squares (MW-PLS), Monte Carlo uninformative variable elimination (MC-UVE) and the wavelet transform (WT) combined with MC-UVE were used to select the spectral variables and develop calibration models for online analysis of SSC in navel oranges. The performance of these methods was compared on the Vis-NIR data sets of navel orange samples. The results show that the WT-MC-UVE method gave the better calibration model, with a higher correlation coefficient (r) of 0.89 and a lower root mean square error of prediction (RMSEP) of 0.54 at 5 fruits per second. It is concluded that Vis-NIR spectroscopy coupled with WT-MC-UVE may be a fast and effective tool for online quantitative analysis of SSC in navel oranges.
Penalized variable selection methods are often used to select the relevant covariates and estimate the unknown regression coefficients simultaneously, but existing methods may fail to be consistent when the covariates are highly correlated. In this paper, the semi-standard partial covariance (SPAC) method with the Lasso penalty is proposed for the generalized linear model with highly correlated covariates, and the consistency of estimation and of variable selection is shown in high-dimensional settings under some regularity conditions. Simulation studies and an analysis of a colon tumor dataset show that the proposed method handles highly correlated covariates better than traditional penalized variable selection methods.
Deep learning has become increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced model interpretability. However, because deep learning requires a large sample size, existing methods may produce uncertain findings when the dataset is small, as is common in omics data analysis. With the explosion and availability of omics data from multiple populations/studies, existing methods naively pool them into one dataset to increase the sample size while ignoring that variable structures can differ across datasets, which might lead to inaccurate variable selection results. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and considers both homogeneity and heterogeneity among the datasets in an integrative analysis framework. Results from extensive simulation studies and applications of PIN to gene expression datasets from elders with different cognitive statuses or ovarian cancer patients at different stages demonstrate that PIN outperforms existing methods, with considerably improved performance across multiple datasets. The source code is freely available on GitHub (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of disease-related important variables based on multiple studies/datasets from diverse origins.
Variable selection for high-dimensional nonparametric nonlinear systems aims to select the contributing variables or to eliminate the redundant ones. For such a system, however, identifying whether a variable contributes is not easy. Therefore, a novel variable selection approach based on the Fourier spectrum of the density-weighted derivative is developed, which does not suffer from the curse of dimensionality and improves identification accuracy. Furthermore, a necessary and sufficient condition for testing whether a variable contributes is provided. The proposed approach does not require strong assumptions on the distribution, such as an elliptical distribution. A simulation study verifies the effectiveness of the novel variable selection algorithm.
A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression models based on a mixture of skew-normal distributions to explore asymmetric data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including consistency of the variable selection method and the oracle property in estimation. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computation is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.
Variable selection for varying coefficient models includes the separation of varying and constant effects, and the selection of variables with nonzero varying effects and of those with nonzero constant effects. This paper proposes a unified variable selection approach, called the double-penalized quadratic inference functions method, for varying coefficient models of longitudinal data. The proposed method can not only separate varying coefficients from constant coefficients, but also estimate and select the nonzero varying coefficients and nonzero constant coefficients. It is suitable for variable selection in linear models, varying coefficient models and partial linear varying coefficient models. Under regularity conditions, the proposed method is consistent in both the separation and the selection of varying and constant coefficients. The obtained estimators of the varying coefficients attain the optimal convergence rate of nonparametric function estimation, and the estimators of the nonzero constant coefficients are consistent and asymptotically normal. Finally, the authors investigate the finite sample performance of the proposed method through simulation studies and a real data analysis. The results show that the proposed method performs better than the existing competitor.
Funding (hybrid variable selection for LIBS coal ash analysis): National Natural Science Foundation of China (No. 62205172); Huaneng Group Science and Technology Research Project (No. HNKJ22-H105); Tsinghua University Initiative Scientific Research Program; and the International Joint Mission on Climate Change and Carbon Neutrality.
Funding (LIBS detection of nitrogen and sulfur in coal): Jiangsu Government Scholarship for Overseas Studies (JS-2019-031); Startup Foundation for Introducing Talent of NUIST (2243141701023).
Abstract: In this article, we study variable selection for the partially linear single-index model (PLSIM). Based on minimized average variance estimation, variable selection for PLSIM is carried out by minimizing the average variance with an adaptive l1 penalty. An implementation algorithm is given. Under some regularity conditions, we demonstrate the oracle properties of the aLASSO procedure for PLSIM. Simulations are used to investigate the effectiveness of the proposed method for variable selection in PLSIM.
Abstract: Determining multiple chemical properties simultaneously is a classical problem in analytical chemistry. The main difficulty is finding the subset of variables that best represents the compounds. These variables are obtained with a spectrophotometer, which measures hundreds of correlated variables related to physicochemical properties that can be used to estimate the component of interest. The problem is to select a subset of informative and uncorrelated variables that helps minimize the prediction error. Classical algorithms select a separate subset of variables for each compound considered. In this work we propose the use of SPEA-II (strength Pareto evolutionary algorithm II). We aim to show that the variable selection algorithm can select a single subset that serves multiple determinations using multiple linear regression. The case study uses wheat data obtained by NIR (near-infrared) spectrometry, where the objective is to determine a variable subgroup carrying information about E protein content (%), test weight (kg/hl), WKT (wheat kernel texture) (%) and farinograph water absorption (%). The results of traditional multivariate calibration techniques, namely SPA (successive projections algorithm), PLS (partial least squares) and a mono-objective genetic algorithm, are presented for comparison. For NIR spectral analysis of protein concentration in wheat, the number of variables selected from 775 spectral variables was reduced to just 10 by the SPEA-II algorithm. The prediction error decreased from 0.2 in the classical methods to 0.09 in the proposed approach, a reduction of 37%. The model using variables selected by SPEA-II had better prediction performance than the classical algorithms and full-spectrum partial least squares.
Abstract: This paper discusses Bayesian variable selection methods for models from split-plot mixture designs using samples from the Metropolis-Hastings within Gibbs sampling algorithm. Bayesian variable selection is easy to implement owing to improvements in computing via Markov chain Monte Carlo (MCMC) sampling. We describe the Bayesian methodology by introducing the Bayesian framework and explaining MCMC sampling. Metropolis-Hastings within Gibbs sampling is used to draw dependent samples from the full conditional distributions, which are explained. In mixture experiments with process variables, the response depends not only on the proportions of the mixture components but also on the effects of the process variables. In many such mixture-process variable experiments, constraints such as time or cost prohibit the selection of treatments completely at random. In these situations, restrictions on the randomisation force the level combinations of one group of factors to be fixed while the combinations of the other group of factors are run. Then a new level of the first-factor group is set and combinations of the other factors are run. We discuss the computational algorithm for Stochastic Search Variable Selection (SSVS) in linear mixed models. We extend the computational algorithm of SSVS to fit models from split-plot mixture designs by introducing the algorithm of Stochastic Search Variable Selection for Split-plot Design (SSVS-SPD). The motivation for this extension is that the split-plot mixture design has two different levels of experimental units, one for the whole plots and the other for the subplots.
Abstract: In experimental work, researchers very often need to select the best subset model and obtain the best model estimates simultaneously. Selecting the best subset of variables improves prediction accuracy because noninformative variables are removed. A model with high prediction accuracy can then be used for future forecasting. In this paper, we investigate the differences between various variable selection methods. The aim is to compare a frequentist method (backward elimination), a penalised shrinkage method (the adaptive LASSO) and Least Angle Regression (LARS) for selecting the active variables in data produced by a blocked design experiment. The comparative study supports the use of the LARS method for statistical analysis of data from blocked experiments.
Funding: Supported by the National Natural Science Foundation of China (11861041).
Abstract: Although there are many papers on variable selection methods based on the mean model in finite mixtures of regression models, little work has been done on how to select significant explanatory variables when modeling the variance parameter. In this paper, we propose and study a novel class of models: a skew-normal mixture of joint location and scale models for analyzing heteroscedastic skew-normal data from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies. An example illustrates the proposed methodologies.
Funding: Supported by the Youth Foundation of the Education Department of Sichuan Province (No. 09ZB038).
Abstract: In this study, variable selection methods based on multilinear stepwise regression (MLR) and support vector regression (SVR) were compared, using genetic algorithms (GAs) with different types of chromosomes. The first method is a GA with a binary chromosome (GA-BC) and the other is a GA with a fixed-length character chromosome (GA-FCC). The overall prediction accuracy for the training set was tested by 7-fold cross-validation. All the regression models were evaluated on the test set. The poor prediction for the test set shows that the forward stepwise regression (FSR) model is more prone to overfitting the training set. The results using SVR methods showed that the overfitting could be overcome. Further, overfitting occurs more easily with the GA-BC-SVR method because too many variables are quickly introduced into the model. The final optimal model, with good predictive ability (R^2 = 0.885, S = 0.469, R^2_cv = 0.700, S_cv = 0.757, R^2_ex = 0.692, S_ex = 0.675), was obtained using the GA-FCC-SVR method. Our investigation indicates that variable selection using GA-FCC is the most appropriate for the MLR and SVR methods.
Funding: Zhou's research was partially supported by the National Natural Science Foundation of China (10471140 and 10571169).
Abstract: A simple but efficient method is proposed for selecting variables in heteroscedastic regression models. It is shown that the pseudo empirical wavelet coefficients corresponding to the significant explanatory variables in the regression models are clearly larger than those corresponding to the nonsignificant ones, and on this basis a procedure is developed to select variables in regression models. The coefficients of the models are also estimated. All estimators are proved to be consistent.
Abstract: Systematic customer analysis is one way for an enterprise to understand consumer behavior patterns efficiently and in depth. Further investigation of customer patterns helps the firm make effective decisions, which in turn helps optimize the enterprise's business and maximize consumer satisfaction. To assess customers effectively, Naive Bayes (NB, also called Simple Bayes), a machine learning model, is used. However, the efficacy of the simple Bayes model relies entirely on the consumer data used, and uncertain and redundant attributes in the consumer data lead the model to poor predictions because of its assumption about the attributes. In practice, the NB premise does not hold in consumer data, and including these redundant attributes degrades the model's predictions. In this work, an ensemble attribute selection methodology is applied to overcome this problem and to pick a stable, uncorrelated attribute set for modeling with the NB classifier. In ensemble variable selection, two different strategies are applied: one is based on data perturbation (a homogeneous ensemble, in which the same feature selector is applied to different subsamples derived from the same learning set) and the other is based on function perturbation (a heterogeneous ensemble, in which different feature selectors are applied to the same learning set). Furthermore, the feature set obtained from each ensemble strategy is applied to NB individually and the outcome is evaluated. Finally, the experimental outcomes show that the proposed ensemble strategies are effective in choosing a stable attribute set and in improving NB classification performance.
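The data-perturbation (homogeneous) strategy described above can be sketched as follows. The abstract does not name the base feature selector, so a simple absolute-correlation ranking is assumed here; `top_k_by_correlation`, `homogeneous_ensemble`, the subsample fraction, and the number of rounds are all hypothetical choices for illustration.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Base selector (an assumption): keep the k features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features (zero denominator) receive a score of zero.
    score = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return set(np.argsort(-score)[:k].tolist())


def homogeneous_ensemble(X, y, k, n_rounds=50, frac=0.8, seed=0):
    """Apply the same selector to many random subsamples of the learning set
    and count how often each feature is picked. Returns the k most frequently
    selected features and the per-feature selection frequencies."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(2, int(frac * n))
    counts = np.zeros(p)
    for _ in range(n_rounds):
        rows = rng.choice(n, size=m, replace=False)
        for j in top_k_by_correlation(X[rows], y[rows], k):
            counts[j] += 1
    freq = counts / n_rounds
    return sorted(np.argsort(-freq)[:k].tolist()), freq
```

A function-perturbation (heterogeneous) ensemble would instead keep the rows fixed and loop over several different selectors, aggregating their picks in the same way. The selection frequencies give a direct reading of how stable each attribute is across perturbations.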
Abstract: This paper compares variable selection methods for multiple linear regression models whose full model contains both relevant and irrelevant variables when the predictor variables are highly correlated (0.999). Two objective functions are used in the Tabu Search: the mean square error (MSE) and the mean absolute error (MAE). The results of Tabu Search are compared with those obtained by the stepwise regression method on the basis of the hit percentage criterion. The simulations cover both cases, without and with multicollinearity. For each situation, 1,000 iterations are examined with sample sizes n = 25 and 100 at the 0.05 level of significance. Without multicollinearity, the hit percentages of the stepwise regression method and of Tabu Search with the MSE objective function are almost the same, and slightly higher than that of Tabu Search with the MAE objective function. With multicollinearity, however, the hit percentages of Tabu Search with both objective functions are higher than that of the stepwise regression method.
Abstract: There are two fundamental goals in statistical learning: identifying relevant predictors and ensuring high prediction accuracy. The first goal, achieved by means of variable selection, is of particular importance when the true underlying model has a sparse representation. Discovering relevant predictors can enhance the prediction performance of the fitted model. Usually an estimate is considered desirable if it is consistent in terms of both coefficient estimation and variable selection. Hence, before we try to estimate the regression coefficients β, it is preferable to have a set of useful predictors in hand. The emphasis of this paper is to propose a method for identifying relevant predictors that ensures screening consistency in variable selection. The primary interest is in Orthogonal Matching Pursuit (OMP).
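As a reminder of how OMP screens predictors, here is a minimal sketch of the textbook algorithm. It is not the specific procedure or stopping rule studied in the paper, which the abstract does not give; the function name `omp` and the fixed number of steps `n_nonzero` are assumptions made for the example.

```python
import numpy as np


def omp(X, y, n_nonzero):
    """Orthogonal Matching Pursuit: greedily pick the column most correlated
    with the current residual, then refit by least squares on all picked columns."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0              # avoid dividing by zero for empty columns
    support = []
    residual = y.astype(float).copy()
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        corr = np.abs(X.T @ residual) / norms
        if support:
            corr[support] = -np.inf      # never pick a column twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    beta = np.zeros(X.shape[1])
    beta[support] = coef
    return support, beta
```

The returned `support` lists the screened predictors in the order they entered; `beta` holds the least-squares coefficients on that support and zeros elsewhere. The refit after every step is what makes the pursuit "orthogonal": the residual is always orthogonal to all columns already selected.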
Abstract: Variable selection is one of the most fundamental problems in regression analysis. By sampling from the posterior distributions of candidate models, Bayesian variable selection via MCMC (Markov chain Monte Carlo) is effective in overcoming the computational burden of all-subset variable selection approaches. However, the convergence of the MCMC is often hard to determine, and one is often unsure whether the obtained samples are unbiased. This complication has limited the application of Bayesian variable selection in practice. Based on the idea of CFTP (coupling from the past), perfect sampling schemes have been developed to obtain independent samples from the posterior distribution for a variety of problems. Here the authors propose an efficient and effective perfect sampling algorithm for Bayesian variable selection in linear regression models, which samples independently and identically from the posterior distribution of the model space and can efficiently handle thousands of variables. The effectiveness of the authors' algorithm is illustrated by three simulation studies with up to thousands of variables. The authors' method is further illustrated in an SNP (single nucleotide polymorphism) association study among RA (rheumatoid arthritis) patients.
Funding: Support provided by the National Natural Science Foundation of China (60844007, 61178036, 21265006); the National Science and Technology Support Plan (2008BAD96B04); the Special Science and Technology Support Program for Foreign Science and Technology Cooperation Plan (2009BHB15200); and the Technological Expertise and Academic Leaders Training Plan of Jiangxi Province (2009DD00700).
Abstract: Variable selection is widely applied in visible-near infrared (Vis-NIR) spectroscopy analysis of the internal quality of fruits. Different spectral variable selection methods were compared for online quantitative analysis of soluble solids content (SSC) in navel oranges. Moving window partial least squares (MW-PLS), Monte Carlo uninformative variable elimination (MC-UVE) and wavelet transform (WT) combined with the MC-UVE method were used to select the spectral variables and develop calibration models for online analysis of SSC in navel oranges. The performances of these methods were compared for modeling the Vis-NIR data sets of navel orange samples. Results show that the WT-MC-UVE method gave the better calibration model, with a higher correlation coefficient (r) of 0.89 and a lower root mean square error of prediction (RMSEP) of 0.54 at 5 fruits per second. It is concluded that Vis-NIR spectroscopy coupled with WT-MC-UVE may be a fast and effective tool for online quantitative analysis of SSC in navel oranges.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12001277, 12271046 and 12131006).
Abstract: Penalized variable selection methods are often used to select the relevant covariates and estimate the unknown regression coefficients simultaneously, but existing methods may fail to be consistent when the covariates are highly correlated. In this paper, the semi-standard partial covariance (SPAC) method with a Lasso penalty is proposed for the generalized linear model with highly correlated covariates, and the consistency of estimation and of variable selection is shown in high-dimensional settings under some regularity conditions. Simulation studies and an analysis of a colon tumor dataset show that the proposed method handles highly correlated covariates better than traditional penalized variable selection methods.
Funding: National Natural Science Foundation of China, Grant/Award Number: 72271237; Building World-class Universities of Renmin University of China, Grant/Award Number: 21XNF037.
Abstract: Deep learning has become increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced model interpretability. However, because deep learning requires a large sample size, the existing methods may give uncertain findings when the dataset is small, as is common in omics data analysis. With the explosion and availability of omics data from multiple populations and studies, the existing methods naively pool them into one dataset to increase the sample size while ignoring that variable structures can differ across datasets, which might lead to inaccurate variable selection results. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and considers both homogeneity and heterogeneity among the datasets in an integrative analysis framework. Results from extensive simulation studies and applications of PIN to gene expression datasets from elders with different cognitive statuses or ovarian cancer patients at different stages demonstrate that PIN outperforms existing methods, with considerably improved performance across multiple datasets. The source code is freely available on GitHub (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of disease-related important variables based on multiple studies and datasets from diverse origins.
Funding: Project supported by the National Key Research and Development Program of China (No. 2021YFB3400700) and the National Natural Science Foundation of China (Nos. 12422201, 12072188, 12121002, and 12372017).
Abstract: Variable selection for high-dimensional nonparametric nonlinear systems aims to select the contributing variables or to eliminate the redundant ones. For a high-dimensional nonparametric nonlinear system, however, identifying whether a variable contributes is not easy. Therefore, based on the Fourier spectrum of the density-weighted derivative, a novel variable selection approach is developed, which does not suffer from the curse of dimensionality and improves identification accuracy. Furthermore, a necessary and sufficient condition for testing whether a variable contributes is provided. The proposed approach does not require strong assumptions on the distribution, such as an elliptical distribution. A simulation study verifies the effectiveness of the novel variable selection algorithm.
Funding: The National Natural Science Foundation of China [grant number 11861041]; the Natural Science Research Foundation of Kunming University of Science and Technology [grant number KKSY201907003].
Abstract: A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression models based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including consistency of the variable selection method and the oracle property in estimation. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computations is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.
Funding: Supported in part by the National Science Foundation of China under Grant Nos. 12071305 and 71803001, in part by the National Social Science Foundation of China under Grant No. 19BTJ014, in part by the University Social Science Research Project of Anhui Province under Grant No. SK2020A0051, and in part by the Social Science Foundation of the Ministry of Education of China under Grant Nos. 19YJCZH250 and 21YJAZH081.
Abstract: Variable selection for varying coefficient models includes the separation of varying and constant effects, and the selection of variables with nonzero varying effects and those with nonzero constant effects. This paper proposes a unified variable selection approach, called the double-penalized quadratic inference functions method, for varying coefficient models of longitudinal data. The proposed method can not only separate varying coefficients from constant coefficients, but also estimate and select the nonzero varying coefficients and nonzero constant coefficients. It is suitable for variable selection in linear models, varying coefficient models, and partial linear varying coefficient models. Under regularity conditions, the proposed method is consistent in both separation and selection of varying coefficients and constant coefficients. The resulting estimators of the varying coefficients attain the optimal convergence rate of nonparametric function estimation, and the estimators of the nonzero constant coefficients are consistent and asymptotically normal. Finally, the authors investigate the finite-sample performance of the proposed method through simulation studies and a real data analysis. The results show that the proposed method performs better than the existing competitor.