With the rapid development of DNA technologies, high-throughput genomic data have become a powerful means of locating desirable genetic loci associated with important traits in various crop species. However, current genetic association mapping analyses focus on identifying individual QTLs. This study aimed to identify a set of QTLs or genetic markers that can capture genetic variability for marker-assisted selection. Selecting a set of k loci that maximizes genetic variation from high-throughput genomic data is a challenging problem. In this study, we proposed an adaptive sequential replacement (ASR) method, which can be considered a variant of the sequential replacement (SR) method. Through Monte Carlo simulation and comparison with four other selection methods (exhaustive, SR, forward, and backward), we found that the ASR method yields consistent and repeatable results comparable to those of the exhaustive method at a much lower computational cost.
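To make the subset search concrete, the following is a minimal sketch of plain sequential replacement, not the authors' adaptive variant, whose details the abstract does not give. The `score` callable, the marker names, and the toy gain and redundancy values are all hypothetical stand-ins for a criterion such as the genetic variance explained by a set of loci.

```python
def sequential_replacement(candidates, k, score):
    """Plain sequential replacement: start from an arbitrary subset of size k,
    then repeatedly swap one selected item for an unselected one whenever the
    swap raises the score, until no single swap helps."""
    selected = list(candidates[:k])  # arbitrary starting subset
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in candidates:
                if cand in selected:
                    continue
                trial = selected[:pos] + [cand] + selected[pos + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    selected, best = trial, trial_score
                    improved = True
    return sorted(selected), best


# Hypothetical toy criterion: each marker adds some variance, but markers
# m1 and m2 are largely redundant, so picking both is penalized.
GAIN = {"m1": 5.0, "m2": 4.0, "m3": 3.5, "m4": 1.0, "m5": 0.5}
REDUNDANCY = {frozenset(("m1", "m2")): 3.0}


def toy_score(subset):
    total = sum(GAIN[m] for m in subset)
    for pair, penalty in REDUNDANCY.items():
        if pair <= set(subset):
            total -= penalty
    return total


best_subset, best_value = sequential_replacement(
    ["m4", "m5", "m1", "m2", "m3"], k=2, score=toy_score
)
```

Like forward and backward selection, this search can stop at a local optimum; the exhaustive method avoids that by scoring every size-k subset, at a cost that grows combinatorially.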
Coal is a crucial fossil energy source in today's society, and the detection of sulfur (S) and nitrogen (N) in coal is essential for evaluating coal quality. An efficient method is therefore needed to quantitatively analyze the N and S content of coal so that coal can be used cleanly. This study applied laser-induced breakdown spectroscopy (LIBS) to test coal quality and combined two variable selection algorithms, competitive adaptive reweighted sampling (CARS) and the successive projections algorithm (SPA), with partial least squares (PLS) modeling. The results were as follows. The PLS model built on the full spectrum of 27,620 variables had poor accuracy: the coefficient of determination of the test set (R^2_P) and the root mean square error of the test set (RMSEP) were 0.5172 and 0.2263, respectively, for nitrogen, and 0.5784 and 0.5811, respectively, for sulfur. CARS-PLS screened 37 and 25 variables for N and S, respectively, but the prediction ability of the model did not improve significantly. SPA-PLS finally screened 14 and 11 variables, respectively, through successive projections and obtained the best prediction among the three methods. The R^2_P and RMSEP were 0.9873 and 0.0208, respectively, for nitrogen, and 0.9451 and 0.2082, respectively, for sulfur. In general, the predictive results for the two elements improved by about 90% for RMSEP and 60% for R^2_P compared with PLS. The results show that LIBS combined with SPA-PLS has good potential for detecting N and S content in coal and is a very promising technology for industrial application.
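The two test-set metrics quoted above can be computed as in the following minimal sketch. It assumes the common definitions (RMSEP as the root of the mean squared prediction error, R^2 as one minus the ratio of residual to total sum of squares); the study may define R^2_P differently, for example as a squared correlation. The function names and toy values are illustrative only.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction on a test set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_change(before, after):
    """Signed relative change, e.g. -0.5 means the value was halved."""
    return (after - before) / before


# Toy reference values and predictions (hypothetical, not from the study).
y_ref = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 4.0]
toy_rmsep = rmsep(y_ref, y_hat)
toy_r2 = r2(y_ref, y_hat)

# Relative change in the reported nitrogen RMSEP, full-spectrum PLS to SPA-PLS.
nitrogen_rmsep_change = relative_change(0.2263, 0.0208)
```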
In this article, we study variable selection for the partially linear single-index model (PLSIM). Building on minimum average variance estimation, variable selection for the PLSIM is carried out by minimizing the average variance with an adaptive ℓ1 penalty. An implementation algorithm is given. Under some regularity conditions, we demonstrate the oracle properties of the adaptive LASSO (aLASSO) procedure for the PLSIM. Simulations are used to investigate the effectiveness of the proposed method for variable selection in the PLSIM.
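For orientation, the adaptive ℓ1 penalty has the following generic form. This is a sketch in assumed notation, not the paper's own: $L_n(\beta)$ stands in for the average-variance criterion, which the abstract does not spell out, $\tilde{\beta}$ is an initial consistent estimate, and $\lambda_n$ and $\gamma > 0$ are tuning parameters.

```latex
\hat{\beta}
  = \arg\min_{\beta}\;
    L_n(\beta) + \lambda_n \sum_{j=1}^{p} w_j \,\lvert \beta_j \rvert,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}} .
```

Coefficients with small initial estimates receive large weights and are shrunk to zero more aggressively, which is the mechanism behind oracle-type results for adaptive LASSO procedures.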
This paper discussed Bayesian variable selection methods for models from split-plot mixture designs using samples from the Metropolis-Hastings within Gibbs sampling algorithm. Bayesian variable selection is easy to implement owing to improvements in computing via MCMC sampling. We described the Bayesian methodology by introducing the Bayesian framework and explaining Markov chain Monte Carlo (MCMC) sampling. Metropolis-Hastings within Gibbs sampling was used to draw dependent samples from the full conditional distributions, which were explained. In mixture experiments with process variables, the response depends not only on the proportions of the mixture components but also on the effects of the process variables. In many such mixture-process variable experiments, constraints such as time or cost prohibit the selection of treatments completely at random. In these situations, restrictions on the randomisation force the level combinations of one group of factors to be fixed while the combinations of the other group of factors are run. Then a new level of the first factor group is set and the combinations of the other factors are run. We discussed the computational algorithm for Stochastic Search Variable Selection (SSVS) in linear mixed models. We extended the computational algorithm of SSVS to fit models from split-plot mixture designs by introducing the algorithm of Stochastic Search Variable Selection for Split-plot Design (SSVS-SPD). The motivation for this extension is that the split-plot mixture design has two different levels of experimental units, one for the whole plots and the other for the subplots.
In the experimental field, researchers very often need to select the best subset model and obtain the best model estimates simultaneously. Selecting the best subset of variables improves prediction accuracy because noninformative variables are removed. A model with high prediction accuracy allows researchers to use it for future forecasting. In this paper, we investigate the differences between various variable selection methods. The aim is to compare a frequentist method (backward elimination), a penalised shrinkage method (the adaptive LASSO), and Least Angle Regression (LARS) for selecting the active variables in data produced by a blocked design experiment. The results of the comparative study support the use of the LARS method for the statistical analysis of data from blocked experiments.
Although there are many papers on variable selection methods based on the mean model in finite mixtures of regression models, little work has been done on how to select significant explanatory variables in modeling the variance parameter. In this paper, we propose and study a novel class of models: a skew-normal mixture of joint location and scale models for analyzing heteroscedastic skew-normal data from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies. An example is used to illustrate the proposed methodologies.
Executing customer analysis in a systematic way is one possible way for an enterprise to understand consumer behavior patterns efficiently and in depth. Further investigation of customer patterns helps the firm make efficient decisions, which in turn helps to optimize the enterprise's business and maximize consumer satisfaction. To assess customers effectively, Naive Bayes (NB, also called Simple Bayes), a machine learning model, is utilized. However, the efficacy of the NB model relies heavily on the consumer data used, and uncertain and redundant attributes in the consumer data lead the model to poor predictions because of its assumption about the attributes. In practice, the NB premise does not hold in consumer data, and analyzing these redundant attributes degrades its prediction results. In this work, an ensemble attribute selection methodology is applied to overcome this problem and to pick a stable, uncorrelated attribute set to model with the NB classifier. In ensemble variable selection, two different strategies are applied: one is based on data perturbation (a homogeneous ensemble, in which the same feature selector is applied to different subsamples derived from the same learning set), and the other is based on function perturbation (a heterogeneous ensemble, in which different feature selectors are applied to the same learning set). Furthermore, the feature set obtained from each ensemble strategy is applied to NB individually and the outcome is evaluated. Finally, the experimental outcomes show that the proposed ensemble strategies are effective in choosing a stable attribute set and in increasing NB classification performance.
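The data-perturbation (homogeneous) strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the base selector (ranking features by the gap between class means), the bootstrap resampling, the vote-counting aggregation, and the toy data are all hypothetical choices.

```python
import random
from collections import Counter


def top_features_by_class_gap(rows, labels, n_keep):
    """Hypothetical base selector: rank features by the absolute difference
    between their means in class 1 and class 0, and keep the top n_keep."""
    n_features = len(rows[0])
    gaps = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:  # a resample may contain only one class
            gaps.append(0.0)
            continue
        gaps.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: gaps[j], reverse=True)
    return ranked[:n_keep]


def homogeneous_ensemble(rows, labels, n_keep, n_rounds=25, seed=0):
    """Apply the same selector to bootstrap resamples of one learning set
    and keep the features selected most often."""
    rng = random.Random(seed)
    votes = Counter()
    n = len(rows)
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        votes.update(
            top_features_by_class_gap(
                [rows[i] for i in idx], [labels[i] for i in idx], n_keep
            )
        )
    return [j for j, _ in votes.most_common(n_keep)]


# Toy data: feature 0 separates the classes, feature 1 is constant,
# feature 2 is unrelated to the label.
rows = [
    [10.0, 1.0, 0.1], [11.0, 1.0, 0.2], [9.0, 1.0, 0.0], [10.0, 1.0, 0.1],
    [0.0, 1.0, 0.1], [1.0, 1.0, 0.2], [0.0, 1.0, 0.0], [1.0, 1.0, 0.1],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
stable_features = homogeneous_ensemble(rows, labels, n_keep=1)
```

A heterogeneous (function-perturbation) ensemble would instead pass the full learning set to several different selectors and aggregate their votes in the same way.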
Variable selection is widely applied in visible-near infrared (Vis-NIR) spectroscopy analysis of the internal quality of fruits. Different spectral variable selection methods were compared for online quantitative analysis of soluble solids content (SSC) in navel oranges. Moving window partial least squares (MW-PLS), Monte Carlo uninformative variable elimination (MC-UVE), and wavelet transform (WT) combined with MC-UVE were used to select the spectral variables and develop calibration models for online analysis of SSC in navel oranges. The performances of these methods were compared in modeling the Vis-NIR data sets of navel orange samples. The results show that the WT-MC-UVE method gave the better calibration model, with a higher correlation coefficient (r) of 0.89 and a lower root mean square error of prediction (RMSEP) of 0.54 at 5 fruits per second. It is concluded that Vis-NIR spectroscopy coupled with WT-MC-UVE may be a fast and effective tool for online quantitative analysis of SSC in navel oranges.
A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression models based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including consistency of the variable selection method and the oracle property in estimation. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computation is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.
Variable selection for varying coefficient models includes the separation of varying and constant effects, and the selection of variables with nonzero varying effects and those with nonzero constant effects. This paper proposes a unified variable selection approach, called the double-penalized quadratic inference functions method, for varying coefficient models of longitudinal data. The proposed method can not only separate varying coefficients from constant coefficients but also estimate and select the nonzero varying coefficients and nonzero constant coefficients. It is suitable for variable selection in linear models, varying coefficient models, and partial linear varying coefficient models. Under regularity conditions, the proposed method is consistent in both separation and selection of varying coefficients and constant coefficients. The obtained estimators of varying coefficients possess the optimal convergence rate of nonparametric function estimation, and the estimators of nonzero constant coefficients are consistent and asymptotically normal. Finally, the authors investigate the finite sample performance of the proposed method through simulation studies and a real data analysis. The results show that the proposed method performs better than the existing competitor.
A geodemographic classification aims to describe the most salient characteristics of a small-area zonal geography. However, such representations are influenced by the methodological choices made during their construction. Of particular debate are the choice and specification of input variables, with the objective of identifying inputs that add value while keeping the model parsimonious. Within this context, our paper introduces a principal component analysis (PCA)-based automated variable selection methodology whose objective is to identify candidate inputs to a geodemographic classification from a collection of variables. The proposed methodology is exemplified using variables from the UK 2011 Census, and its output is compared with the Office for National Statistics 2011 Output Area Classification (2011 OAC). Through the implementation of the proposed methodology, the quality of the cluster assignment was improved relative to the 2011 OAC, as shown by a lower total within-cluster sum of squares. Across the UK, more than 70.2% of the Output Areas (OAs) covered by the newly created classification (i.e. AVS-OAC) outperform the 2011 OAC, with particularly strong performance in Scotland and Wales.
In practice, predictors naturally possess grouping structures. Incorporating such information can improve statistical modeling and inference. In addition, high dimensionality often leads to collinearity. The elastic net is a well-suited method that tends to exhibit a grouping effect. In this paper, we consider the problem of group selection and estimation in the sparse linear regression model in which predictors can be grouped. We investigate a group adaptive elastic-net and derive oracle inequalities and model consistency for the case where the number of groups is larger than the sample size. The oracle property is addressed for the case of a fixed number of groups. We revise the locally approximated coordinate descent algorithm to carry out the computation. Simulation and real data studies indicate that the group adaptive elastic-net is a competitive alternative for model selection in high-dimensional problems where the number of groups is larger than the sample size.
Nondestructive determination of the internal quality of thick-skin fruits has always been a challenge. To investigate the ability of the full transmittance mode to predict the soluble solid content (SSC) of thick-skin fruits, the full transmittance spectra of citrus were collected using a portable visible/near infrared (Vis/NIR) spectrograph (550–1100 nm). Three obvious absorption peaks were found at 710, 810 and 915 nm in the original spectral curve. Four spectral preprocessing methods, namely smoothing, multiplicative scatter correction (MSC), standard normal variate (SNV) and first derivative, were employed to improve the quality of the original spectra. Subsequently, the effective wavelengths for SSC were selected from the original and pretreated spectra with the successive projections algorithm (SPA), competitive adaptive reweighted sampling (CARS) and a genetic algorithm (GA). Finally, prediction models of SSC were established based on the full wavelengths and the effective wavelengths. The results showed that SPA performed best at eliminating uninformative variables and optimizing the number of effective variables. The optimal prediction model was established on 10 characteristic variables selected by SPA from the SNV-pretreated spectra, with a correlation coefficient, root mean square error, and residual predictive deviation for the prediction set of 0.9165, 0.5684 °Brix and 2.5120, respectively. Overall, the full transmittance mode was feasible for predicting the internal quality of thick-skin fruits such as citrus. Additionally, combining spectral preprocessing with a variable selection algorithm was effective for developing a reliable prediction model. The conclusions of this study also provide an alternative method for fast, real-time detection of the internal quality of thick-skin fruits using Vis/NIR spectroscopy.
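Of the preprocessing steps named above, standard normal variate (SNV) is simple enough to sketch directly: each spectrum is centered on its own mean and scaled by its own standard deviation. The sketch below assumes the sample standard deviation (divisor n - 1), which is a common convention but is not stated in the abstract; the function name and toy spectra are illustrative.

```python
import math


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own
    mean and sample standard deviation (divisor n - 1, an assumption)."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in spectrum) / (n - 1))
    return [(x - mean) / sd for x in spectrum]


# Two toy spectra that differ only by an offset and a scale factor,
# mimicking additive and multiplicative scatter effects.
raw_a = [1.0, 2.0, 3.0, 4.0]
raw_b = [12.0, 14.0, 16.0, 18.0]
snv_a = snv(raw_a)
snv_b = snv(raw_b)
```

Because SNV works on one spectrum at a time, the two toy spectra map to the same corrected curve, which is why it is used to reduce scatter-related baseline and scale differences before variable selection.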
In this paper, we consider variable selection in partial linear single-index models under the assumption that the vector of regression coefficients is sparse. We apply penalized splines to estimate the nonparametric function and the SCAD penalty to achieve sparse estimates of the regression parameters in both the linear and single-index parts of the model. Under some mild conditions, it is shown that the penalized estimators have the oracle property, in the sense that they are asymptotically normal with the same mean and covariance that they would have if the zero coefficients were known in advance. Our model admits a least squares representation, so standard least squares algorithms can be used without extra programming effort. Moreover, parametric estimation, variable selection and nonparametric estimation can be carried out in one step, which greatly increases computational stability. The finite sample performance of the penalized estimators is evaluated through Monte Carlo studies and illustrated with a real data set.
Multiple testing has gained much attention in high-dimensional statistical theory and applications, and the problem of variable selection can be regarded as a generalization of multiple testing. It aims to select the important variables from among many variables. Performing variable selection in high-dimensional linear models with measurement errors is challenging. The influence of both high-dimensional parameters and measurement errors needs to be considered to avoid severe bias. We consider the problem of variable selection in error-in-variables models and introduce the DCoCoLasso-FDP procedure, a new variable selection method. By constructing consistent estimators of the false discovery proportion (FDP) and the false discovery rate (FDR), our method can prioritize the important variables and control the FDP and FDR at a specified level in error-in-variables models. An extensive simulation study is conducted to compare the DCoCoLasso-FDP procedure with existing methods in various settings, and numerical results are provided to demonstrate the efficiency of our method.
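The two error measures named above can be illustrated in a simulation setting where the truly important variables are known. The sketch below uses the standard definitions (FDP is the fraction of selected variables that are false, taken as 0 when nothing is selected; FDR is its expectation, estimated here by an average over replicates). It does not reproduce the DCoCoLasso-FDP estimator, and the variable indices are hypothetical.

```python
def false_discovery_proportion(selected, true_support):
    """Fraction of selected variables that are not truly important.
    Defined as 0 when nothing is selected."""
    selected = set(selected)
    if not selected:
        return 0.0
    false_discoveries = selected - set(true_support)
    return len(false_discoveries) / len(selected)


def empirical_fdr(selections, true_support):
    """Average FDP over simulation replicates, an estimate of the FDR."""
    fdps = [false_discovery_proportion(s, true_support) for s in selections]
    return sum(fdps) / len(fdps)


# Hypothetical simulation: variables 0-3 are truly important.
truth = {0, 1, 2, 3}
replicate_selections = [
    {0, 1, 2, 5},  # one false discovery out of four
    {0, 1, 2, 3},  # no false discoveries
    set(),         # nothing selected
    {4, 6},        # two false discoveries out of two
]
fdp_first = false_discovery_proportion(replicate_selections[0], truth)
fdr_estimate = empirical_fdr(replicate_selections, truth)
```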
Input variable selection (IVS) has proved to be pivotal in nonlinear dynamic system modeling. To optimize the model of a nonlinear dynamic system, a fuzzy modeling method that determines the premise structure by selecting important inputs of the system is studied. Firstly, a simplified two-stage fuzzy curves method is proposed, which is used to rank all possible inputs by their relevance to the outputs, select the important input variables of the system and identify the structure. Secondly, to reduce the complexity of the model, the standard fuzzy c-means clustering algorithm and the recursive least squares algorithm are used to identify the premise parameters and conclusion parameters, respectively. Then, the effectiveness of IVS is verified on two well-known problems. Finally, the proposed identification method is applied to a realistic variable-load pneumatic system. The simulation experiments indicate that the IVS method in this paper has a positive influence on the approximation performance of Takagi-Sugeno (T-S) fuzzy modeling.
In this paper, we present a variable selection procedure that combines basis function approximations with penalized estimating equations for varying-coefficient models with responses missing at random. With appropriate selection of the tuning parameters, we establish the consistency of the variable selection procedure and the optimal convergence rate of the regularized estimators. A simulation study is undertaken to assess the finite sample performance of the proposed variable selection procedure.
In supervised learning, the number of values of a response variable can be very high. Grouping these values into a few clusters can be useful for performing accurate supervised classification analyses. On the other hand, selecting relevant covariates is a crucial step in building robust and efficient prediction models. In this paper, we propose an algorithm that simultaneously groups the values of a response variable into a limited number of clusters and selects, stepwise, the best covariates that discriminate this clustering. These objectives are achieved by alternating optimization of a user-defined model selection criterion. This process extends a former version of the algorithm to a more general framework. Moreover, possible further developments are discussed in detail.
Survival data with a multi-state structure are frequently observed in follow-up studies. An analytic approach based on a multi-state model (MSM) should be used in longitudinal health studies in which a patient experiences a sequence of clinical progression events. One main objective in the MSM framework is variable selection, where attempts are made to identify the risk factors associated with the transition hazard rates or probabilities of disease progression. The usual variable selection methods, including stepwise and penalized methods, do not provide information about the importance of variables. In this context, we present a two-step algorithm to evaluate the importance of variables for multi-state data. Three different machine learning approaches (random forest, gradient boosting, and neural network), as the most widely used methods, are considered for estimating variable importance in order to identify the factors affecting disease progression and rank these factors according to their importance. The performance of our proposed methods is validated by simulation and applied to a COVID-19 data set. The results revealed that the proposed two-step method has promising performance for estimating variable importance.
Capturing leaf color variances over space is important for diagnosing plant nutrient and health status, estimating water availability, and improving the ornamental and tourism value of plants. In this study, leaf color variances of the Eurasian smoke tree, Cotinus coggygria, were estimated from geographic and climate variables in a shrub community using generalized elastic net (GELnet) and support vector machine (SVM) algorithms. The results reveal that leaf color varied over space and that the variances resulted from geography, through its effect on solar radiation, temperature, illumination and moisture of the shrub environment, whereas the influence of climate was not obvious. The SVM and GELnet models performed similarly in estimating leaf color indices from geographic variables, demonstrating that both techniques have the potential to estimate leaf color variances of C. coggygria in a shrubbery with a complex geographical environment in the absence of human activity.
Funding: the Jiangsu Government Scholarship for Overseas Studies (JS-2019-031) and the Startup Foundation for Introducing Talent of NUIST (2243141701023).
Funding: Supported by the National Natural Science Foundation of China (11861041).
Abstract: Although there are many papers on variable selection methods based on the mean model in finite mixtures of regression models, little work has been done on how to select significant explanatory variables when modeling the variance parameter. In this paper, we propose and study a novel class of models, a skew-normal mixture of joint location and scale models, to analyze heteroscedastic skew-normal data coming from a heterogeneous population. The problem of variable selection for the proposed models is considered. In particular, a modified Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. The consistency and the oracle property of the penalized estimators are established. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies, which are also illustrated with an example.
Abstract: Analyzing customers systematically is one way for an enterprise to understand consumer behavior patterns efficiently and in depth. Further investigation of customer patterns helps the firm make effective decisions, which in turn helps optimize the enterprise's business and maximize consumer satisfaction. To assess customers effectively, Naive Bayes (NB, also called Simple Bayes), a machine learning model, is used. However, the efficacy of the NB model depends heavily on the consumer data used: uncertain and redundant attributes in the data lead NB to poor predictions because of its assumption about the attributes. In practice, the NB premise does not hold for consumer data, and analyzing these redundant attributes yields poor prediction results. In this work, an ensemble attribute selection methodology is applied to overcome this problem and to pick a stable, uncorrelated attribute set for modeling with the NB classifier. In ensemble variable selection, two different strategies are applied: one is based on data perturbation (a homogeneous ensemble, in which the same feature selector is applied to different subsamples derived from the same learning set) and the other is based on function perturbation (a heterogeneous ensemble, in which different feature selectors are applied to the same learning set). The feature sets obtained from the two ensemble strategies are then applied to NB individually and the outcomes are evaluated. The experimental results show that the proposed ensemble strategies are effective in choosing a stable attribute set and in improving NB classification performance.
Funding: Support provided by the National Natural Science Foundation of China (60844007, 61178036, 21265006), the National Science and Technology Support Plan (2008BAD96B04), the Special Science and Technology Support Program for Foreign Science and Technology Cooperation Plan (2009BHB15200), and the Technological Expertise and Academic Leaders Training Plan of Jiangxi Province (2009DD00700).
Abstract: Variable selection is widely applied in visible-near infrared (Vis-NIR) spectroscopy analysis of the internal quality of fruits. Different spectral variable selection methods were compared for online quantitative analysis of soluble solids content (SSC) in navel oranges. Moving window partial least squares (MW-PLS), Monte Carlo uninformative variables elimination (MC-UVE) and wavelet transform (WT) combined with MC-UVE were used to select the spectral variables and develop calibration models for online analysis of SSC in navel oranges. The performances of these methods were compared in modeling the Vis-NIR data sets of navel orange samples. The results show that the WT-MC-UVE method gave the better calibration model, with a higher correlation coefficient (r) of 0.89 and a lower root mean square error of prediction (RMSEP) of 0.54 at 5 fruits per second. It is concluded that Vis-NIR spectroscopy coupled with WT-MC-UVE may be a fast and effective tool for online quantitative analysis of SSC in navel oranges.
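The two figures of merit quoted in this abstract, the correlation coefficient (r) and the RMSEP, can be computed as in the minimal sketch below. The function names and the sample values are hypothetical and are not taken from the paper; RMSEP is assumed here to be the root of the mean squared difference between measured and predicted values on the prediction set.

```python
import math

def pearson_r(measured, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)

def rmsep(measured, predicted):
    """Root mean square error of prediction over a prediction set."""
    n = len(measured)
    squared_errors = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared_errors / n)

# Hypothetical SSC values, for illustration only.
measured = [10.0, 11.0, 12.0, 13.0]
predicted = [10.5, 10.5, 12.5, 12.5]
print(pearson_r(measured, predicted), rmsep(measured, predicted))
```

A higher r together with a lower RMSEP on the same prediction set is the sense in which one calibration model is described as better than another.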
Funding: The National Natural Science Foundation of China [grant number 11861041] and the Natural Science Research Foundation of Kunming University of Science and Technology [grant number KKSY201907003].
Abstract: A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression models based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including consistency of the variable selection method and the oracle property in estimation. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computation is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.
Funding: Supported in part by the National Science Foundation of China under Grant Nos. 12071305 and 71803001, in part by the National Social Science Foundation of China under Grant No. 19BTJ014, in part by the University Social Science Research Project of Anhui Province under Grant No. SK2020A0051, and in part by the Social Science Foundation of the Ministry of Education of China under Grant Nos. 19YJCZH250 and 21YJAZH081.
Abstract: Variable selection for varying coefficient models includes the separation of varying and constant effects, and the selection of variables with nonzero varying effects and those with nonzero constant effects. This paper proposes a unified variable selection approach, called the double-penalized quadratic inference functions method, for varying coefficient models of longitudinal data. The proposed method can not only separate varying coefficients from constant coefficients, but also estimate and select the nonzero varying coefficients and nonzero constant coefficients. It is suitable for variable selection in linear models, varying coefficient models, and partial linear varying coefficient models. Under regularity conditions, the proposed method is consistent in both the separation and the selection of varying coefficients and constant coefficients. The resulting estimators of the varying coefficients attain the optimal convergence rate of nonparametric function estimation, and the estimators of the nonzero constant coefficients are consistent and asymptotically normal. Finally, the authors investigate the finite sample performance of the proposed method through simulation studies and a real data analysis. The results show that the proposed method performs better than the existing competitor.
Abstract: A geodemographic classification aims to describe the most salient characteristics of a small area zonal geography. However, such representations are influenced by the methodological choices made during their construction. Of particular debate are the choice and specification of input variables, with the objective of identifying inputs that add value while keeping the model parsimonious. Within this context, our paper introduces a principal component analysis (PCA)-based automated variable selection methodology that aims to identify candidate inputs to a geodemographic classification from a collection of variables. The proposed methodology is exemplified using variables from the UK 2011 Census, and its output is compared to the Office for National Statistics 2011 Output Area Classification (2011 OAC). Through the implementation of the proposed methodology, the quality of the cluster assignment was improved relative to 2011 OAC, as shown by a lower total within-cluster sum of squares score. Across the UK, more than 70.2% of the Output Areas (OAs) occupied by the newly created classification (i.e. AVS-OAC) outperform the 2011 OAC, with particularly strong performance within Scotland and Wales.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 11571219), the Open Research Fund Program of Key Laboratory of Mathematical Economics (SUFE) (Grant No. 201309KF02), Ministry of Education, and Changjiang Scholars and Innovative Research Team in University (Grant No. IRT13077).
Abstract: In practice, predictors often possess natural grouping structures, and incorporating such information can improve statistical modeling and inference. In addition, high dimensionality often leads to collinearity. The elastic net is well suited to this setting because it tends to produce a grouping effect. In this paper, we consider the problem of group selection and estimation in the sparse linear regression model in which predictors can be grouped. We investigate a group adaptive elastic-net and derive oracle inequalities and model consistency for the case where the number of groups is larger than the sample size. The oracle property is addressed for the case of a fixed number of groups. We revise the locally approximated coordinate descent algorithm to carry out the computation. Simulation and real data studies indicate that the group adaptive elastic-net is a competitive alternative for model selection in high-dimensional problems where the number of groups is larger than the sample size.
Funding: This study was supported by the National Key Research and Development Program (2016YFD0200104), the Beijing Talents Foundation (2018000021223ZK06), and the National Natural Science Foundation of China (Grant No. 31671927).
Abstract: Nondestructive determination of the internal quality of thick-skin fruits has always been a challenge. To investigate the ability of the full transmittance mode to predict the soluble solid content (SSC) of thick-skin fruits, the full transmittance spectra of citrus were collected using a visible/near infrared (Vis/NIR) portable spectrograph (550–1100 nm). Three obvious absorption peaks were found at 710, 810 and 915 nm in the original spectra curve. Four spectral preprocessing methods, namely smoothing, multiplicative scatter correction (MSC), standard normal variate (SNV) and first derivative, were employed to improve the quality of the original spectra. Subsequently, the effective wavelengths for SSC were selected from the original and pretreated spectra with the successive projections algorithm (SPA), competitive adaptive reweighted sampling (CARS) and a genetic algorithm (GA). Finally, prediction models of SSC were established based on the full wavelengths and the effective wavelengths. Results showed that SPA performed best at eliminating uninformative variables and optimizing the number of effective variables. The optimal prediction model was established based on 10 characteristic variables selected by SPA from the spectra pretreated by SNV, with a correlation coefficient, root mean square error, and residual predictive deviation for the prediction set of 0.9165, 0.5684 °Brix and 2.5120, respectively. Overall, the full transmittance mode was feasible for predicting the internal quality of thick-skin fruits such as citrus. Additionally, combining spectral preprocessing with a variable selection algorithm was effective for developing a reliable prediction model. The conclusions of this study also provide an alternative method for fast, real-time detection of the internal quality of thick-skin fruits using Vis/NIR spectroscopy.
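Of the preprocessing methods named in this abstract, standard normal variate (SNV) is simple enough to sketch directly: each spectrum is centered by its own mean and scaled by its own standard deviation. The function name and the toy spectra below are hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption, since the abstract does not give implementation details.

```python
import math

def snv(spectrum):
    """Standard normal variate: center one spectrum by its own mean and
    scale it by its own sample standard deviation (assumed n - 1 denominator)."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in spectrum) / (n - 1))
    return [(x - mean) / sd for x in spectrum]

# Two toy spectra that differ only by an offset and a scale factor,
# which is the kind of variation SNV is meant to remove.
print(snv([1.0, 2.0, 3.0]))
print(snv([3.0, 5.0, 7.0]))
```

Because the transform is applied row by row, it needs no reference spectrum, which is one practical difference from MSC.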
Funding: Supported by the National Natural Science Foundation of China (No. 11671096).
Abstract: In this paper, we consider the issue of variable selection in partial linear single-index models under the assumption that the vector of regression coefficients is sparse. We apply penalized splines to estimate the nonparametric function and the SCAD penalty to achieve sparse estimates of the regression parameters in both the linear and single-index parts of the model. Under some mild conditions, it is shown that the penalized estimators have the oracle property, in the sense that they are asymptotically normal with the same mean and covariance that they would have if the zero coefficients were known in advance. Our model has a least squares representation, so standard least squares algorithms can be used without extra programming effort. Moreover, parametric estimation, variable selection and nonparametric estimation can be carried out in one step, which greatly improves computational stability. The finite sample performance of the penalized estimators is evaluated through Monte Carlo studies and illustrated with a real data set.
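For readers unfamiliar with the SCAD penalty mentioned in this abstract, the sketch below evaluates its usual piecewise form (linear near zero, quadratic in a transition zone, constant beyond it). The function name is hypothetical, and the default a = 3.7 is the value commonly used in the literature, not something stated in the abstract.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for one coefficient `theta` with tuning parameter `lam` > 0.
    The shape parameter `a` must exceed 2; 3.7 is a conventional default
    (an assumption here, not taken from the abstract)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Transition zone: quadratic piece that flattens smoothly.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2

for value in (0.5, 2.0, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The constant tail is what distinguishes SCAD from the lasso: large coefficients incur no additional penalty, which is the mechanism behind oracle-type results.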
Abstract: Multiple testing has gained much attention in high-dimensional statistical theory and applications, and the problem of variable selection can be regarded as a generalization of multiple testing: it aims to select the important variables among many candidates. Performing variable selection in high-dimensional linear models with measurement errors is challenging, because both the influence of high-dimensional parameters and the measurement errors need to be considered to avoid severe bias. We consider the problem of variable selection in errors-in-variables models and introduce the DCoCoLasso-FDP procedure, a new variable selection method. By constructing consistent estimators of the false discovery proportion (FDP) and the false discovery rate (FDR), our method can prioritize the important variables and control the FDP and FDR at a specified level in errors-in-variables models. An extensive simulation study is conducted to compare the DCoCoLasso-FDP procedure with existing methods in various settings, and numerical results are provided to demonstrate the efficiency of our method.
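The quantity being controlled in this abstract can be illustrated with its definition: the FDP is the share of selected variables that are not truly active, and the FDR is its expectation. The sketch below computes the realized FDP when the true support is known, as in a simulation study. It is not the DCoCoLasso-FDP estimator, which has to estimate this quantity without knowing the truth; the function name and index sets are hypothetical.

```python
def false_discovery_proportion(selected, true_support):
    """Realized FDP: fraction of selected variables outside the true support.
    Defined as 0 when nothing is selected."""
    selected = set(selected)
    false_discoveries = selected - set(true_support)
    return len(false_discoveries) / max(len(selected), 1)

# Hypothetical simulation outcome: variables 0-3 are truly active,
# and a selection procedure returns {0, 1, 2, 5}.
print(false_discovery_proportion([0, 1, 2, 5], [0, 1, 2, 3]))
```

Averaging this value over repeated simulated data sets gives an empirical estimate of the FDR that can be compared with the target level.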
Funding: This work was supported by the Natural Science Foundation of Hebei Province (F2019203505).
Abstract: Input variable selection (IVS) has proved to be pivotal in nonlinear dynamic system modeling. To optimize the model of a nonlinear dynamic system, a fuzzy modeling method that determines the premise structure by selecting the important inputs of the system is studied. Firstly, a simplified two-stage fuzzy curves method is proposed, which is used to sort all possible inputs by their relevance to the outputs, select the important input variables of the system and identify the structure. Secondly, to reduce the complexity of the model, the standard fuzzy c-means clustering algorithm and the recursive least squares algorithm are used to identify the premise parameters and the conclusion parameters, respectively. Then, the effectiveness of IVS is verified on two well-known problems. Finally, the proposed identification method is applied to a realistic variable load pneumatic system. The simulation experiments indicate that the IVS method in this paper has a positive influence on the approximation performance of Takagi-Sugeno (T-S) fuzzy modeling.
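The standard fuzzy c-means algorithm named in this abstract alternates between updating cluster centers and updating memberships. The sketch below shows only the membership update, for one-dimensional data, under the usual formulation with fuzzifier m > 1. The function name, the sample points and the choice m = 2 are illustrative assumptions and are not taken from the paper.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for 1-D data.
    Returns u where u[i][k] is the membership of points[k] in cluster i."""
    exponent = 2.0 / (m - 1.0)
    memberships = [[0.0] * len(points) for _ in centers]
    for k, x in enumerate(points):
        dists = [abs(x - c) for c in centers]
        hits = [i for i, d in enumerate(dists) if d == 0.0]
        if hits:
            # The point coincides with one or more centers:
            # share full membership among those centers.
            for i in hits:
                memberships[i][k] = 1.0 / len(hits)
            continue
        for i, d_i in enumerate(dists):
            memberships[i][k] = 1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
    return memberships

u = fcm_memberships([0.0, 1.0, 4.0], centers=[0.0, 4.0])
print(u)
```

For each point the memberships across clusters sum to one, and in a T-S fuzzy model such memberships would typically inform the premise (antecedent) part of the rules.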
Funding: Supported by the National Natural Science Foundation of China (Grant No. 10871013), the Natural Science Foundation of Beijing (Grant No. 1072004), the Natural Science Foundation of Guangxi (Grant No. 2010GXNSFB013051), and the Graduate Student Foundation of Hechi University (Grant No. 2008QS-N014).
Abstract: In this paper, we present a variable selection procedure that combines basis function approximations with penalized estimating equations for varying-coefficient models with responses missing at random. With appropriate selection of the tuning parameters, we establish the consistency of the variable selection procedure and the optimal convergence rate of the regularized estimators. A simulation study is undertaken to assess the finite sample performance of the proposed variable selection procedure.
Abstract: In supervised learning, the number of values of a response variable can be very high. Grouping these values into a few clusters can be useful for performing accurate supervised classification analyses. On the other hand, selecting relevant covariates is a crucial step in building robust and efficient prediction models. In this paper, we propose an algorithm that simultaneously groups the values of a response variable into a limited number of clusters and selects, stepwise, the best covariates that discriminate this clustering. These objectives are achieved by alternating optimization of a user-defined model selection criterion. This process extends a former version of the algorithm to a more general framework. Possible further developments are also discussed in detail.
Abstract: Survival data with a multi-state structure are frequently observed in follow-up studies. An analytic approach based on a multi-state model (MSM) should be used in longitudinal health studies in which a patient experiences a sequence of clinical progression events. One main objective in the MSM framework is variable selection, where attempts are made to identify the risk factors associated with the transition hazard rates or probabilities of disease progression. The usual variable selection methods, including stepwise and penalized methods, do not provide information about the importance of variables. In this context, we present a two-step algorithm to evaluate the importance of variables for multi-state data. Three machine learning approaches (random forest, gradient boosting, and neural network), chosen as the most widely used methods, are considered for estimating variable importance in order to identify the factors affecting disease progression and rank these factors according to their importance. The performance of the proposed methods is validated by simulation and applied to the COVID-19 data set. The results show that the proposed two-step method has promising performance for estimating variable importance.
Funding: Supported by the Fundamental Research Funds for the Central Universities (Grant No. XDJK2019D041), the Research Innovation Programs for Graduate Students of Chongqing, China (Grant No. CYS19123), and the National Undergraduate Innovation and Entrepreneurship Training Programs (Grant No. 201810635015).
Abstract: Capturing leaf color variances over space is important for diagnosing plant nutrient and health status, estimating water availability, and improving the ornamental and tourism value of plants. In this study, leaf color variances of the Eurasian smoke tree, Cotinus coggygria, were estimated from geographic and climate variables in a shrub community using generalized elastic net (GELnet) and support vector machine (SVM) algorithms. Results reveal that leaf color varied over space and that the variances resulted from geography, through its effect on the solar radiation, temperature, illumination and moisture of the shrub environment, whereas the influence of climate was not obvious. The SVM and GELnet models performed similarly in estimating leaf color indices from geographic variables, demonstrating that both techniques have the potential to estimate leaf color variances of C. coggygria in a shrubbery with a complex geographical environment in the absence of human activity.