Abstract: In ultra-high-dimensional data, the response variable is often multi-class. This paper proposes a model-free feature screening method for multi-class responses that uses the Jensen-Shannon divergence to measure the importance of covariates. The method computes the Jensen-Shannon divergence between the conditional distribution of a covariate given each response class and the unconditional distribution of that covariate, and then weights these divergences by the class probabilities of the response to obtain a weighted Jensen-Shannon divergence; a larger value indicates a more important covariate. We also investigate an adapted version for the case where the number of categories varies across covariates, in which the weighted Jensen-Shannon divergence is adjusted by the logarithm of the covariate's number of categories. Theoretical analysis and simulations show that the proposed methods have the sure screening and ranking consistency properties. Finally, simulation and real-data experiments show that, in feature screening, the proposed methods are robust in performance and computationally faster than an existing method.
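The weighted Jensen-Shannon score described in this abstract can be sketched as follows. This is a minimal illustration for categorical covariates and a categorical response, not the authors' implementation: the function names `weighted_js` and `screen`, the use of empirical frequencies as probability estimates, the natural logarithm, and the choice to divide by the log of the number of categories in the adjusted version are all assumptions made for this sketch.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability cells of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def weighted_js(x, y, adjust=False):
    """Weighted JS divergence between P(X | Y = r) and P(X), weighted by P(Y = r).

    If adjust is True, the score is divided by log(number of categories of X)
    (an assumed form of the adjustment mentioned in the abstract).
    """
    x = np.asarray(x)
    y = np.asarray(y)
    levels = np.unique(x)
    p_x = np.array([np.mean(x == l) for l in levels])
    score = 0.0
    for r in np.unique(y):
        in_class = (y == r)
        p_x_given_r = np.array([np.mean(x[in_class] == l) for l in levels])
        score += np.mean(in_class) * js(p_x_given_r, p_x)
    if adjust and len(levels) > 1:
        score /= np.log(len(levels))
    return score

def screen(X, y, d, adjust=False):
    """Return the column indices of the d covariates with the largest scores."""
    scores = np.array([weighted_js(X[:, j], y, adjust) for j in range(X.shape[1])])
    return np.argsort(-scores)[:d]
```

A covariate that is independent of the response has conditional distributions equal to its marginal distribution, so its score is zero; `screen(X, y, d)` then keeps the `d` highest-scoring columns.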
Funding: Supported by the National Natural Science Foundation of China (11571112), the National Social Science Foundation Key Program (17ZDA091), the Natural Science Fund of the Education Department of Anhui Province (KJ2013B233), and the 111 Project of China (B14019).
Abstract: Ultra-high-dimensional data with grouping structures arise naturally in many contemporary statistical problems, such as genome-wide association studies and multi-factor analysis of variance (ANOVA). To address this setting, we propose a group screening method that performs variable selection on groups of variables in linear models. The method is based on working independence, and we establish its sure screening property. To enhance finite-sample performance, we develop a data-driven thresholding rule and a two-stage iterative procedure. To the best of our knowledge, screening for grouped variables has rarely appeared in the literature, and this method can be regarded as an important and non-trivial extension of screening for individual variables. An extensive simulation study and a real data analysis demonstrate its finite-sample performance.
Funding: Gaorong Li's research was supported in part by the National Natural Science Foundation of China [number 11471029]. Tiejun Tong's research was supported in part by the National Natural Science Foundation of China [number 11671338] and the Hong Kong Baptist University grants [grant number FRG2/15-16/019], [grant number FRG1/16-17/018].
Abstract: In this paper, we study ultra-high-dimensional partially linear models in which the dimension of the linear predictors grows exponentially with the sample size. For variable screening, we propose a sequential profile Lasso method (SPLasso) and show that it possesses the screening property. SPLasso can also detect all relevant predictors with probability tending to one, regardless of whether the ultra-high-dimensional model involves both parametric and nonparametric parts. To select the best subset among the models generated by SPLasso, we propose an extended Bayesian information criterion (EBIC) for choosing the final model. We also conduct simulation studies and analyze a real data example to assess the performance of the proposed method and compare it with an existing method.
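The final model-selection step can be illustrated with a small sketch. The abstract does not give the form of its EBIC, so the code below assumes the commonly used Gaussian linear-model form, n log(RSS/n) + k log n + 2 gamma log C(p, k), and applies it to a plain linear model without the nonparametric part. The names `log_binom`, `ebic`, and `select_by_ebic`, the default `gamma=1.0`, and the omission of an intercept are all hypothetical choices for illustration, not the paper's procedure.

```python
import math
import numpy as np

def log_binom(p, k):
    """Logarithm of the binomial coefficient C(p, k), computed via log-gamma."""
    return math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)

def ebic(X, y, support, gamma=1.0):
    """Assumed EBIC for a least-squares fit on the columns listed in support."""
    n, p = X.shape
    k = len(support)
    if k == 0:
        resid = y  # empty model: no predictors, no intercept
    else:
        Xs = X[:, support]
        beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
        resid = y - Xs @ beta
    rss = float(np.sum(resid ** 2))
    return n * math.log(rss / n) + k * math.log(n) + 2.0 * gamma * log_binom(p, k)

def select_by_ebic(X, y, candidates, gamma=1.0):
    """Return the candidate support (a list of column indices) with the smallest EBIC."""
    return min(candidates, key=lambda s: ebic(X, y, s, gamma))
```

Here `candidates` stands for the sequence of nested index sets that a sequential screening procedure would produce; setting `gamma=0.0` reduces the criterion to the ordinary BIC.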