It is common for datasets to contain both categorical and continuous variables. However, many feature screening methods designed for high-dimensional classification assume that the variables are continuous, which limits their applicability in this mixed setting. To address this issue, we propose a model-free feature screening approach for ultra-high-dimensional multi-classification that can handle both categorical and continuous variables. The proposed method uses the Maximal Information Coefficient to assess the predictive power of the variables. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. We conduct simulation studies and real data analyses to demonstrate its finite-sample performance. In summary, the proposed method offers a way to screen features effectively in ultra-high-dimensional datasets with a mixture of categorical and continuous covariates.
It is quite common for both categorical and continuous covariates to appear in the data. However, most feature screening methods for ultrahigh-dimensional classification assume that the covariates are continuous, so applicable feature screening methods are very limited. To handle this non-trivial situation, we propose a model-free feature screening method for ultrahigh-dimensional multi-classification with both categorical and continuous covariates. The proposed method is based on Gini impurity, which is used to evaluate the prediction power of the covariates. Under certain regularity conditions, it is proved that the proposed screening procedure possesses the sure screening and ranking consistency properties. We demonstrate the finite-sample performance of the proposed procedure by simulation studies and illustrate it with a real data analysis.
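As a rough illustration of impurity-based marginal screening, the sketch below ranks covariates by how much conditioning on each one lowers the Gini impurity of the class label. It is not the index defined in the paper: the function names (`gini`, `gini_reduction`, `screen`), the quantile slicing of continuous covariates, the choice `n_bins=4`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_r p_r^2 of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_reduction(x, y, categorical, n_bins=4):
    """Drop in the Gini impurity of y after conditioning on x.

    Categorical covariates are grouped by level; continuous ones are
    sliced at empirical quantiles (an assumption of this sketch).
    """
    if categorical:
        groups = x
    else:
        inner = np.linspace(0, 1, n_bins + 1)[1:-1]
        groups = np.digitize(x, np.quantile(x, inner))
    score = gini(y)
    for g in np.unique(groups):
        mask = groups == g
        # Subtract the group-weighted conditional impurity
        score -= mask.mean() * gini(y[mask])
    return score

def screen(X, y, is_categorical, d):
    """Return the indices of the d highest-scoring covariates and all scores."""
    scores = np.array([
        gini_reduction(X[:, j], y, is_categorical[j])
        for j in range(X.shape[1])
    ])
    return np.argsort(-scores)[:d], scores

# Hypothetical toy data: 3 classes, 50 covariates, 2 of them informative
rng = np.random.default_rng(0)
n, p = 300, 50
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, p))
X[:, 0] = y + 0.3 * rng.normal(size=n)                    # continuous signal
flip = rng.random(n) < 0.1
X[:, 1] = np.where(flip, rng.integers(0, 3, size=n), y)   # categorical signal
is_categorical = np.zeros(p, dtype=bool)
is_categorical[1] = True

selected, scores = screen(X, y, is_categorical, d=2)
```

Here `selected` holds the indices of the two covariates with the largest impurity reduction. In practice the number of retained covariates `d` would be chosen by a rule tied to the sample size rather than fixed by hand.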
In this paper, we develop a flexible semiparametric model averaging marginal regression procedure to forecast the joint conditional quantile function of the response variable for ultrahigh-dimensional data. First, we approximate the joint conditional quantile function by a weighted average of one-dimensional marginal conditional quantile functions that have varying coefficient structures. Then, a local linear regression technique is employed to derive consistent estimates of the marginal conditional quantile functions. Second, based on the estimated marginal conditional quantile functions, we estimate and select the significant model weights involved in the approximation by a nonconvex penalized quantile regression. Under some relaxed conditions, we establish the asymptotic properties of the nonparametric kernel estimators and the oracle estimators of the model averaging weights. We further derive the oracle property for the proposed nonconvex penalized model averaging procedure. Finally, simulation studies and a real data analysis are conducted to illustrate the merits of the proposed model averaging method.
Using the so-called martingale difference correlation (MDC), we propose a novel censored conditional-quantile screening approach for ultrahigh-dimensional survival data with heterogeneity (which is often present in such data). By incorporating a weighting scheme, this method is a natural extension of the MDC-based conditional quantile screening considered by Shao and Zhang (2014) to ultrahigh-dimensional survival data. The proposed screening procedure has a sure-screening property under certain technical conditions and an excellent capability of detecting nonlinear relationships between independent variables and censored dependent variables. Both simulation results and an analysis of real data demonstrate the effectiveness of the new censored conditional-quantile screening procedure.
This article develops a procedure for screening variables, in ultra-high-dimensional settings, based on their predictive significance. This is achieved by ranking the variables according to the variance of their respective marginal regression functions (RV-SIS). We show that, under some mild technical conditions, the RV-SIS possesses a sure screening property, which is defined by Fan and Lv (2008). Numerical comparisons suggest that RV-SIS has competitive performance compared to other screening procedures, and outperforms them in many different model settings.
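The ranking idea can be sketched as follows: estimate each marginal regression function E[Y | X_j], take the sample variance of its fitted values, and sort the covariates by that variance. The sketch uses quantile-slice means as the marginal estimator, which is an assumption of this example and may differ from the estimator used in the article; `marginal_regression_variance`, `rank_by_regression_variance`, `n_bins=10`, and the toy model are likewise hypothetical.

```python
import numpy as np

def marginal_regression_variance(x, y, n_bins=10):
    """Variance of a crude estimate of E[y | x].

    The regression function is estimated by the mean of y within
    quantile slices of x (a stand-in for a smoother).
    """
    inner = np.linspace(0, 1, n_bins + 1)[1:-1]
    bins = np.digitize(x, np.quantile(x, inner))
    fitted = np.empty_like(y, dtype=float)
    for b in np.unique(bins):
        mask = bins == b
        fitted[mask] = y[mask].mean()
    return fitted.var()

def rank_by_regression_variance(X, y, n_bins=10):
    """Return covariate indices sorted by decreasing score, plus the scores."""
    scores = np.array([
        marginal_regression_variance(X[:, j], y, n_bins)
        for j in range(X.shape[1])
    ])
    return np.argsort(-scores), scores

# Hypothetical toy model: y depends on X_0 only, through a quadratic
rng = np.random.default_rng(1)
n, p = 400, 30
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + 0.5 * rng.normal(size=n)

ranking, rv_scores = rank_by_regression_variance(X, y)
```

The quadratic signal is chosen because X_0 and y are then nearly uncorrelated linearly, so a ranking based on the variance of the marginal regression function can still place X_0 first where a correlation-based ranking might not.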
Next Generation Sequencing (NGS) provides an effective basis for estimating the survival time of cancer patients, but it also poses the problem of high data dimensionality. In addition, some patients drop out of the study, which leaves responses missing. A method for estimating the mean of a response variable with missing values in ultra-high-dimensional datasets is therefore needed. In this paper, we propose a two-stage ultra-high-dimensional variable screening method, RF-SIS, based on random forest regression, which addresses the difficulty of estimating missing values when the data dimension is excessive. After dimension reduction by RF-SIS, mean interpolation is applied to the missing responses (RF-SIS-MI). Results on simulated data show that, compared with estimation after directly deleting the missing observations, RF-SIS-MI has significant advantages in terms of the proportion of intervals covered, the average interval length, and the average absolute deviation.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 12201091); the Natural Science Foundation of Chongqing (Grant Nos. CSTB2022NSCQ-MSX0852, cstc2021jcyj-msxmX0502); the Innovation Support Program for Chongqing Overseas Returnees (Grant No. cx2020025); the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant Nos. KJQN202100526, KJQN201900511); the National Statistical Science Research Program (Grant No. 2022LY019); and the Chongqing University Innovation Research Group Project: Nonlinear Optimization Method and Its Application (Grant No. CXQT20014).
Funding: Supported by the National Statistical Scientific Research Projects (Grant No. 2015LZ54).