Funding: Supported by the Natural Science Foundation of Henan (Grant No. 202300410066) and the Program for Science and Technology Development of Henan Province (Grant No. 242102310350).
Abstract: This paper is concerned with ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on a copula-based measure of conditional mean dependence (CC-SIS, for short). The CC-SIS can be implemented as easily as the sure independence screening procedures based respectively on the Pearson correlation, the conditional mean and the distance correlation (SIS, SIRS and DC-SIS, for short), and it can significantly improve the performance of feature screening. We establish the sure screening property for the CC-SIS and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the CC-SIS performs better than the competing methods in various models. Finally, we illustrate the CC-SIS through a real data example.
Funding: Supported by the National Natural Science Foundation of China (11528102) and the National Institutes of Health (U01CA209414).
Abstract: Many modern biomedical studies have yielded survival data with high-throughput predictors. The goals of scientific research often lie in identifying predictive biomarkers, understanding biological mechanisms, and making accurate and precise predictions. Variable screening is a crucial first step in achieving these goals. This work conducts a selective review of feature screening procedures for survival data with ultrahigh dimensional covariates. We present the main methodologies, along with the key conditions that ensure sure screening properties. The practical utility of these methods is examined via extensive simulations. We conclude the review with some future opportunities in this field.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11301435 and 71131008) and the Fundamental Research Funds for the Central Universities.
Abstract: Sure independence screening (SIS) has been proposed to reduce ultrahigh dimensionality down to a moderate scale and has been proved to enjoy the sure screening property under Gaussian linear models. However, in practice the observed response is often skewed or heavy-tailed with extreme values, which may dramatically deteriorate the performance of SIS. To address this, we propose a new robust sure independence screening (RoSIS) procedure that considers the correlation between each predictor and the distribution function of the response. The proposed approach contributes to the literature in three ways. First, it is able to reduce ultrahigh dimensionality effectively. Second, it is robust to heavy tails or extreme values in the response. Third, it possesses both the sure screening property and the ranking consistency property under milder conditions. Furthermore, we demonstrate its excellent finite sample performance through numerical simulations and a real data example.
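The idea of screening on the correlation between each predictor and the distribution function of the response can be illustrated with a minimal sketch. It is not the authors' exact RoSIS statistic: it assumes the distribution function is replaced by the empirical one (scaled ranks of the response) and that predictors are ranked by the absolute sample correlation with it. The function name `cdf_correlation_screen` and the cutoff `d` are hypothetical.

```python
import numpy as np

def cdf_correlation_screen(X, y, d):
    """Rank predictors by |corr(X_j, F_n(Y))| and keep the top d (illustrative only)."""
    n = len(y)
    # Empirical distribution function of the response at each observation: rank / n.
    Fy = (np.argsort(np.argsort(y)) + 1) / n
    Xc = X - X.mean(axis=0)
    Fc = Fy - Fy.mean()
    sd = Xc.std(axis=0)
    # Constant columns get an infinite scale, hence a score of zero.
    sd = np.where(sd > 0, sd, np.inf)
    score = np.abs(Xc.T @ Fc) / (n * sd * Fc.std())
    top = np.argsort(-score)[:d]
    return top, score

# Toy example (simulated data, assumed for illustration): Cauchy noise in the response.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 500))
y = 3 * X[:, 0] - 3 * X[:, 5] + rng.standard_cauchy(300)
top, score = cdf_correlation_screen(X, y, d=10)
print(sorted(top.tolist()))
```

Because the response enters only through its ranks, a few extreme values of `y` cannot dominate the score, which is the intuition behind the robustness claim in the abstract.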
Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 11571112, 11501372, 11571148 and 11471160; the Doctoral Fund of the Ministry of Education of China under Grant No. 20130076110004; the Program of Shanghai Subject Chief Scientist under Grant No. 14XD1401600; and the 111 Project of China under Grant No. B14019.
Abstract: This paper considers feature screening and variable selection for ultrahigh dimensional covariates. The new feature screening procedure is based on the conditional expectation, which is used to determine whether an explanatory variable contributes to a response variable, without requiring a specific parametric form of the underlying data model. The authors estimate the marginal conditional expectation by a kernel regression estimator. The proposed method is shown to have the sure screening property. The authors also propose an iterative kernel estimator algorithm to reduce the ultrahigh dimensionality to an appropriate scale. Simulation results and real data analysis demonstrate that the proposed method works well and performs better than competing methods.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11971324 and 11471223); Interdisciplinary Construction of Bioinformatics and Statistics; and the Academy for Multidisciplinary Studies, Capital Normal University.
Abstract: In this paper, we study how to estimate the error density in the ultrahigh dimensional sparse additive model, where the number of variables is larger than the sample size. First, a smoothing method based on B-splines is applied to estimate the regression functions. Second, an improved two-stage refitted cross-validation (RCV) procedure based on a random splitting technique is used to obtain the residuals of the model, and the residual-based kernel method is then applied to estimate the error density function. Under suitable sparsity conditions, the large sample properties of the estimator are obtained, including weak and strong consistency, asymptotic normality and the law of the iterated logarithm. In particular, the relationship between the sparsity and the convergence rate of the kernel density estimator is given. The methodology is illustrated by simulations and a real data example, which suggest that the proposed method performs well.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 11971324) and the State Key Program of the National Natural Science Foundation of China (Grant No. 12031016).
Abstract: This paper focuses on error density estimation in the ultrahigh dimensional sparse linear model, where the error term may have a heavy-tailed distribution. First, an improved two-stage refitted cross-validation method, combined with robust variable screening procedures such as RRCS and variable selection methods such as LAD-SCAD, is used to obtain the submodel. The residual-based kernel density method is then applied to estimate the error density through LAD regression. Under given conditions, the large sample properties of the estimator are established. In particular, we explicitly give the relationship between the sparsity and the convergence rate of the kernel density estimator. The simulation results show that the proposed error density estimator performs well. A real data example is presented to illustrate our methods.
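The two-stage refitted cross-validation idea followed by a residual-based kernel density estimate can be sketched roughly as below. This is only an illustration under simplifying assumptions, not the procedure of the paper: marginal Pearson correlation screening stands in for RRCS and LAD-SCAD, ordinary least squares stands in for LAD regression, and a Gaussian kernel with a rule-of-thumb bandwidth is assumed. The names `rcv_residuals`, `kernel_density`, `d` and `h` are hypothetical.

```python
import numpy as np

def rcv_residuals(X, y, d, rng):
    """Split the sample in two; select d variables on one half, refit on the other, and swap."""
    n = len(y)
    idx = rng.permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    resid = []
    for a, b in (halves, halves[::-1]):
        # Stage 1 (assumed stand-in): marginal correlation screening on half a.
        Xa = X[a] - X[a].mean(axis=0)
        ya = y[a] - y[a].mean()
        score = np.abs(Xa.T @ ya) / (np.linalg.norm(Xa, axis=0) * np.linalg.norm(ya))
        keep = np.argsort(-score)[:d]
        # Stage 2 (assumed stand-in): least-squares refit on half b with the selected variables.
        Zb = np.column_stack([np.ones(len(b)), X[b][:, keep]])
        beta, *_ = np.linalg.lstsq(Zb, y[b], rcond=None)
        resid.append(y[b] - Zb @ beta)
    return np.concatenate(resid)

def kernel_density(resid, grid, h=None):
    """Gaussian kernel density estimate of the error density, evaluated on grid."""
    n = len(resid)
    if h is None:
        h = 1.06 * resid.std(ddof=1) * n ** (-0.2)  # rule-of-thumb bandwidth (assumption)
    u = (grid[:, None] - resid[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# Toy example (simulated data, assumed for illustration): heavy-tailed t(3) errors.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 1000))
y = 2 * X[:, 0] - 2 * X[:, 1] + rng.standard_t(3, size=400)
resid = rcv_residuals(X, y, d=5, rng=rng)
grid = np.linspace(-40, 40, 4001)
dens = kernel_density(resid, grid)
```

Selecting variables on one half and refitting on the other keeps spuriously correlated predictors chosen in the first stage from shrinking the residuals in the second, which is the motivation for using refitted residuals in the density estimate.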
Funding: Supported by the National Natural Science Foundation of China (Nos. 11171112, 11201190, 11101158), the Doctoral Fund of the Ministry of Education of China (20130076110004) and the 111 Project of China (B14019).
Abstract: In this paper, we propose the Gini correlation screening (GCS) method to select the important variables in ultrahigh dimensional data. The new procedure is based on the Gini correlation coefficient, defined via the covariance between the response and the rank of each predictor variable, rather than on the Pearson correlation or the Kendall τ correlation coefficient. The new method does not impose a specific model structure on the regression functions and requires only that the predictors and the response have continuous distribution functions. We demonstrate that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. The procedure is computationally efficient and simple, and it exhibits competitive empirical performance in our intensive simulations and real data analysis.
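A small sketch may help make the ranking statistic concrete. It assumes the common sample form of the Gini correlation, the covariance between the response and the scaled ranks of a predictor divided by the covariance between the response and its own scaled ranks. Whether this matches the exact estimator used in the paper is an assumption, and the names `gini_correlation_screen` and `d` are hypothetical.

```python
import numpy as np

def gini_correlation_screen(X, y, d):
    """Rank predictors by |Gini correlation| with the response and keep the top d."""
    n = len(y)
    # Scaled ranks (empirical distribution functions) of each predictor and of the response.
    RX = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / n
    Ry = (np.argsort(np.argsort(y)) + 1) / n
    yc = y - y.mean()
    num = yc @ (RX - RX.mean(axis=0)) / n  # cov(Y, F_j(X_j)) for every column j
    den = yc @ (Ry - Ry.mean()) / n        # cov(Y, F_Y(Y)); positive when y is not constant
    gamma = num / den
    top = np.argsort(-np.abs(gamma))[:d]
    return top, gamma

# Toy example (simulated data, assumed for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 500))
y = 2 * X[:, 2] + 2 * X[:, 7] + rng.standard_normal(300)
top, gamma = gini_correlation_screen(X, y, d=10)
print(sorted(top.tolist()))
```

Only ranks of the predictors are used, so the statistic is unchanged by any strictly increasing transformation of a predictor, and with no ties each sample value lies in [-1, 1].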