Funding: Outstanding Youth Foundation of Hunan Provincial Department of Education (Grant No. 22B0911).
Abstract: In this paper, we introduce the censored composite conditional quantile coefficient (cCCQC) to rank the relative importance of each predictor in high-dimensional censored regression. The cCCQC takes advantage of all useful information across quantiles and can effectively detect nonlinear effects, including interactions and heterogeneity. Furthermore, the proposed screening method based on the cCCQC is robust to outliers and enjoys the sure screening property. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors, particularly when the variables are highly correlated.
Abstract: This article develops a procedure for screening variables in ultrahigh-dimensional settings based on their predictive significance. This is achieved by ranking the variables according to the variance of their respective marginal regression functions (RV-SIS). We show that, under some mild technical conditions, RV-SIS possesses the sure screening property defined by Fan and Lv (2008). Numerical comparisons suggest that RV-SIS is competitive with other screening procedures and outperforms them in many different model settings.
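As an illustration of ranking by the variance of a marginal regression function, the sketch below estimates Var(E[Y | X_j]) with a simple equal-count binning of each predictor. The binning estimator, the function name `rv_utilities`, and the `n_bins` parameter are assumptions made for this example. The abstract does not say which estimator RV-SIS uses, so this should not be read as the authors' implementation.

```python
import numpy as np

def rv_utilities(X, y, n_bins=10):
    """Estimate Var(E[y | X_j]) for every column j with equal-count bins.

    Hypothetical helper for illustration; the binning estimator is an
    assumption, not the estimator specified by the RV-SIS paper.
    """
    n, p = X.shape
    y_bar = y.mean()
    util = np.empty(p)
    for j in range(p):
        # Sort the responses by the j-th predictor and cut them into bins.
        order = np.argsort(X[:, j])
        bins = np.array_split(y[order], n_bins)
        means = np.array([b.mean() for b in bins])
        weights = np.array([len(b) for b in bins]) / n
        # Weighted between-bin variance of the bin means of y.
        util[j] = np.sum(weights * (means - y_bar) ** 2)
    return util

# Usage: a purely quadratic signal, which linear correlation would miss.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 50))
y = X[:, 0] ** 2 + 0.5 * rng.standard_normal(400)
ranking = np.argsort(rv_utilities(X, y))[::-1]  # most important first
```

A between-bin variance is used here because it is large whenever the conditional mean of the response changes with the predictor, whether or not that change is linear.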
Abstract: Next Generation Sequencing (NGS) provides an effective basis for estimating the survival time of cancer patients, but it also poses the problem of high data dimensionality. In addition, some patients drop out of the study, which leaves responses missing. A method for estimating the mean of a response variable with missing values in ultrahigh-dimensional datasets is therefore needed. In this paper, we propose a two-stage ultrahigh-dimensional variable screening method, RF-SIS, based on random forest regression, which addresses the difficulty of estimating missing values when the data dimension is excessive. After dimension reduction by RF-SIS, mean interpolation is applied to the missing responses (RF-SIS-MI). Results on simulated data show that, compared with estimation that directly deletes the missing observations, RF-SIS-MI has significant advantages in terms of the proportion of intervals covered, the average interval length, and the average absolute deviation.
Funding: Supported by the National Natural Science Foundation of China (11528102) and the National Institutes of Health (U01CA209414).
Abstract: Many modern biomedical studies have yielded survival data with high-throughput predictors. The goals of scientific research often lie in identifying predictive biomarkers, understanding biological mechanisms, and making accurate and precise predictions. Variable screening is a crucial first step in achieving these goals. This work conducts a selective review of feature screening procedures for survival data with ultrahigh-dimensional covariates. We present the main methodologies, along with the key conditions that ensure sure screening properties. The practical utility of these methods is examined via extensive simulations. We conclude the review with some future opportunities in this field.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11301435 and 71131008) and the Fundamental Research Funds for the Central Universities.
Abstract: Sure independence screening (SIS) has been proposed to reduce ultrahigh dimensionality to a moderate scale and has been proved to enjoy the sure screening property under Gaussian linear models. In practice, however, the observed response is often skewed or heavy-tailed with extreme values, which may dramatically deteriorate the performance of SIS. To this end, we propose a new robust sure independence screening (RoSIS) procedure that considers the correlation between each predictor and the distribution function of the response. The proposed approach contributes to the literature in three ways. First, it is able to reduce ultrahigh dimensionality effectively. Second, it is robust to heavy tails or extreme values in the response. Third, it possesses both the sure screening property and the ranking consistency property under milder conditions. Furthermore, we demonstrate its excellent finite-sample performance through numerical simulations and a real data example.
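The idea of correlating each predictor with the distribution function of the response can be sketched as follows. This is one plausible reading of the abstract, assuming the distribution function is replaced by the empirical distribution function and the utility is the absolute Pearson correlation; the names `rosis_utilities` and `top_d` are hypothetical, and the paper's exact statistic and threshold may differ.

```python
import numpy as np

def rosis_utilities(X, y):
    """|corr(X_j, F_n(y))| for every column j of X.

    Illustrative sketch: F_n is the empirical distribution function of y.
    Columns of X are assumed to be non-constant.
    """
    n = y.shape[0]
    # F_n(y_i) = (number of observations <= y_i) / n, which handles ties.
    F = np.searchsorted(np.sort(y), y, side="right") / n
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean()
    cov = Xc.T @ Fc / n
    return np.abs(cov / (X.std(axis=0) * F.std()))

def top_d(utilities, d):
    """Indices of the d predictors with the largest utilities."""
    return np.argsort(utilities)[::-1][:d]

# Usage: two active predictors and heavy-tailed (Cauchy) noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 200))
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.standard_cauchy(300)
kept = top_d(rosis_utilities(X, y), d=10)
```

Because the response enters only through its ranks, a few extreme values of `y` cannot dominate the utility, which is the robustness the abstract refers to.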
Funding: Supported by the Graduate Innovation Foundation of Shanghai University of Finance and Economics of China (Grant Nos. CXJJ-2014-459 and CXJJ-2015-430); the National Natural Science Foundation of China (Grant No. 71271128); the State Key Program of the National Natural Science Foundation of China (Grant No. 71331006); the State Key Program in the Major Research Plan of the National Natural Science Foundation of China (Grant No. 91546202); the National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences (Grant No. 2008DP173182); and the Innovative Research Team in Shanghai University of Finance and Economics (Grant No. IRT13077).
Abstract: The curse of high dimensionality arises more and more frequently in statistics, and many techniques have been developed to address this challenge for classification problems. We propose a novel feature screening procedure for dichotomous response data. The new method can be implemented as easily as the t-test marginal screening approach, is free of any subexponential tail probability conditions and moment requirements, and is not restricted to a specific model structure. We prove that our method possesses the sure screening property, illustrate the effect of screening by Monte Carlo simulation, and apply it to a real data example.
Abstract: The freshness and quality indices of whiting (Merlangius merlangus), which are influenced by a large number of volatile chemical compounds, are analyzed here in order to select the most relevant compounds as predictors for these indices. The selection was performed by means of recent statistical variable selection methods, namely robust model-free feature screening based on quantile correlation and composite quantile correlation. The compounds 2-Methyl-1-butanol, 3-Methyl-1-butanol, Ethanol, Trimethylamine, 3-Methyl butanal, 2-Methyl-1-propanol, Ethyl acetate, 1-Butanol and 2,3-Butanedione were identified as major predictors for the freshness index. The compounds 3-Methyl-1-butanol, 2-Methyl-1-butanol, Ethanol, 3-Methyl butanal, 3-Hydroxy-2-butanone, 1-Butanol, 2,3-Butanedione, 3-Pentanol, 3-Pentanone and 2-Methyl-1-propanol were identified as major predictors for the quality index.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11401497 and 11301435); the Fundamental Research Funds for the Central Universities (Grant No. T2013221043); the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry; the Fundamental Research Funds for the Central Universities (Grant No. 20720140034); the National Institute on Drug Abuse, National Institutes of Health (Grant Nos. P50 DA036107 and P50 DA039838); and the National Science Foundation (Grant No. DMS1512422). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse, the National Institutes of Health, the National Science Foundation or the National Natural Science Foundation of China.
Abstract: High-dimensional data are frequently collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental for high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension may grow exponentially with the sample size. Such data are called ultrahigh-dimensional in the literature. This work presents a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models and on the motivation for model-free feature screening procedures.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 11901006 and the Natural Science Foundation of Anhui Province under Grant Nos. 1908085QA06 and 1908085MA20.
Abstract: This paper proposes a new sure independence screening procedure for high-dimensional survival data based on censored quantile correlation (CQC). The framework has two distinctive features: 1) by incorporating a weighting scheme, our metric is a natural extension of the quantile correlation (QC) considered by Li (2015) to high-dimensional survival data; 2) the proposed method is not only robust against outliers but can also discover nonlinear relationships between the independent variables and the censored dependent variable. Additionally, the proposed method enjoys the sure screening property under certain technical conditions. Simulation results demonstrate that the proposed method performs competitively on survival datasets with high-dimensional predictors.
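For orientation, the uncensored quantile correlation that CQC extends can be sketched as below. The formula is the one commonly stated for QC, qcor_tau(Y, X) = cov(I(Y - Q_tau(Y) > 0), X) / sqrt((tau - tau^2) var(X)), and is given here from memory rather than from the abstract. The censoring weights are not described in the abstract, so they are deliberately left out, and the function name `quantile_correlation` is hypothetical.

```python
import numpy as np

def quantile_correlation(y, x, tau):
    """Sample quantile correlation between y and x at level tau (no censoring).

    Illustrative sketch of the uncensored QC only; the weighting scheme
    that turns it into CQC is not specified in the abstract and is omitted.
    """
    q = np.quantile(y, tau)                  # sample tau-quantile of y
    psi = tau - (y - q < 0).astype(float)    # psi_tau(y - q) = tau - I(y - q < 0)
    qcov = np.mean(psi * (x - x.mean()))     # quantile covariance
    return qcov / np.sqrt((tau - tau ** 2) * x.var())

# Usage: screen predictors by |qcor| at the median.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 20))
y = 2 * X[:, 3] + rng.standard_normal(500)
utilities = np.abs([quantile_correlation(y, X[:, j], 0.5) for j in range(X.shape[1])])
```

The response enters only through the indicator of whether it exceeds its quantile, which is why measures of this type are insensitive to outliers in the response.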
Funding: Supported by the National Natural Science Foundation of China (Grant No. 11701034) and the National Science Foundation of USA (Grant No. DMS1820702).
Abstract: In this paper, we propose a new correlation, called stable correlation, to measure the dependence between two random vectors. The new correlation is well defined without any moment condition and is zero if and only if the two random vectors are independent. We also study its other theoretical properties. Based on the new correlation, we further propose a robust model-free feature screening procedure for ultrahigh-dimensional data and establish its sure screening and rank consistency properties without imposing the subexponential or sub-Gaussian tail condition that is commonly required in the feature screening literature. We also examine the finite-sample performance of the proposed robust feature screening procedure via Monte Carlo simulation studies and illustrate the procedure with a real data example.
Funding: Supported by the National Natural Science Foundation of China (CN) (11571112); the National Social Science Foundation Key Program (17ZDA091); the Natural Science Fund of Education Department of Anhui Province (KJ2013B233); and the 111 Project of China (B14019).
Abstract: Ultrahigh-dimensional data with grouping structures arise naturally in many contemporary statistical problems, such as genome-wide association studies and multi-factor analysis of variance (ANOVA). To address this issue, we propose a group screening method that performs variable selection on groups of variables in linear models. The method is based on a working independence assumption, and the sure screening property is established for our approach. To enhance the finite-sample performance, a data-driven thresholding rule and a two-stage iterative procedure are developed. To the best of our knowledge, screening for grouped variables has rarely appeared in the literature, and this method can be regarded as an important and non-trivial extension of screening for individual variables. An extensive simulation study and a real data analysis demonstrate its finite-sample performance.
Funding: Supported by the Singapore Ministry of Education's ACRF Tier 1 (Grant No. R-155-000-065-112), and by the Natural Sciences and Engineering Research Council of Canada and MITACS, Canada.
Abstract: Feature selection with a relatively small sample size and an extremely high-dimensional feature space is common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computation becomes intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle the problem. Along this line, we develop in this article a tournament screening cum EBIC approach for feature selection with a high-dimensional feature space. The procedure of tournament screening mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a necessary property that any valid screening procedure should satisfy. Numerical studies demonstrate that the tournament screening cum EBIC approach enjoys desirable properties, such as a higher positive selection rate and a lower false discovery rate than other approaches.
Funding: Supported by the Research Grant Council of Hong Kong (Grant Nos. 509413 and 14311916); Direct Grants for Research of The Chinese University of Hong Kong (Grant Nos. 3132754 and 4053235); the Natural Science Foundation of Jiangxi Province (Grant No. 20161BAB201024); the Key Science Fund Project of Jiangxi Province Education Department (Grant No. GJJ150439); the National Natural Science Foundation of China (Grant Nos. 11461029, 11601197 and 61562030); and the Canadian Institutes of Health Research (Grant No. 145546).
Abstract: With the rapid growth in the size of scientific data across disciplines, feature screening plays an important role in reducing high dimensionality to a moderate scale in many scientific fields. In this paper, we introduce a unified and robust model-free feature screening approach for high-dimensional survival data with censoring. The approach has several advantages. It is model-free under a general model framework and hence avoids the complication of specifying an actual model form with a huge number of candidate variables. Under mild conditions that do not require the existence of any moment of the response, it enjoys the ranking consistency and sure screening properties in ultrahigh dimensions. In particular, we impose a conditional independence assumption between the response and the censoring variable given each covariate, instead of assuming that the censoring variable is independent of the response and the covariates. Moreover, we propose a more robust variant of the new procedure, which possesses desirable theoretical properties without any finite moment condition on the predictors or the response. The computation of the proposed methods does not require any complicated numerical optimization and is fast and easy to implement. Extensive numerical studies demonstrate that the proposed methods perform competitively in various configurations. The application is illustrated with an analysis of a genetic data set.