Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11401497 and 11301435), the Fundamental Research Funds for the Central Universities (Grant Nos. T2013221043 and 20720140034), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the National Institute on Drug Abuse, National Institutes of Health (Grant Nos. P50 DA036107 and P50 DA039838), and the National Science Foundation (Grant No. DMS 1512422). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute on Drug Abuse, National Institutes of Health, National Science Foundation, or National Natural Science Foundation of China.
Abstract: High-dimensional data have frequently been collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental to high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of the data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional data in the literature. This work aims to present a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models and on the motivation for model-free feature screening procedures.
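To make the notion of a marginal utility concrete, the sketch below implements the classical sure independence screening utility, the absolute marginal Pearson correlation between each predictor and the response. This is only one of the utilities such overviews discuss, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by the absolute marginal Pearson correlation with y
    and keep the indices of the top d (sure independence screening)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Marginal utility: |corr(X_j, y)| for each column j.
    omega = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(omega)[::-1][:d]

# Toy example: p >> n, only the first three predictors matter.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.standard_normal(n)
keep = sis_screen(X, y, d=int(n / np.log(n)))  # a commonly used screening size
print(sorted(keep[:10]))
```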
Funding: Supported by the National Institute on Drug Abuse (Grant Nos. R21-DA024260 and P50-DA10075), the National Natural Science Foundation of China (Grant Nos. 11071077, 11371236, 11028103 and 11071022), the Innovation Program of Shanghai Municipal Education Commission, the Pujiang Project of Science and Technology Commission of Shanghai Municipality (Grant No. 12PJ1403200), and the Program for New Century Excellent Talents, Ministry of Education of China (Grant No. NCET-12-0901).
Abstract: We are concerned with robust estimation procedures for the parameters in partially linear models with large-dimensional covariates. To enhance interpretability, we suggest implementing a nonconcave regularization method in the robust estimation procedure to select important covariates from the linear component. We establish consistency for both the linear and the nonlinear components when the covariate dimension diverges at the rate of o(√n), where n is the sample size. We show that the robust estimate of the linear component performs asymptotically as well as its oracle counterpart, which assumes that the baseline function and the unimportant covariates are known a priori. With a consistent estimator of the linear component, we estimate the nonparametric component by robust local linear regression. It is proved that the robust estimate of the nonlinear component performs asymptotically as well as if the linear component were known in advance. Comprehensive simulation studies are carried out, and an application is presented to examine the finite-sample performance of the proposed procedures.
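In generic notation (ours, not necessarily the paper's), the setup described above can be written as a partially linear model whose linear part is fitted by a penalized robust criterion; here ρ stands for a robust loss such as Huber's, and p_λ for a nonconcave penalty such as SCAD.

```latex
% Partially linear model with a p_n-dimensional linear component:
%   Y_i = X_i^\top \beta + g(T_i) + \varepsilon_i, \qquad i = 1, \dots, n.
% Penalized robust criterion for the linear component (schematic form):
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \sum_{i=1}^{n} \rho\!\left( Y_i - X_i^\top \beta - \hat{g}(T_i) \right)
  \;+\; n \sum_{j=1}^{p_n} p_{\lambda}\!\left( |\beta_j| \right)
```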
Funding: Supported by the National Science Foundation of USA (Grant No. P50 DA039838), the Program of China Scholarships Council (Grant No. 201506040130), the National Natural Science Foundation of China (Grant No. 11401497), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the National Key Basic Research Development Program of China (Grant No. 2010CB950703), the Fundamental Research Funds for the Central Universities, the National Institute on Drug Abuse, National Institutes of Health (Grant Nos. P50 DA036107 and P50 DA039838), and the National Science Foundation of USA (Grant No. DMS 1512422).
Abstract: Feature screening plays an important role in ultrahigh-dimensional data analysis. This paper is concerned with conditional feature screening when one is interested in detecting the association between the response and ultrahigh-dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical or environmental variables). To this end, we first propose a new index to measure conditional independence and further develop a conditional screening procedure based on the newly proposed index. We systematically study the theoretical properties of the proposed procedure and establish the sure screening and ranking consistency properties under some very mild conditions. The newly proposed screening procedure enjoys several appealing properties: (a) it is model-free, in that its implementation does not require a specification of the model structure; (b) it is robust to heavy-tailed distributions or outliers in both the response and the predictors; and (c) it can deal with both feature screening and conditional screening in a unified way. We study the finite-sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the proposed method through two real data examples.
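The conditional independence index itself is defined in the paper. Purely to illustrate the generic shape of a conditional, rank-based screening rule, the hypothetical sketch below slices a scalar exposure variable into groups and averages the absolute Kendall's tau of each predictor with the response within slices; it is not the paper's index.

```python
import numpy as np
from scipy.stats import kendalltau

def conditional_screen(X, y, z, d, n_slices=5):
    """Generic conditional screening sketch: slice the exposure z, average a
    rank-based marginal utility (|Kendall's tau|) of each predictor with y
    within slices, and keep the top d predictors. Illustrative only; the
    paper defines its own conditional independence index."""
    n, p = X.shape
    edges = np.quantile(z, np.linspace(0, 1, n_slices + 1))
    slice_id = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_slices - 1)
    utility = np.zeros(p)
    for s in range(n_slices):
        idx = slice_id == s
        w = idx.mean()  # weight each slice by its sample proportion
        for j in range(p):
            tau, _ = kendalltau(X[idx, j], y[idx])
            utility[j] += w * abs(tau)
    return np.argsort(utility)[::-1][:d]
```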
Funding: Supported by National Institute on Drug Abuse grant R21 DA024260. Yan Li is supported by National Science Foundation grant DMS 0348869 as a graduate research assistant.
Abstract: In many statistical applications, data are collected over time and are likely correlated. In this paper, we investigate how to incorporate the correlation information into local linear regression. Under the assumption that the error process is an auto-regressive process, a new estimation procedure is proposed for nonparametric regression by using the local linear regression method and profile least squares techniques. We further propose the SCAD-penalized profile least squares method to determine the order of the auto-regressive process. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedure and to compare the proposed procedures with the existing one. From our empirical studies, the newly proposed procedures can dramatically improve the accuracy of naive local linear regression with a working-independence error structure. We illustrate the proposed methodology by an analysis of a real data set.
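For reference, the sketch below (names, kernel, and bandwidth are our choices) implements the naive working-independence local linear smoother that serves as the baseline in the abstract; the profile least squares step that exploits the auto-regressive error structure is what the paper adds on top of this.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Naive (working-independence) local linear estimate of m(x0)
    with a Gaussian kernel and bandwidth h."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)          # kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # local linear design
    W = np.diag(k)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                                  # intercept = fitted m(x0)

# Toy example with AR(1) errors, which the naive fit ignores.
rng = np.random.default_rng(1)
n = 300
t = np.sort(rng.uniform(0, 1, n))
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.6 * e[i - 1] + 0.3 * rng.standard_normal()
y = np.sin(2 * np.pi * t) + e
fit = np.array([local_linear(x0, t, y, h=0.05) for x0 in t])
```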