<div style="text-align:justify;"> With the rapid development of information technology, data from a wide variety of fields have become extremely large. In many datasets the number of features far exceeds the sample size; such data are called high-dimensional data. In statistics, variable selection approaches are required to extract the useful information from high-dimensional data. The most popular approach is to add a penalty function, coupled with a tuning parameter, to the log-likelihood function; this is called the penalized likelihood method. However, almost all penalized likelihood approaches consider only noise accumulation and spurious correlation while ignoring endogeneity, which also appears frequently in high-dimensional settings. In this paper, we explore the causes of endogeneity and its influence on penalized likelihood approaches. Simulations based on five classical penalized approaches are provided to demonstrate their inconsistency under endogeneity. The results show that the positive selection rate of all five approaches increases gradually, but the false selection rate does not decrease consistently when endogenous variables exist; that is, the methods do not satisfy selection consistency. </div>
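The penalized likelihood idea and the two selection rates tracked in the simulations can be sketched in a few lines. The sketch below is an illustration only, not the paper's experiment (in particular, no endogeneity is simulated): it fits a lasso, the simplest penalized method, by cyclic coordinate descent and computes the positive and false selection rates against a known support.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent (assumes roughly standardized columns)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j removed from the fit
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_sq[j]
    return beta

def selection_rates(beta_hat, true_support):
    """Positive selection rate (recall of true variables) and false selection rate."""
    selected = set(np.flatnonzero(np.abs(beta_hat) > 1e-8))
    psr = len(selected & true_support) / len(true_support)
    fsr = len(selected - true_support) / max(len(selected), 1)
    return psr, fsr

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # true support is {0, 1, 2}
y = X @ beta_true + rng.standard_normal(n)
psr, fsr = selection_rates(lasso_cd(X, y, lam=0.1), {0, 1, 2})
```

With exogenous Gaussian features, as here, the lasso recovers the true support easily; the paper's point is that these rates deteriorate once endogenous variables are added.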
This paper considers variable selection for moment restriction models. We propose a penalized empirical likelihood (PEL) approach that has desirable asymptotic properties comparable to those of the penalized likelihood approach, which relies on a correct parametric likelihood specification. In addition to being consistent and having the oracle property, PEL admits inference on parameters without having to estimate the covariance of their estimators. An approximate algorithm, along with a consistent BIC-type criterion for selecting the tuning parameters, is provided for PEL. The proposed algorithm enjoys considerable computational efficiency and overcomes the drawback of the local quadratic approximation of nonconcave penalties. Simulation studies evaluating and comparing the performance of our method with that of existing ones show that PEL is competitive and robust. The proposed method is illustrated with two real examples.
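The empirical likelihood machinery behind PEL can be illustrated on the simplest moment restriction, E(X − μ) = 0. The sketch below is the standard Owen-style computation, not the paper's penalized algorithm: the observation weights are profiled out through the dual problem, reducing everything to a one-dimensional root find for the Lagrange multiplier. By the empirical likelihood analogue of Wilks' theorem, −2 log R(μ) is asymptotically chi-square with one degree of freedom at the true mean.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(x, mu):
    """-2 log empirical likelihood ratio for the mean of a sample x.

    Solves the dual equation sum((x_i - mu) / (1 + lam*(x_i - mu))) = 0
    for the Lagrange multiplier lam, then evaluates
    -2 log R(mu) = 2 * sum(log(1 + lam*(x_i - mu))).
    """
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf  # mu lies outside the convex hull of the data
    # lam must keep every 1 + lam*z_i strictly positive
    lo = (-1.0 + 1e-10) / z.max()
    hi = (-1.0 + 1e-10) / z.min()
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))
```

PEL adds a penalty on the parameter vector to this profiled objective; the point of the sketch is only that no covariance estimate is needed to calibrate the ratio.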
A consistent test via the partial penalized empirical likelihood approach for parametric hypothesis testing in the sparse case, called the partial penalized empirical likelihood ratio (PPELR) test, is proposed in this paper. Our results are demonstrated for the mean vector in multivariate analysis and for regression coefficients in linear models, respectively. We establish its asymptotic distributions under the null hypothesis and under local alternatives of order n^(-1/2) under regularity conditions. Meanwhile, the oracle property of the partial penalized empirical likelihood estimator also holds. The proposed PPELR test statistic performs as well as the ordinary empirical likelihood ratio test statistic and outperforms the full penalized empirical likelihood ratio test statistic in terms of size and power when the null parameter is zero. Moreover, the proposed method yields variable selection as well as p-values for testing. Numerical simulations and an analysis of prostate cancer data confirm our theoretical findings and demonstrate the promising performance of the proposed method in hypothesis testing and variable selection.
The purpose of this paper is twofold. First, we investigate estimation for varying coefficient partially linear models in which covariates in the nonparametric part are measured with errors. As there may be spurious covariates in the linear part, a penalized profile least squares estimation is suggested with the assistance of the smoothly clipped absolute deviation (SCAD) penalty. However, the estimator is often biased due to the existence of measurement errors; a bias correction is therefore proposed, and estimation consistency with the oracle property is proved. Second, based on the estimator, a test statistic is constructed to check a linear hypothesis on the parameters, and its asymptotic properties are studied. We prove that the existence of measurement errors makes the limiting null distribution intractable, requiring a Monte Carlo approximation, whereas in the absence of such errors the limit is chi-square. Furthermore, confidence regions for the parameter of interest can also be constructed. Simulation studies and a real data example are conducted to examine the performance of our estimators and test statistic.
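The smoothly clipped absolute deviation penalty used here has a closed form worth recording: it is linear near zero, quadratic in the middle, and constant for large coefficients, which is what gives near-unbiasedness for strong signals. A direct transcription, with the conventional default a = 3.7 of Fan and Li (2001):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied elementwise to |t|.

    p(t) = lam*|t|                                  for |t| <= lam
         = -(t^2 - 2*a*lam*|t| + lam^2)/(2*(a-1))   for lam < |t| <= a*lam
         = (a+1)*lam^2 / 2                          for |t| > a*lam
    """
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    return np.where(small, lam * t,
           np.where(mid, -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))
```

The three pieces meet continuously at |t| = λ and |t| = aλ, so the penalty is continuous and its derivative vanishes for large |t|; that flat tail is what distinguishes SCAD from the lasso's constant shrinkage.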
Two-parameter gamma distributions are widely used in reliability theory, lifetime data analysis, financial statistics, and other areas. Finite mixtures of gamma distributions are their natural extensions, and they are particularly useful when the population is suspected of heterogeneity. These distributions are successfully employed in various applications, but many researchers falsely believe that the maximum likelihood estimator of the mixing distribution is consistent. As with finite mixtures of normal distributions, the likelihood function under finite gamma mixtures is unbounded. Because of this, each observed value leads to a global maximum that is irrelevant to the true distribution. We apply a seemingly negligible penalty to the likelihood according to the shape parameters in the fitted model. We show that this penalty restores the consistency of the likelihood-based estimator of the mixing distribution under finite gamma mixture models. We present simulation results to validate the consistency conclusion, and we give an example to illustrate the key points.
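The penalized objective can be sketched concretely. The penalty form below, −c·Σ(αₖ + 1/αₖ) on the shape parameters, is an illustrative choice (not necessarily the exact penalty in the paper): it diverges as any shape parameter goes to 0 or to infinity, which is what blocks the degenerate spikes at individual observations that make the unpenalized MLE inconsistent.

```python
import numpy as np
from scipy.stats import gamma

def penalized_mixture_loglik(x, weights, shapes, scales, c=1.0):
    """Penalized log-likelihood of a finite gamma mixture.

    Illustrative penalty: -c * sum(shape_k + 1/shape_k), which punishes
    both vanishing and exploding shape parameters; c = 0 recovers the
    ordinary (unbounded) log-likelihood.
    """
    shapes = np.asarray(shapes, dtype=float)
    # mixture density at each observation: sum_k w_k * Gamma(a_k, scale_k).pdf(x)
    dens = sum(w * gamma.pdf(x, a=a, scale=s)
               for w, a, s in zip(weights, shapes, scales))
    return np.sum(np.log(dens)) - c * np.sum(shapes + 1.0 / shapes)
```

For a fixed, well-behaved fit the penalty is a small constant offset ("seemingly negligible"), but it dominates along the degenerate paths where a component's shape parameter diverges.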
Aim: Site occupancy probabilities of target species are commonly used in various ecological studies, e.g. to monitor current status and trends in biodiversity. Detection error introduces bias into estimators of site occupancy. Existing methods for estimating occupancy probability in the presence of detection error use replicate surveys. These methods assume population closure, i.e. that the site occupancy status remains constant across surveys, and independence between surveys. We present an approach for estimating site occupancy probability in the presence of detection error that requires only a single survey and does not require the assumption of population closure or independence. In place of the closure assumption, this method requires covariates that affect detection and occupancy. Methods: A penalized maximum likelihood method was used to estimate the parameters. Estimability of the parameters was checked using data cloning. A parametric bootstrapping method was used for computing confidence intervals. Important findings: The single-survey approach facilitates analysis of historical datasets where replicate surveys are unavailable, of situations where replicate surveys are expensive to conduct, and of cases where the assumptions of closure or independence are not met. This method saves significant amounts of time, energy and money in ecological surveys without sacrificing statistical validity. Further, we show that occupancy and habitat suitability are not synonymous, and we suggest a method to estimate habitat suitability using single-survey data.
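The single-survey model class can be written down in a few lines. The sketch below is illustrative (logistic links and variable names are assumptions, not the paper's code): a site is recorded as detected only if it is occupied (probability ψ) and the species is detected there (probability p), and each probability is driven by its own covariates, which is what makes the product ψ·p identifiable from one survey. The penalized MLE would add a penalty on the coefficients to this log-likelihood.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def single_survey_loglik(y, x_occ, w_det, beta, gamma_):
    """Bernoulli log-likelihood for single-survey occupancy with detection error.

    y      : 0/1 detection record per site
    x_occ  : covariates driving occupancy, w_det: covariates driving detection
    """
    psi = sigmoid(x_occ @ beta)    # P(site i is occupied)
    p = sigmoid(w_det @ gamma_)    # P(detected | occupied)
    q = psi * p                    # P(y_i = 1): detected only if occupied
    return np.sum(y * np.log(q) + (1 - y) * np.log1p(-q))
```

Without distinct covariates, ψ and p enter only through their product and cannot be separated; that is the modelling requirement the abstract states in place of the closure assumption.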
Feature selection characterized by relatively small sample sizes and extremely high-dimensional feature spaces is common in many areas of contemporary statistics. The high dimensionality of the feature space causes serious difficulties: (i) the sample correlations between features become high even if the features are stochastically independent; (ii) the computation becomes intractable. These difficulties make conventional approaches either inapplicable or inefficient. Reducing the dimensionality of the feature space and then applying low-dimensional approaches appears to be the only feasible way to tackle the problem. Along this line, we develop in this article a tournament screening cum EBIC approach for feature selection with a high-dimensional feature space. The procedure of tournament screening mimics that of a tournament. It is shown theoretically that tournament screening has the sure screening property, a necessary property that should be satisfied by any valid screening procedure. It is demonstrated by numerical studies that the tournament screening cum EBIC approach enjoys desirable properties, such as a higher positive selection rate and a lower false discovery rate than other approaches.
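A stripped-down version of the tournament idea can be sketched as follows. The scoring rule (absolute marginal correlation with the response), the group size, and the winners-per-group count here are illustrative choices, not the paper's tuning; the real procedure scores candidates within groups more carefully (with EBIC), but the bracket structure is the same: features compete in groups, and only group winners advance to the next round.

```python
import numpy as np

def tournament_screen(X, y, keep=10, group_size=20, winners_per_group=5):
    """Toy tournament screening: iteratively retain group winners until
    at most `keep` features remain. Scores are |marginal correlation| with y."""
    n, p = X.shape
    score = np.abs(np.corrcoef(X.T, y)[-1, :-1])   # |corr(x_j, y)| for each j
    alive = np.arange(p)
    while alive.size > keep:
        survivors = []
        for start in range(0, alive.size, group_size):
            group = alive[start:start + group_size]
            top = group[np.argsort(score[group])[::-1][:winners_per_group]]
            survivors.append(top)
        new_alive = np.concatenate(survivors)
        if new_alive.size == alive.size:           # cannot shrink further
            break
        alive = new_alive
    return np.sort(alive[np.argsort(score[alive])[::-1][:keep]])

rng = np.random.default_rng(1)
n, p = 100, 100
X = rng.standard_normal((n, p))
y = 5.0 * X[:, 3] - 4.0 * X[:, 47] + rng.standard_normal(n)
sel = tournament_screen(X, y, keep=10, group_size=20, winners_per_group=5)
```

Each round only ever compares `group_size` features at a time, which is what keeps the memory and computation low-dimensional even when p is huge.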
The problem of estimating high-dimensional Gaussian graphical models has gained much attention in recent years. Most existing methods can be considered one-step approaches, being either regression-based or likelihood-based. In this paper, we propose a two-step method for estimating the high-dimensional Gaussian graphical model. Specifically, the first step serves as a screening step, in which many entries of the concentration matrix are identified as zeros and thus removed from further consideration. In the second step, we focus on the remaining entries and perform selection and estimation for the nonzero entries of the concentration matrix. Since the dimension of the parameter space is effectively reduced by the screening step, the estimation accuracy of the estimated concentration matrix can potentially be improved. We show that the proposed method enjoys desirable asymptotic properties. Numerical comparisons with several existing methods indicate that the proposed method works well. We also apply the proposed method to a breast cancer microarray data set and obtain some biologically meaningful results.
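The screening step can be imitated on a toy scale when n > p by thresholding sample partial correlations, which are just rescaled off-diagonal entries of the inverse sample covariance. This is an illustrative stand-in for the paper's screening rule; the second step, re-estimating only the surviving entries, is not shown here.

```python
import numpy as np

def screen_concentration(X, tau):
    """Step-1 sketch: screen entries of the concentration (inverse covariance)
    matrix whose partial correlation falls below tau. Requires n > p so the
    sample covariance is invertible. Returns the retained support and the
    screened concentration matrix; only retained entries would be
    re-estimated in step 2."""
    S = np.cov(X, rowvar=False)
    K = np.linalg.inv(S)                   # sample concentration matrix
    d = np.sqrt(np.diag(K))
    partial_corr = -K / np.outer(d, d)     # partial corr of (i, j) given the rest
    np.fill_diagonal(partial_corr, 1.0)
    keep = np.abs(partial_corr) >= tau
    return keep, K * keep
```

For p comparable to or larger than n the sample covariance is singular and this naive inverse breaks down, which is exactly why high-dimensional screening rules need more care than this sketch.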
The present paper proposes a semiparametric reproductive dispersion nonlinear model (SRDNM), which is an extension of nonlinear reproductive dispersion models and semiparametric regression models. Maximum penalized likelihood estimates (MPLEs) of the unknown parameters and nonparametric functions in the SRDNM are presented. Assessment of local influence under various perturbation schemes is investigated, and some local influence diagnostics are given. A simulation study and a real example are used to illustrate the proposed methodologies.
The smooth integration of counting and absolute deviation (SICA) penalized variable selection procedure for high-dimensional linear regression models was proposed by Lv and Fan (2009). In this article, we extend their idea to Cox's proportional hazards (PH) model by using a penalized log partial likelihood with the SICA penalty. The number of regression coefficients is allowed to grow with the sample size. Based on an approximation to the inverse of the Hessian matrix, the proposed method can easily be carried out with the smoothing quasi-Newton (SQN) algorithm. Under appropriate sparsity conditions, we show that the resulting estimator of the regression coefficients possesses the oracle property. We perform an extensive simulation study to compare our approach with other methods and illustrate it on the well-known PBC data for predicting survival from risk factors.
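The SICA penalty itself is a one-liner, p(t) = λ(a+1)|t|/(a+|t|), which smoothly bridges the L0 penalty λ·1{t ≠ 0} (as a → 0+) and the L1 penalty λ|t| (as a → ∞):

```python
import numpy as np

def sica_penalty(t, lam, a):
    """SICA penalty p(t) = lam * (a+1)|t| / (a + |t|) of Lv and Fan (2009).

    Small a pushes the penalty toward hard thresholding (L0); large a
    recovers lasso-style soft shrinkage (L1)."""
    t = np.abs(np.asarray(t, dtype=float))
    return lam * (a + 1.0) * t / (a + t)
```

The tuning parameter a thus trades off sparsity (L0-like behavior, less bias on large coefficients) against convexity-friendliness (L1-like behavior), which is why the penalized partial likelihood here is nonconcave and needs an algorithm like SQN.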
Variable selection is an important research topic in modern statistics. Traditional variable selection methods can only select the mean model and/or the variance model, and cannot be used to select joint mean, variance and skewness models. In this paper, the authors propose joint location, scale and skewness models for data sets involving asymmetric outcomes, and consider the problem of variable selection for the proposed models. Based on an efficient unified penalized likelihood method, the consistency and the oracle property of the penalized estimators are established. The authors develop a variable selection procedure for the proposed joint models, which can simultaneously and efficiently estimate and select important variables in the location, scale and skewness models. Simulation studies and a body mass index data analysis are presented to illustrate the proposed methods.
The integer-valued generalized autoregressive conditional heteroskedastic (INGARCH) model is often used to describe count data in biostatistics, such as the number of people infected with dengue fever, daily epileptic seizure counts of a patient, and the number of cases of campylobacteriosis infections. Since the structure of such data is generally high-order and sparse, studies of order shrinkage and selection for the model have attracted much attention. In this paper, we propose a penalized conditional maximum likelihood (PCML) method to solve this problem. The PCML method can effectively select significant orders and estimate the parameters simultaneously. Some simulations and a real data analysis are carried out to illustrate the usefulness of our method.
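For concreteness, the conditional likelihood that PCML penalizes can be written down for the first-order case. This is a generic Poisson INGARCH(1,1) sketch, Y_t | past ~ Poisson(λ_t) with λ_t = ω + α·y_{t−1} + β·λ_{t−1}; initializing λ at the stationary mean is a common convention, not necessarily the paper's choice, and the order-selection penalty (on the higher-order α and β coefficients) is not shown.

```python
import numpy as np
from scipy.special import gammaln

def ingarch11_loglik(y, omega, alpha, beta):
    """Conditional log-likelihood of a Poisson INGARCH(1,1) model.

    lam_t = omega + alpha * y_{t-1} + beta * lam_{t-1};
    PCML would maximize this minus a penalty shrinking unneeded orders."""
    y = np.asarray(y, dtype=float)
    lam = np.empty_like(y)
    # start the recursion at the stationary mean omega / (1 - alpha - beta)
    lam[0] = omega / max(1.0 - alpha - beta, 1e-8)
    for t in range(1, len(y)):
        lam[t] = omega + alpha * y[t - 1] + beta * lam[t - 1]
    # Poisson log-pmf: y*log(lam) - lam - log(y!)
    return np.sum(y * np.log(lam) - lam - gammaln(y + 1.0))
```

An INGARCH(p, q) model simply carries p lagged counts and q lagged intensities in the recursion; order selection amounts to shrinking most of those lag coefficients to exactly zero.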
Funding: supported in part by the National Natural Science Foundation of China (Grant No. 11071045) and the Shanghai Leading Academic Discipline Project (Grant No. B210).
Funding: supported in part by the National Natural Science Foundation of China (Grant Nos. 11471223, 11231010, 11028103, 11071022, 11501586 and 71420107025), a key project of the Beijing Municipal Education Commission (Grant No. KZ201410028030), and the Foundation of the Beijing Center for Mathematics and Information Interdisciplinary Sciences.
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 11401006, 11671299 and 11671042), a grant from the University Grants Council of Hong Kong, the China Postdoctoral Science Foundation (Grant No. 2017M611083), and the National Statistical Science Research Program of China (Grant No. 2015LY55).
Funding: supported by grants from the One Thousand Talents program at Yunnan University and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2014-03743).
Funding: Natural Sciences and Engineering Research Council of Canada; Alberta Biodiversity Monitoring Initiative; Environment Canada.
Funding: supported by the Singapore Ministry of Education's ACRF Tier 1 (Grant No. R-155-000-065-112) and by the Natural Sciences and Engineering Research Council of Canada and MITACS, Canada.
Funding: National Natural Science Foundation of China (Grant No. 11671059).
Funding: supported by the National Natural Science Foundation of China (Nos. 10961026 and 10761011) and the National Social Science Foundation of China (No. 10BTJ001).
Funding: supported by the National Natural Science Foundation of China (No. 11171263).
Funding: supported by the National Natural Science Foundation of China under Grant Nos. 11261025 and 11561075, the Natural Science Foundation of Yunnan Province under Grant No. 2016FB005, and the Program for Middle-aged Backbone Teachers, Yunnan University.