For generalized linear models (GLM) in which the regressors are stochastic and have different distributions, the asymptotic properties of the maximum likelihood estimate (MLE) $\hat\beta_n$ of the parameters are studied. Under reasonable conditions, we prove the weak consistency, strong consistency, and asymptotic normality of $\hat\beta_n$.
For the generalized linear model (GLM), under some conditions, including correct specification of the expectation, it is shown that the quasi-maximum likelihood estimate (QMLE) of the parameter vector is asymptotically normal. It is also shown that the asymptotic covariance matrix of the QMLE reaches its minimum (in the positive-definite sense) when the specification of the covariance matrix is correct.
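For orientation, the efficiency statement can be written in the usual sandwich form. This is a sketch in standard quasi-likelihood notation that is assumed here, not taken from the paper: $D_i = \partial\mu_i/\partial\beta'$ is the derivative of the mean, $V_i$ is the working (specified) covariance of $y_i$, and $\Sigma_i$ is its true covariance.

```latex
\[
\operatorname{Avar}(\hat\beta_n)
= \Bigl(\sum_i D_i' V_i^{-1} D_i\Bigr)^{-1}
  \Bigl(\sum_i D_i' V_i^{-1} \Sigma_i V_i^{-1} D_i\Bigr)
  \Bigl(\sum_i D_i' V_i^{-1} D_i\Bigr)^{-1}
\;\succeq\;
\Bigl(\sum_i D_i' \Sigma_i^{-1} D_i\Bigr)^{-1}
\]
```

The lower bound on the right is attained when $V_i = \Sigma_i$ for every $i$, which is the case of a correctly specified covariance matrix.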
In generalized linear models with fixed design, under the assumption $\underline{\lambda}_n \to \infty$ and other regularity conditions, the asymptotic normality of the maximum quasi-likelihood estimator $\hat\beta_n$, which is the root of the quasi-likelihood equation with natural link function $\sum_{i=1}^n X_i (y_i - \mu(X_i'\beta)) = 0$, is obtained, where $\underline{\lambda}_n$ denotes the minimum eigenvalue of $\sum_{i=1}^n X_i X_i'$, $X_i$ are bounded $p \times q$ regressors, and $y_i$ are $q \times 1$ responses.
Epidemiologic studies use outcome-dependent sampling (ODS) schemes where, in addition to a simple random sample, a number of supplement samples are collected based on the outcome variable. An ODS scheme is a cost-effective way to improve study efficiency. We develop a maximum semiparametric empirical likelihood estimation (MSELE) for data from a two-stage ODS scheme under the assumption that, given the covariate, the outcome follows a general linear model. The information in both the validation samples and the nonvalidation samples is used. Moreover, we prove the asymptotic properties of the proposed MSELE.
We study the law of the iterated logarithm (LIL) for the maximum likelihood estimation of the parameters (as a convex optimization problem) in generalized linear models with independent or weakly dependent (ρ-mixing) responses under mild conditions. The LIL is useful for deriving asymptotic bounds on the discrepancy between the empirical process of the log-likelihood function and the true log-likelihood. The strong consistency of some penalized likelihood-based model selection criteria can be shown as an application of the LIL. Under some regularity conditions, the model selection criterion selects the simplest correct model almost surely when the penalty term increases with the model dimension and has an order higher than O(log log n) but lower than O(n). Simulation studies are implemented to verify the selection consistency of the Bayesian information criterion.
The paper studies a generalized linear model (GLM) $y_t = h(x_t^T \beta) + \varepsilon_t$, $t = 1, 2, \ldots, n$, where $\varepsilon_1 = \eta_1$, $\varepsilon_t = \rho\,\varepsilon_{t-1} + \eta_t$, $t = 2, 3, \ldots, n$, $h$ is a continuously differentiable function, and the $\eta_t$ are independent and identically distributed random errors with zero mean and finite variance $\sigma^2$. Firstly, the quasi-maximum likelihood (QML) estimators of $\beta$, $\rho$ and $\sigma^2$ are given. Secondly, under mild conditions, the asymptotic properties (including the existence, weak consistency and asymptotic distribution) of the QML estimators are investigated. Lastly, the validity of the method is illustrated by a simulation example.
In this paper, for generalized linear models (GLMs) with a diverging number of covariates, the asymptotic properties of maximum quasi-likelihood estimators (MQLEs) are developed under some regularity conditions. The existence, weak convergence, rate of convergence, and asymptotic normality of linear combinations of MQLEs, as well as the asymptotic distribution of test statistics for a single linear hypothesis, are presented. The results are illustrated by Monte Carlo simulations.
Fragmentary data is becoming more and more popular in many areas, which brings big challenges to researchers and data analysts. Most existing methods dealing with fragmentary data consider a continuous response, while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback-Leibler loss in the completed cases, and its asymptotic optimality is established. Empirical evidence from a simulation study and a real data analysis about Alzheimer disease is presented.
For generalized linear models (GLM), in the case that the regressors are stochastic and have different distributions and the observations of the responses may have different dimensionality, the asymptotic theory of the maximum likelihood estimate (MLE) of the parameters is studied under the assumption of a non-natural link function.
This paper considers iterative sequential lasso (ISLasso) variable selection for generalized linear models with ultrahigh-dimensional feature space. ISLasso selects features sequentially and iteratively through estimated parameters for the second-order approximation of the likelihood function, where the features selected depend on regulatory parameters. The procedure stops when the extended BIC (EBIC) reaches a minimum. A simulation study demonstrates that the new method is a desirable approach compared with other methods.
Abstract: We study the quasi-likelihood equation in generalized linear models (GLM) with adaptive design, $\sum_{i=1}^n x_i (y_i - h(x_i'\beta)) = 0$, where $y_i$ is a $q$-vector and $x_i$ is a $p \times q$ random matrix. Under some assumptions, it is shown that the quasi-likelihood equation for the GLM has a solution which is asymptotically normal.
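To make the estimating equation concrete, the following is a minimal sketch of solving $\sum_i x_i (y_i - h(x_i'\beta)) = 0$ by Newton iterations. It rests on simplifying assumptions that are not from the paper: a scalar response (q = 1), h = exp, a fixed toy design whose first column is an intercept, and made-up data. The helper names are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def score(beta, xs, ys, h=math.exp):
    """Left-hand side of the equation: sum_i x_i * (y_i - h(x_i' beta))."""
    out = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        resid = y - h(sum(xj * bj for xj, bj in zip(x, beta)))
        for j, xj in enumerate(x):
            out[j] += xj * resid
    return out

def solve_quasi_likelihood(xs, ys, iters=50, tol=1e-10):
    """Newton iterations for h = exp (assumed); h' is then exp as well."""
    p = len(xs[0])
    # Start at the intercept-only fit; assumes the first column of x is 1.
    beta = [math.log(sum(ys) / len(ys))] + [0.0] * (p - 1)
    for _ in range(iters):
        s = score(beta, xs, ys)
        if max(abs(v) for v in s) < tol:
            break
        info = [[0.0] * p for _ in range(p)]
        for x in xs:
            w = math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
            for j in range(p):
                for k in range(p):
                    info[j][k] += w * x[j] * x[k]
        step = solve_linear(info, s)
        beta = [b + d for b, d in zip(beta, step)]
    return beta

# Toy data (made up for illustration): intercept plus one regressor.
xs = [(1.0, t / 10.0) for t in range(10)]
ys = [1, 0, 2, 1, 3, 2, 4, 3, 5, 6]
beta_hat = solve_quasi_likelihood(xs, ys)
print(beta_hat, score(beta_hat, xs, ys))
```

The printed score vector should be numerically zero at the returned root. The asymptotic normality result in the abstract concerns such a root when the design is adaptive, which this fixed-design sketch does not attempt to reproduce.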
Abstract: In a linear regression model, testing for uniformity of the variance of the residuals is an integral part of statistical analysis. This crucial assumption requires confirmation by statistical tests, mostly before carrying out the Analysis of Variance (ANOVA) technique. Many researchers have published papers on tests for detecting variance heterogeneity in multiple linear regression models, and many comparisons of these tests have been made using criteria such as bias, error rate and power. Besides comparisons, modifications of some of these tests have been reported in the literature in recent years. In the multiple linear regression setting, little work has been done on comparing selected tests of the homoscedasticity assumption when linear, quadratic, square-root, and exponential forms of heteroscedasticity are injected into the residuals. The present study therefore works extensively on these areas with a view to filling the gap. The paper aims to provide a comprehensive comparative analysis of the asymptotic behaviour of selected tests of the homoscedasticity assumption, in order to find the best test for detecting heteroscedasticity in a multiple linear regression scenario with varying variances and levels of significance. Several tests for homoscedasticity are available in the literature, but only nine were considered for this study: the Breusch-Godfrey test, studentized Breusch-Pagan test, White's test, Nonconstant Variance Score test, Park test, Spearman Rank test, Glejser test, Goldfeld-Quandt test, and Harrison-McCabe test. Their asymptotic behaviours were examined by Monte Carlo simulations.
Four different forms of heteroscedastic structures, exponential and linear (the latter generalizing the square-root and quadratic structures), were injected into the residual part of the multiple linear regression models at sample sizes of 30, 50, 100, 200, 500 and 1000. The performances were evaluated within the R environment. Among other findings, our investigations revealed that the Glejser and Park tests were the best tests for detecting heteroscedasticity under EHS and LHS, respectively, and that the White and Harrison-McCabe tests were the best tests for checking homoscedasticity under EHS and LHS, respectively, for sample sizes below 50.
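As an illustration of the kind of Monte Carlo comparison described, here is a minimal Python sketch for one of the nine tests, the studentized Breusch-Pagan (Koenker) statistic, in a single-regressor model. The study itself worked in R with multiple regressors; the data-generating process, function names, replication count and the choice of an error standard deviation proportional to x are all assumptions made for this sketch.

```python
import random

def ols_simple(x, y):
    """Least-squares intercept and slope for one regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    a, b = ols_simple(x, y)
    my = sum(y) / len(y)
    ss_res = sum((v - a - b * u) ** 2 for u, v in zip(x, y))
    ss_tot = sum((v - my) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def studentized_bp(x, y):
    """n * R^2 from regressing the squared OLS residuals on x."""
    a, b = ols_simple(x, y)
    u2 = [(v - a - b * u) ** 2 for u, v in zip(x, y)]
    return len(x) * r_squared(x, u2)

CHI2_1_CRIT_5PCT = 3.841  # upper 5% point of chi-square with 1 d.f.

def rejection_rate(n, error_sd, reps=200, seed=0):
    """Share of simulated samples in which the test rejects at the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.uniform(1.0, 5.0) for _ in range(n)]
        y = [1.0 + 2.0 * u + rng.gauss(0.0, error_sd(u)) for u in x]
        hits += studentized_bp(x, y) > CHI2_1_CRIT_5PCT
    return hits / reps

# Homoscedastic errors estimate the size; sd proportional to x estimates power.
size = rejection_rate(100, lambda u: 1.0)
power = rejection_rate(100, lambda u: u)
print(size, power)
```

Repeating the two calls over the sample sizes listed in the abstract, and swapping in other test statistics, would give a table of empirical size and power of the sort the study compares.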
Abstract: Changes in climate factors such as temperature, rainfall, humidity, and wind speed are natural processes that could significantly impact the incidence of infectious diseases. Dengue is a widespread disease that has often been documented in connection with the impact of climate change. It has become a significant concern, especially for the Malaysian health authorities, due to its rapid spread and serious effects, leading to loss of life. Several statistical models have been used to identify climatic factors associated with infectious diseases. However, because of the complex and nonlinear interactions between climate variables and disease components, modelling their relationships has become the main challenge in climate-health studies. Hence, this study proposed a Generalized Linear Model (GLM) via Poisson and Negative Binomial to examine the effects of the climate factors on dengue incidence while accounting for the collinearity between variables. This study focuses on the dengue hot spots in Malaysia for the year 2014. Since collinearity exists between climate factors, the analysis was done separately using three different models. The study revealed that rainfall, temperature, humidity, and wind speed were statistically significantly associated with dengue incidence, and most of them showed a negative effect. Of all variables, wind speed has the most significant impact on dengue incidence. Given these relationships, policymakers should formulate better plans so that precautionary steps can be taken to reduce the spread of dengue.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12001277, 12271046 and 12131006).
Abstract: Penalized variable selection methods are often used to select the relevant covariates and estimate the unknown regression coefficients simultaneously, but these existing methods may fail to be consistent in settings with highly correlated covariates. In this paper, the semi-standard partial covariance (SPAC) method with Lasso penalty is proposed to study the generalized linear model with highly correlated covariates, and the consistency of the estimation and of the variable selection is shown in high-dimensional settings under some regularity conditions. Simulation studies and an analysis of a colon tumor dataset are carried out to show that the proposed method handles highly correlated covariates better than the traditional penalized variable selection methods.
Abstract: Background: Abiotic factors exert different impacts on the abundance of individual tree species in the forest, but little is known about the impact of abiotic factors on individual plants, particularly in a tropical forest. This study identified the impact of abiotic factors on the abundances of Podocarpus falcatus, Croton macrostachyus, Celtis africana, Syzygium guineense, Olea capensis, Diospyros abyssinica, Filicium decipiens, and Coffea arabica. A systematic sample design was used in the Harana forest, where 1122 plots were established to collect the abundance of species. Random forest (RF), artificial neural network (ANN), and generalized linear model (GLM) models were used to examine the impacts of topographic, climatic, and edaphic factors on the log abundances of woody species. The RF model was used to predict the spatial distribution maps of the log abundances of each species. Results: The RF model achieved a better prediction accuracy, with R² = 71% and a mean squared error (MSE) of 0.28 for Filicium decipiens. The RF model identified elevation, temperature, precipitation, clay, and potassium as the top variables influencing the abundance of species. The ANN model showed that elevation had a negative impact on the log abundances of all woody species. The GLM reaffirmed the negative impact of elevation on all woody species except for the log abundances of Syzygium guineense and Olea capensis. The ANN model indicated that soil organic matter (SOM) could positively affect the log abundances of all woody species. The GLM showed a similar positive impact of SOM, except for a negative impact on the log abundance of Celtis africana at p < 0.05. The spatial distributions of the log abundances of Coffea arabica, Filicium decipiens, and Celtis africana were confined to the eastern parts, while the log abundance of Olea capensis was limited to the western parts. Conclusions: The impacts of abiotic factors on the abundance of woody species may vary with species. This ecological understanding could guide the restoration activity of individual species. The prediction maps in this study provide spatially explicit information which can enhance the successful implementation of species conservation.
Funding: Supported by the Natural Science Foundation of China under Grant Nos. 12031016, 11061002, 11801033, 12071014 and 12131001, the National Social Science Fund of China under Grant No. 19ZDA121, and the Natural Science Foundation of Guangxi under Grant Nos. 2015GXNSFAA139006 and LMEQF.
Abstract: Generalized linear models are usually adopted to model discrete or nonnegative responses. In this paper, empirical likelihood inference for fixed-design generalized linear models with longitudinal data is investigated. Under some mild conditions, the consistency and asymptotic normality of the maximum empirical likelihood estimator are established, and the asymptotic χ² distribution of the empirical log-likelihood ratio is also obtained. Compared with the existing results, the new conditions are weaker and easier to verify. Some simulations are presented to illustrate these asymptotic properties.
Funding: This work was supported by the National Natural Science Foundation of China.
Abstract: Under the assumption that the expectation of the response variable in the generalized linear model (GLM) is correctly specified, together with some other smoothness conditions, it is shown that with probability one the quasi-likelihood equation for the GLM has a solution when the sample size n is sufficiently large. The rate at which this solution tends to the true value is determined. In an important special case, this rate is the same as that specified in the LIL for iid partial sums and thus cannot be improved.
Funding: Supported by the Chinese 111 Project B14019, the US National Science Foundation under Grant Nos. DMS-1305474 and DMS-1612873, and the US National Institutes of Health Award UL1TR001412.
Abstract: The generalized linear model is an indispensable tool for analyzing non-Gaussian response data, with both canonical and non-canonical link functions comprehensively used. When missing values are present, many existing methods in the literature depend heavily on an unverifiable assumption about the missing data mechanism, and they fail when the assumption is violated. This paper proposes a missing data mechanism that is as generally applicable as possible, which includes both ignorable and nonignorable missing data cases, as well as missing values in both the response and the covariates. Under this general missing data mechanism, the authors adopt an approximate conditional likelihood method to estimate unknown parameters. The authors rigorously establish the regularity conditions under which the unknown parameters are identifiable under the approximate conditional likelihood approach. For parameters that are identifiable, the authors prove the asymptotic normality of the estimators obtained by maximizing the approximate conditional likelihood. Some simulation studies are conducted to evaluate the finite-sample performance of the proposed estimators as well as estimators from some existing methods. Finally, the authors present a biomarker analysis in a prostate cancer study to illustrate the proposed method.
Funding: Supported by the President Foundation (Grant No. Y1050) and the Scientific Research Foundation (Grant No. KYQD200502) of GUCAS.
Abstract: In this paper, we explore some weak consistency properties of quasi-maximum likelihood estimates (QMLE) concerning the quasi-likelihood equation $\sum_{i=1}^n X_i (y_i - \mu(X_i'\beta)) = 0$ for the univariate generalized linear model $E(y \mid X) = \mu(X'\beta)$. Given uncorrelated residuals $\{e_i = y_i - \mu(X_i'\beta_0),\ 1 \le i \le n\}$ and other conditions, we prove that $\hat\beta_n - \beta_0 = O_p(\underline{\lambda}_n^{-1/2})$ holds, where $\hat\beta_n$ is a root of the above equation, $\beta_0$ is the true value of the parameter $\beta$, and $\underline{\lambda}_n$ denotes the smallest eigenvalue of the matrix $S_n = \sum_{i=1}^n X_i X_i'$. We also show that the convergence rate above is sharp, provided that the residual sequence is independent and non-asymptotically degenerate, along with other conditions. Moreover, paralleling the elegant result of Drygas (1976) for classical linear regression models, we point out that the necessary condition for the weak consistency of the QMLE is $S_n^{-1} \to 0$ as the sample size $n \to \infty$.
Funding: This research was funded by the National Natural Science Foundation of China (No. 62272124), the National Key Research and Development Program of China (No. 2022YFB2701401), the Guizhou Province Science and Technology Plan Project (Grant No. Qiankehe Platform Talent [2020]5017), the Research Project of Guizhou University for Talent Introduction (No. [2020]61), the Cultivation Project of Guizhou University (No. [2019]56), and the Open Fund of the Key Laboratory of Advanced Manufacturing Technology, Ministry of Education (GZUAMT2021KF[01]).
Abstract: In the assessment of car insurance claims, the claim rate follows a highly skewed probability distribution, which is typically modeled using the Tweedie distribution. The traditional approach to obtaining a Tweedie regression model involves training on a centralized dataset. When the data is provided by multiple parties, training a privacy-preserving Tweedie regression model without exchanging raw data becomes a challenge. To address this issue, this study introduces a novel vertical federated learning-based Tweedie regression algorithm for multi-party auto insurance rate setting in data silos. The algorithm keeps sensitive data local and uses privacy-preserving techniques to perform intersection operations between the two parties holding the data. After determining which entities are shared, the participants train the model locally using the shared entity data to obtain the intermediate parameters of the local generalized linear model. Homomorphic encryption algorithms are introduced to exchange and update the intermediate model parameters, so that the parties collaboratively complete the joint training of the car insurance rate-setting model. Performance tests on two publicly available datasets show that the proposed federated Tweedie regression algorithm can effectively generate Tweedie regression models that leverage the value of data from both parties without exchanging data. The assessment results of the scheme approach those of the Tweedie regression model learned from centralized data, and outperform the Tweedie regression model learned independently by a single party.
Funding: Project supported by the Chinese Natural Science Foundation.
Abstract: For generalized linear models (GLM), in the case that the regressors are stochastic and have different distributions, the asymptotic properties of the maximum likelihood estimate (MLE) $\hat\beta_n$ of the parameters are studied. Under reasonable conditions, we prove the weak and strong consistency and the asymptotic normality of $\hat\beta_n$.
Funding: Project supported by the National Natural Science Foundation of China.
Abstract: For the generalized linear model (GLM), under some conditions, including that the expectation is correctly specified, it is shown that the quasi-maximum likelihood estimate (QMLE) of the parameter vector is asymptotically normal. It is also shown that the asymptotic covariance matrix of the QMLE reaches its minimum (in the positive-definite sense) when the covariance matrix is correctly specified.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 10171094, 10571001, and 30572285, the Foundation of Nanjing Normal University under Grant No. 2005101XGQ2B84, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 07KJD110093, and the Foundation of Anhui University under Grant No. 02203105.
Abstract: In generalized linear models with fixed design, under the assumption $\underline{\lambda}_n \to \infty$ and other regularity conditions, the asymptotic normality of the maximum quasi-likelihood estimator $\hat\beta_n$, which is the root of the quasi-likelihood equation with natural link function $\sum_{i=1}^n X_i (y_i - \mu(X_i'\beta)) = 0$, is obtained, where $\underline{\lambda}_n$ denotes the minimum eigenvalue of $\sum_{i=1}^n X_i X_i'$, $X_i$ are bounded $p \times q$ regressors, and $y_i$ are $q \times 1$ responses.
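As a rough illustration of the quasi-likelihood equation in the abstract above, the following sketch finds a root of ∑ X_i (y_i − μ(X_i′β)) = 0 by Newton iteration in the simplest setting. Everything specific here is an assumption made for illustration and is not taken from the paper: q = 1, p = 2, a logistic mean function μ, simulated Bernoulli responses, and the function names mu, score and solve_quasi_likelihood.

```python
import math
import random


def mu(t):
    """Logistic mean function (assumed here; natural link for Bernoulli data)."""
    return 1.0 / (1.0 + math.exp(-t))


def score(beta, X, y):
    """Left-hand side of the equation: sum_i X_i (y_i - mu(X_i' beta))."""
    s = [0.0, 0.0]
    for xi, yi in zip(X, y):
        r = yi - mu(xi[0] * beta[0] + xi[1] * beta[1])
        s[0] += xi[0] * r
        s[1] += xi[1] * r
    return s


def solve_quasi_likelihood(X, y, iters=50, tol=1e-10):
    """Newton iteration for a root of score(beta) = 0 with p = 2, q = 1."""
    beta = [0.0, 0.0]
    for _ in range(iters):
        s = score(beta, X, y)
        # Entries of H = sum_i mu'(X_i' beta) X_i X_i' (symmetric 2 x 2 matrix).
        a = b = d = 0.0
        for xi in X:
            m = mu(xi[0] * beta[0] + xi[1] * beta[1])
            w = m * (1.0 - m)
            a += w * xi[0] * xi[0]
            b += w * xi[0] * xi[1]
            d += w * xi[1] * xi[1]
        det = a * d - b * b
        # Newton step H^{-1} s, using the closed-form 2 x 2 inverse.
        step = [(d * s[0] - b * s[1]) / det, (a * s[1] - b * s[0]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return beta


random.seed(0)
beta0 = [0.5, -1.0]  # hypothetical true parameter
X = [[1.0, random.uniform(-1.0, 1.0)] for _ in range(2000)]  # bounded regressors
y = [1.0 if random.random() < mu(x[0] * beta0[0] + x[1] * beta0[1]) else 0.0
     for x in X]

beta_hat = solve_quasi_likelihood(X, y)
residual = score(beta_hat, X, y)  # should be numerically zero at the root
```

With bounded regressors and a growing minimum eigenvalue of ∑ X_i X_i′, as in this simulated design, beta_hat should lie close to beta0 for large n, which is the regime the asymptotic normality result addresses.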
Funding: Jie-li DING is supported by the National Natural Science Foundation of China (No. 11101314); Yan-yan LIU is supported by the National Natural Science Foundation of China (No. 11171263, No. 11371299).
Abstract: Epidemiologic studies use outcome-dependent sampling (ODS) schemes in which, in addition to a simple random sample, a number of supplementary samples are collected based on the outcome variable. The ODS scheme is a cost-effective way to improve study efficiency. We develop a maximum semiparametric empirical likelihood estimation (MSELE) for data from a two-stage ODS scheme under the assumption that, given the covariates, the outcome follows a general linear model. The information from both validation samples and nonvalidation samples is used. Moreover, we prove the asymptotic properties of the proposed MSELE.
Abstract: We study the law of the iterated logarithm (LIL) for the maximum likelihood estimation of the parameters (as a convex optimization problem) in generalized linear models with independent or weakly dependent (ρ-mixing) responses under mild conditions. The LIL is useful for deriving asymptotic bounds on the discrepancy between the empirical process of the log-likelihood function and the true log-likelihood. The strong consistency of some penalized likelihood-based model selection criteria can be shown as an application of the LIL. Under some regularity conditions, the model selection criterion selects the simplest correct model almost surely when the penalty term increases with the model dimension and has an order higher than O(log log n) but lower than O(n). Simulation studies are conducted to verify the selection consistency of the Bayesian information criterion.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11071022, 11471105) and the Science and Technology Research Projects of the Educational Department of Hubei Province (Grant No. Q20132505).
Abstract: The paper studies a generalized linear model (GLM) $y_t = h(x_t^T \beta) + \varepsilon_t$, $t = 1, 2, \ldots, n$, where $\varepsilon_1 = \eta_1$, $\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$, $t = 2, 3, \ldots, n$, $h$ is a continuously differentiable function, and the $\eta_t$ are independent and identically distributed random errors with zero mean and finite variance $\sigma^2$. Firstly, the quasi-maximum likelihood (QML) estimators of $\beta$, $\rho$ and $\sigma^2$ are given. Secondly, under mild conditions, the asymptotic properties (including the existence, weak consistency and asymptotic distribution) of the QML estimators are investigated. Lastly, the validity of the method is illustrated by a simulation example.
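To make the error structure in the abstract above concrete, here is a minimal simulation sketch of a GLM with first-order autoregressive errors, where the first error equals the first innovation and each later error is ρ times the previous error plus a new innovation. The specific choices are assumptions for illustration and do not come from the paper: h = tanh, Gaussian innovations, the parameter values, and the function name simulate_glm_ar1. The lag-one least-squares estimate of ρ computed at the end uses the simulated errors directly and is only a sanity check, not the paper's QML estimator.

```python
import math
import random


def simulate_glm_ar1(beta, rho, sigma, xs, h=math.tanh, seed=0):
    """Simulate y_t = h(x_t' beta) + eps_t with AR(1) errors eps_t."""
    rng = random.Random(seed)
    eps, ys = [], []
    for t, x in enumerate(xs):
        eta = rng.gauss(0.0, sigma)  # i.i.d. innovation, mean 0, variance sigma^2
        # eps_1 = eta_1; eps_t = rho * eps_{t-1} + eta_t for t >= 2.
        e = eta if t == 0 else rho * eps[-1] + eta
        eps.append(e)
        linear = sum(b * xj for b, xj in zip(beta, x))
        ys.append(h(linear) + e)
    return ys, eps


design_rng = random.Random(1)
xs = [[1.0, design_rng.uniform(-1.0, 1.0)] for _ in range(5000)]
beta, rho, sigma = [1.0, -0.5], 0.6, 1.0  # hypothetical parameter values

ys, eps = simulate_glm_ar1(beta, rho, sigma, xs)

# Sanity check: least-squares estimate of rho from the simulated errors.
num = sum(eps[t] * eps[t - 1] for t in range(1, len(eps)))
den = sum(eps[t - 1] ** 2 for t in range(1, len(eps)))
rho_hat = num / den
```

In practice the errors are unobserved, so an estimator of ρ has to work from residuals after estimating β; the sketch only shows the data-generating process.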
Funding: Supported by the Major Program of the Natural Science Foundation of China under Grant No. 71690242, the Natural Science Foundation of China under Grant No. 11471252, and the National Social Science Fund of China under Grant No. 18BTJ040.
Abstract: In this paper, for generalized linear models (GLMs) with a diverging number of covariates, the asymptotic properties of maximum quasi-likelihood estimators (MQLEs) are developed under some regularity conditions. The existence, weak convergence and rate of convergence of the MQLEs, the asymptotic normality of linear combinations of MQLEs, and the asymptotic distribution of single linear hypothesis test statistics are presented. The results are illustrated by Monte Carlo simulations.
Funding: The research of Fang was supported by the National Key R&D Program of China [grant numbers 2021YFA1000100, 2021YFA1000101] and the National Natural Science Foundation of China [grant numbers 11831008, 12071143].
Abstract: Fragmentary data is becoming more and more common in many areas, which brings big challenges to researchers and data analysts. Most existing methods for fragmentary data consider a continuous response, while in many applications the response variable is discrete. In this paper, we propose a model averaging method for generalized linear models in fragmentary data prediction. The candidate models are fitted based on different combinations of covariate availability and sample size. The optimal weight is selected by minimizing the Kullback-Leibler loss in the complete cases, and its asymptotic optimality is established. Empirical evidence from a simulation study and a real data analysis of Alzheimer's disease is presented.
Abstract: For generalized linear models (GLM), in the case that the regressors are stochastic and have different distributions and the observations of the responses may have different dimensionality, the asymptotic theory of the maximum likelihood estimate (MLE) of the parameters is studied under the assumption of a non-natural link function.
Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 11571112, 11501372, 11571148, 11471160, the Doctoral Fund of the Ministry of Education of China under Grant No. 20130076110004, the Program of Shanghai Subject Chief Scientist under Grant No. 14XD1401600, and the 111 Project of China under Grant No. B14019.
Abstract: This paper considers iterative sequential lasso (ISLasso) variable selection for generalized linear models with ultrahigh-dimensional feature space. ISLasso selects features sequentially and iteratively, using estimated parameters in a second-order approximation of the likelihood function, where the features selected depend on the regularization parameters. The procedure stops when the extended BIC (EBIC) reaches a minimum. A simulation study demonstrates that the new method compares favorably with other methods.