Funding: Supported by the Chinese 111 Project B14019, the US National Science Foundation under Grant Nos. DMS-1305474 and DMS-1612873, and the US National Institutes of Health Award UL1TR001412.
Abstract: The generalized linear model is an indispensable tool for analyzing non-Gaussian response data, with both canonical and non-canonical link functions widely used. When missing values are present, many existing methods in the literature depend heavily on an unverifiable assumption about the missing data mechanism, and they fail when the assumption is violated. This paper proposes a missing data mechanism that is as generally applicable as possible: it covers both ignorable and nonignorable missing data, as well as missing values in either the response or the covariates. Under this general missing data mechanism, the authors adopt an approximate conditional likelihood method to estimate the unknown parameters. The authors rigorously establish the regularity conditions under which the unknown parameters are identifiable under the approximate conditional likelihood approach. For parameters that are identifiable, the authors prove the asymptotic normality of the estimators obtained by maximizing the approximate conditional likelihood. Simulation studies are conducted to evaluate the finite-sample performance of the proposed estimators alongside estimators from some existing methods. Finally, the authors present a biomarker analysis from a prostate cancer study to illustrate the proposed method.
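For orientation, the terms used in this abstract can be written in standard textbook notation. This is a generic sketch, not the paper's own mechanism or likelihood: the symbols $g$, $\beta$ and the observation indicator $R$ are assumptions introduced here, and the missing response case is shown only as one common illustration.

```latex
% Generalized linear model with link function g (canonical or non-canonical):
g\bigl(\mathbb{E}[Y \mid X]\bigr) = X^{\top}\beta

% Let R = 1 if the response Y is observed and R = 0 otherwise.
% A typical ignorable case (missing at random): missingness does not depend on Y
P(R = 1 \mid Y, X) = P(R = 1 \mid X)

% Nonignorable case: missingness depends on the possibly unobserved Y itself,
% so P(R = 1 \mid Y, X) remains a non-trivial function of Y.
```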
Funding: Funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under grant No. (PH: 13-130-1442).
Abstract: The accuracy of a statistical learning model depends on the learning technique used, which in turn depends on the dataset's values. In most research studies, the existence of missing values (MVs) is a critical problem. In addition, a dataset with MVs cannot be used for further analysis or with data-driven tools, especially when the percentage of MVs is high. In this paper, the authors propose a novel algorithm for dealing with MVs based on feature selection (FS) using a similarity classifier with a fuzzy entropy measure. The proposed algorithm imputes MVs in cumulative order. The candidate feature to be manipulated is selected using a similarity classifier with Parkash's fuzzy entropy measure. The predictive model used to predict MVs within the candidate feature is Bayesian Ridge Regression (BRR). Furthermore, each imputed feature is incorporated into the BRR equation to impute the MVs in the next chosen incomplete feature. The proposed algorithm was compared against several practical state-of-the-art imputation methods in an experiment on four medical datasets, gathered from several database repositories, with MVs generated under the three missingness mechanisms. Mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination (R² score) were used to measure performance. The results showed that performance varies depending on the size of the dataset, the amount of MVs, and the type of missingness mechanism. Moreover, compared with the other methods, the proposed method gives better accuracy and lower error in most cases.
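The three evaluation metrics named in the abstract have standard definitions, sketched below in plain Python. This is a minimal illustration and not the authors' evaluation code: the function names and the toy values in `true_values` and `imputed_values` are hypothetical and chosen only to show how imputed entries would be scored against the values that were masked out.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: values that were masked out, and their imputations.
true_values = [2.0, 4.0, 6.0, 8.0]
imputed_values = [2.5, 3.5, 6.0, 9.0]

mae = mean_absolute_error(true_values, imputed_values)
rmse = root_mean_square_error(true_values, imputed_values)
r2 = r2_score(true_values, imputed_values)
print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

Lower MAE and RMSE and a higher R² score indicate imputations closer to the masked values, which is the sense in which the abstract reports "better accuracy and lower error".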