Funding: This research was financially supported by FDCT No. 005/2018/A1, and also by Guangdong Provincial Innovation and Entrepreneurship Training Program Project No. 201713719017 and the College Students Innovation Training Program held by Guangdong University of Science and Technology, Nos. 1711034, 1711080, and 1711088.
Abstract: Based on a two-dimensional relation table, this paper studies the missing values in sample data on land prices in Shunde District, Foshan City. GeoDa software was used to eliminate insignificant factors through stepwise regression analysis; NORM software was adopted to construct the multiple imputation models; and the EM algorithm and the data augmentation algorithm were applied to fit multiple linear regression equations, producing five different imputed datasets. Statistical analysis was performed on each imputed dataset to calculate its mean and variance, and weights were determined according to the differences among them. Finally, the weighted datasets were comprehensively integrated to obtain the imputation expression for the missing values. In three missingness scenarios (the PRICE variable missing at a 5% rate, the PRICE variable missing at a 10% rate, and both the PRICE and CBD variables missing), the new method produced estimates closer to the true values than the traditional multiple imputation method at ratios of 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. The new method is therefore clearly better than the traditional multiple imputation methods, and the missing values estimated by it have practical reference value.
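To make the combination step concrete, here is a minimal Python sketch of pooling several imputed datasets with weights derived from their variances. The inverse-variance weighting rule and the "price" column name are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np
import pandas as pd

def combine_imputations(datasets, column):
    """Pool one column across m completed (imputed) datasets.

    Weights are inversely proportional to each dataset's sample variance,
    an assumed stand-in for the paper's difference-based weighting.
    """
    # shape (n_rows, m): one column per completed dataset
    values = np.column_stack([d[column].to_numpy() for d in datasets])
    weights = 1.0 / values.var(axis=0, ddof=1)  # per-dataset variance
    weights /= weights.sum()
    return values @ weights  # weighted combination, one value per row

# toy usage: five imputed copies of a hypothetical "price" column
rng = np.random.default_rng(0)
imputed = [pd.DataFrame({"price": 100 + rng.normal(0, 5, size=20)}) for _ in range(5)]
pooled = combine_imputations(imputed, "price")
print(pooled[:5])
```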
Abstract: Multiple imputation compensates for missing data and produces multiple datasets via a regression model, and is considered the solution to the old problem of univariate imputation. Univariate imputation fills in data only in the specific column where a cell is missing. Multivariate imputation works simultaneously with all variables in all columns, whether missing or observed. It has emerged as a principal method of solving missing data problems. Incomplete datasets analyzed before Multiple Imputation by Chained Equations (MICE) was available were often misdiagnosed; the results obtained were invalid and should not be counted on to yield reasonable conclusions. This article highlights why multiple imputation is needed and how MICE works, with a particular focus on a cyber-security dataset. Removing missing data from any dataset and replacing it is imperative for analyzing the data and creating prediction models. A good imputation technique should therefore recover the missingness, which involves extracting the good features. However, the widely used univariate imputation method does not impute missingness reasonably if the values are too large and may thus lead to bias. We therefore aim to propose an alternative imputation method that is efficient and removes potential bias after removing the missingness.
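As a concrete illustration of chained-equations imputation, the following sketch uses scikit-learn's IterativeImputer, which cycles over columns and models each incomplete variable as a function of the others. The network-traffic-style column names are invented for illustration; this is not the article's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "packets": rng.poisson(100, 200).astype(float),
    "duration": rng.exponential(3.0, 200),
    "bytes": rng.normal(5000, 800, 200),
})
# knock out ~10% of values completely at random
mask = rng.random(df.shape) < 0.10
df = df.mask(mask)

# each incomplete column is regressed on the others, cycling until stable
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(completed.isna().sum())  # all zeros after imputation
```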
Funding: Supported by the Stiftung Rheinland-Pfalz für Innovation (959).
Abstract: A growing number of programs for the multiple imputation of missing values are currently becoming available in statistical software. Two algorithms in particular are widely implemented: Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE). They have been shown to work well in large samples or when only small proportions of missing data are to be imputed. However, some researchers have begun to impute large proportions of missing data or to apply the method to small samples. A simulation was performed using MICE on datasets with 50, 100, or 200 cases and four or eleven variables. A varying proportion of data (3% to 63%) was set as missing completely at random and subsequently substituted using multiple imputation by chained equations. In a logistic regression model, four coefficients were examined: non-zero and zero main effects as well as non-zero and zero interaction effects. Estimates of all main and interaction effects were unbiased. There was considerable variance in the estimates, increasing with the proportion of missing data and decreasing with sample size. The imputation of missing data by chained equations is a useful tool for imputing small to moderate proportions of missing data. The method has its limits, however: in small samples, there are considerable random errors for all effects.
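The simulation design described above can be sketched compactly: generate a small sample, set a fraction of values missing completely at random, impute m times, fit the logistic regression on each completed dataset, and average the coefficients (Rubin's rules for point estimates). The sample size, missing rate, and coefficient values below are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, m = 100, 20                      # cases, number of imputations
X = rng.normal(size=(n, 4))
logit = 0.8 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1]   # non-zero main + interaction effect
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.20] = np.nan        # 20% MCAR

coefs = []
for i in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    Xc = imp.fit_transform(X_miss)
    Xd = np.column_stack([Xc, Xc[:, 0] * Xc[:, 1]])  # add interaction term
    # penalty=None needs sklearn >= 1.2; use penalty='none' on older versions
    fit = LogisticRegression(penalty=None).fit(Xd, y)
    coefs.append(fit.coef_.ravel())

print(np.mean(coefs, axis=0))  # pooled estimates across imputations
```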
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 11871287, 11501208, 11771144, and 11801359; the Natural Science Foundation of Tianjin under Grant No. 18JCYBJC41100; the Fundamental Research Funds for the Central Universities; and the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin.
Abstract: Empirical-likelihood-based inference for parameters defined by the general estimating equations of Qin and Lawless (1994) remains an active research topic. When the response is missing at random (MAR) and the dimension of the covariate is not low, the authors propose a two-stage estimation procedure that uses dimension-reduced kernel estimators in conjunction with an unbiased estimating function based on augmented inverse probability weighting and multiple imputation (AIPW-MI) methods. The authors show that the resulting estimator achieves consistency and asymptotic normality. In addition, the corresponding empirical likelihood ratio statistics asymptotically follow central chi-square distributions when evaluated at the true parameter. The finite-sample performance of the proposed estimator is studied through simulation, and an application to an HIV CD4 dataset is also presented.
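For readers unfamiliar with AIPW, here is a textbook-style sketch of the augmented inverse probability weighting idea for estimating a mean under MAR: a missingness (propensity) model is combined with an outcome regression so that the estimator is consistent if either model is correct. It stands in for, rather than reproduces, the authors' dimension-reduced kernel procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)
p_obs = 1 / (1 + np.exp(-(0.5 + X[:, 0])))       # MAR: missingness depends on X only
R = rng.random(n) < p_obs                         # observation indicator

pi_hat = LogisticRegression().fit(X, R).predict_proba(X)[:, 1]  # propensity model
m_hat = LinearRegression().fit(X[R], y[R]).predict(X)           # outcome regression

y_filled = np.where(R, y, 0.0)                    # unobserved y enters only via m_hat
aipw = np.mean(R * y_filled / pi_hat - (R - pi_hat) / pi_hat * m_hat)
print(aipw, y.mean())                             # AIPW estimate vs. truth
```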
Abstract: In his 1987 classic book on multiple imputation (MI), Rubin used the fraction of missing information, γ, to define the relative efficiency (RE) of MI as RE = (1 + γ/m)^(−1/2), where m is the number of imputations, leading to the conclusion that a small m (≤5) would be sufficient for MI. However, evidence has been accumulating that many more imputations are needed. Why would the apparently sufficient m deduced from the RE be actually too small? The answer may lie with γ. In this research, γ was determined at fractions of missing data (δ) of 4%, 10%, 20%, and 29% using the 2012 Physician Workflow Mail Survey of the National Ambulatory Medical Care Survey (NAMCS). The γ values were strikingly small, ranging on the order of 10^(−6) to 0.01. As δ increased, γ usually increased but sometimes decreased. How the data were analysed had the dominating effect on γ, overshadowing the effect of δ. The results suggest that it is impossible to predict γ using δ and that it may not be appropriate to use the γ-based RE to determine a sufficient m.
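The RE formula quoted above is easy to evaluate numerically; the short sketch below shows why a small m looks sufficient by this criterion even for sizable fractions of missing information.

```python
import numpy as np

# relative efficiency RE = (1 + gamma/m)**(-1/2), as quoted in the abstract
gammas = np.array([1e-6, 0.01, 0.1, 0.3, 0.5])   # fraction of missing information
for m in (3, 5, 20, 100):                         # number of imputations
    re = (1 + gammas / m) ** -0.5
    print(f"m={m:3d}  RE={np.round(re, 4)}")
```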
Abstract: This study computes the durability of Return on Assets (ROA) in small and medium enterprises from different sample datasets. Utilizing information from the Financial Statements Statistics of Corporations by Industry, it verifies the precision of correlation coefficients using the Non-iterative Bayesian-based Imputation (NIBAS) and multiple imputation methods for all combinations of common variables with auxiliary files. The paper has three important findings. First, statistical matching estimates of higher precision can be obtained using key variable sets with higher canonical correlation coefficients. Second, even if the key variable sets have high canonical correlation coefficients, key variables that are extremely strongly correlated with target variables and have high kurtosis should not be used. Finally, using auxiliary files can improve the precision of statistical matching estimates. On this basis, the durability of ROA in small and medium enterprises is computed. The author finds that the series of ROA correlations fluctuates more for smaller enterprises than for larger ones, and thus the vulnerability of ROA in small and medium enterprises can be clarified via statistical matching.
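The key-variable screening idea rests on canonical correlations between the common variables of the two files being matched. Below is a minimal sketch using scikit-learn's CCA on invented data; it illustrates the screening criterion only, not the NIBAS machinery.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
n = 500
shared = rng.normal(size=(n, 2))                  # latent structure shared by the files
file_a = shared @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(n, 3))
file_b = shared @ rng.normal(size=(2, 2)) + 0.5 * rng.normal(size=(n, 2))

# first canonical correlation between the two candidate key-variable sets
cca = CCA(n_components=1).fit(file_a, file_b)
u, v = cca.transform(file_a, file_b)
first_cc = np.corrcoef(u.ravel(), v.ravel())[0, 1]
print(f"first canonical correlation: {first_cc:.3f}")
```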
Funding: Supported by the National Natural Science Foundation of China (Grant No. 22076129), the Sichuan Key R&D Project (Grant No. 2020YFS0055), and the Chengdu Major Technology Application and Demonstration Project (Grant No. 2020-YF09-00031-SN).
Abstract: As most air quality monitoring sites worldwide are in urban areas, machine learning models may produce substantial estimation bias in rural areas when deriving spatiotemporal distributions of air pollutants. The bias stems from the issue of dataset shift, as the density distributions of predictor variables differ greatly between urban and rural areas. We propose a data-augmentation approach based on multiple imputation by chained equations (MICE-DA) to remedy the dataset shift problem. Compared with the benchmark models, MICE-DA exhibits superior predictive performance in deriving the spatiotemporal distributions of hourly PM2.5 in the megacity (Chengdu) at the foot of the Tibetan Plateau, especially for correcting the estimation bias, with the mean bias decreasing from −3.4 µg/m³ to −1.6 µg/m³. As a complement to the holdout validation, the semi-variance results show that MICE-DA decently preserves the spatial autocorrelation pattern of PM2.5 over the study area. The essence of MICE-DA is strengthening the correlation between PM2.5 and aerosol optical depth (AOD) during the data augmentation. Consequently, the importance of AOD is largely enhanced for predicting PM2.5, and the summed relative importance value of the two satellite-retrieved AOD variables increases from 5.5% to 18.4%. This study resolves the puzzle that AOD exhibited relatively low importance in local or regional studies. The results of this study can advance the utilization of satellite remote sensing in modeling air quality while drawing more attention to the common dataset shift problem in data-driven environmental research.
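The augmentation step can be pictured with a short sketch: rows carrying rural predictor values but no PM2.5 label are appended to the labeled urban data, and a chained-equations imputer fills the label from its relation to predictors such as AOD. The column names and data below are illustrative, not the study's feature set.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
urban = pd.DataFrame({
    "aod": rng.gamma(2.0, 0.3, 300),
    "temp": rng.normal(20, 5, 300),
})
urban["pm25"] = 60 * urban["aod"] + 0.5 * urban["temp"] + rng.normal(0, 5, 300)

rural = pd.DataFrame({                 # predictors observed, label missing
    "aod": rng.gamma(1.2, 0.3, 100),
    "temp": rng.normal(15, 5, 100),
    "pm25": np.nan,
})

# impute the missing label jointly, exploiting the pm25-aod relation
combined = pd.concat([urban, rural], ignore_index=True)
augmented = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(combined),
    columns=combined.columns,
)
print(augmented.tail())                # rural rows now carry imputed pm25
```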
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11771431, 11690015, 11926341, 11901128, and 11601097); the Key Laboratory of RCSDS, CAS (Grant No. 2008DP173182); and the Natural Science Foundation of Guangdong Province of China (Grant No. 2018A030310068).
Abstract: In some situations, the failure time of interest is defined as the gap time between two related events, and the observations on both event times can suffer either right or interval censoring. Such data are usually referred to as doubly censored data and are frequently encountered in many clinical and observational studies. Additionally, there may also exist a cured subgroup in the whole population, meaning that not every individual under study will eventually experience the failure time of interest. In this paper, we consider regression analysis of doubly censored data with a cured subgroup under a wide class of flexible transformation cure models. Specifically, we consider marginal likelihood estimation and develop a two-step approach that combines multiple imputation with a new expectation-maximization (EM) algorithm for its implementation. The resulting estimators are shown to be consistent and asymptotically normal. The finite-sample performance of the proposed method is investigated through simulation studies. The proposed method is also applied to a real dataset arising from an AIDS cohort study for illustration.
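As a rough intuition for the imputation step only, the toy sketch below draws each interval-censored time uniformly within its censoring interval, fits a simple parametric model per completed dataset, and pools the estimates. The actual method embeds such imputations inside an EM algorithm for transformation cure models, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(11)
n, m = 300, 10
true_times = rng.exponential(2.0, n)          # unobserved failure times
left = np.floor(true_times)                   # only censoring intervals observed
right = left + 1.0

rates = []
for _ in range(m):
    t_imp = rng.uniform(left, right)          # impute a time within each interval
    rates.append(1.0 / t_imp.mean())          # exponential rate MLE per dataset
print(f"pooled rate estimate: {np.mean(rates):.3f} (true 0.5)")
```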
Funding: Supported by the Graduate School at Washington State University and by the National Science Foundation (No. SMA-1620462) to the Santa Fe Institute and Washington State University.
Abstract: Although Gregory Johnson's models have influenced social theory in archaeology, few have applied or built upon these models to predict aspects of social organization, group size, or fissioning. Exceptions have been limited to small case studies. Recently, the relationship between a society's scale and its information-processing capacities has been explored using the Seshat Databank. Here, I apply multiple linear regression analysis to the Seshat data using Turchin and colleagues' nine "complexity characteristics" (CCs) to further examine the relationship between the hierarchy CC and the remaining eight CCs, which include both aspects of a polity's scale and aspects of what Kohler et al. call "collective computation". The results support Johnson's ideas that stratification will generally increase with increases in a polity's scale (population, territory); however, stratification is also higher when polities increase their development of information-processing variables such as texts.
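The regression design can be sketched as an ordinary least squares fit of the hierarchy CC on the remaining CCs. The column names below are placeholders (the paper uses eight predictor CCs, four appear here), and the Seshat data themselves are not reproduced.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
cc = pd.DataFrame(
    rng.normal(size=(n, 4)),
    columns=["population", "territory", "texts", "money"],  # placeholder CCs
)
hierarchy = 0.6 * cc["population"] + 0.3 * cc["texts"] + rng.normal(0, 0.5, n)

X = sm.add_constant(cc)                       # intercept plus predictor CCs
model = sm.OLS(hierarchy, X).fit()
print(model.summary().tables[1])              # coefficient on each CC
```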