Funding: Supported by the Stiftung Rheinland-Pfalz für Innovation (959).
Abstract: A growing number of programs for multiple imputation of missing values are becoming available in statistical software. Two algorithms are most commonly implemented: Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE). They have been shown to work well in large samples or when only small proportions of missing data are to be imputed. However, some researchers have begun to impute large proportions of missing data or to apply the method to small samples. A simulation was performed using MICE on datasets with 50, 100 or 200 cases and four or eleven variables. A varying proportion of data (3% to 63%) was set as missing completely at random and subsequently imputed using MICE. In a logistic regression model, four coefficients were examined: non-zero and zero main effects as well as non-zero and zero interaction effects. Estimates of all main and interaction effects were unbiased. There was considerable variance in the estimates, increasing with the proportion of missing data and decreasing with sample size. Imputation of missing data by chained equations is a useful tool for imputing small to moderate proportions of missing data. The method has its limits, however: in small samples, there are considerable random errors for all effects.
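The missing-completely-at-random (MCAR) step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the function name `mask_mcar`, the toy data, and the choice to delete an exact number of cells (rather than deleting each cell with a fixed probability) are hypothetical and are not taken from the study, which does not describe its implementation.

```python
import random


def mask_mcar(rows, proportion, seed=0):
    """Return a copy of `rows` with a share of cells set to None.

    Cells are drawn uniformly at random over the whole table, so whether a
    value is missing does not depend on any observed or unobserved value
    (the MCAR assumption).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(proportion * len(cells))
    masked = [list(row) for row in rows]  # leave the input untouched
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


# Hypothetical toy data: 50 cases and four variables, matching the
# smallest design mentioned in the abstract.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]

masked = mask_mcar(data, 0.30)
n_missing = sum(value is None for row in masked for value in row)
```

A table masked this way would then be passed to a MICE implementation before the logistic regression model is fitted; that step is omitted here because the abstract does not name the software used.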
Abstract: In this paper, we derive the stochastic maximum principle for optimal control problems of a forward-backward Markovian regime-switching system. The control system is described by an anticipated forward-backward stochastic pantograph equation and modulated by a continuous-time finite-state Markov chain. By virtue of the classical variational approach, the duality method, and convex analysis, we obtain a stochastic maximum principle for the optimal control.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 22076129), the Sichuan Key R&D Project (Grant No. 2020YFS0055), and the Chengdu Major Technology Application and Demonstration Project (Grant No. 2020-YF09-00031-SN).
Abstract: As most air quality monitoring sites worldwide are in urban areas, machine learning models may produce substantial estimation bias in rural areas when deriving spatiotemporal distributions of air pollutants. The bias stems from dataset shift, as the density distributions of predictor variables differ greatly between urban and rural areas. We propose a data-augmentation approach based on multiple imputation by chained equations (MICE-DA) to remedy the dataset shift problem. Compared with the benchmark models, MICE-DA exhibits superior predictive performance in deriving the spatiotemporal distributions of hourly PM2.5 in Chengdu, a megacity at the foot of the Tibetan Plateau, especially in correcting the estimation bias, with the mean bias decreasing from -3.4 µg/m³ to -1.6 µg/m³. As a complement to the holdout validation, the semi-variance results show that MICE-DA preserves the spatial autocorrelation pattern of PM2.5 over the study area well. The essence of MICE-DA is strengthening the correlation between PM2.5 and aerosol optical depth (AOD) during the data augmentation. Consequently, the importance of AOD for predicting PM2.5 is greatly enhanced, and the summed relative importance of the two satellite-retrieved AOD variables increases from 5.5% to 18.4%. This study resolves the puzzle of why AOD has exhibited relatively low importance in local or regional studies. The results can advance the use of satellite remote sensing in air quality modeling while drawing more attention to the common dataset shift problem in data-driven environmental research.
Abstract: Purpose: Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently. Design/methodology/approach: The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), and mean and mode imputation methods in an optimum way. The performance of HIOC has been compared to that of MICE, KNN, and mean and mode methods. Four classifiers, namely support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of the imputation methods. Findings: The results show that HIOC performed efficiently even with a high rate of missing values. It reduced root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset. Originality/value: The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.
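To make the RMSE comparison concrete, the sketch below scores a plain mean-imputation baseline on a single numeric column. It is not the HIOC method, and it rests on an assumption the abstract does not state: that RMSE is computed only over cells that were removed artificially, so that the true values are known. The function names and the five-value column are hypothetical.

```python
import math


def column_mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [value for value in column if value is not None]
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]


def imputation_rmse(true_column, masked_column, imputed_column):
    """Root mean square error over the cells that were masked (None)."""
    squared_errors = [
        (truth - imputed) ** 2
        for truth, masked, imputed in zip(true_column, masked_column, imputed_column)
        if masked is None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical column: two of five known values are hidden, then imputed.
true_column = [1.0, 2.0, 3.0, 4.0, 5.0]
masked_column = [1.0, None, 3.0, None, 5.0]

imputed_column = column_mean_impute(masked_column)
rmse = imputation_rmse(true_column, masked_column, imputed_column)
```

Under this setup, a percentage reduction in RMSE like the ones reported would compare the score of one imputation method against another on the same masked cells.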