Time series forecasting has become an important aspect of data analysis and has many real-world applications. However, undesirable missing values are often encountered, which may adversely affect many forecasting tasks. In this study, we evaluate and compare the effects of imputation methods for estimating missing values in a time series. Our approach does not include a simulation to generate pseudo-missing data; instead, it performs imputation on actual missing data and measures the performance of the forecasting model created therefrom. In the experiment, several time series forecasting models are trained using different training datasets, each prepared with a different imputation method. Subsequently, the performance of the imputation methods is evaluated by comparing the accuracy of the forecasting models. The results obtained from a total of four experimental cases show that the k-nearest neighbor technique is the most effective in reconstructing missing data and contributes positively to time series forecasting compared with other imputation methods.
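As a rough illustration of the k-nearest neighbor idea mentioned in this abstract, the sketch below fills each missing point with the mean of the k observed points closest to it in time. The abstract does not say which features or distance the study used, so the time-index distance, the function name `knn_impute`, and the toy series are all assumptions made for illustration only.

```python
from typing import List, Optional


def knn_impute(series: List[Optional[float]], k: int = 2) -> List[float]:
    """Fill None entries with the mean of the k observed values nearest in time.

    Assumption: "nearest" means smallest distance between time indices.
    Ties are broken in favour of the earlier index (Python's sort is stable).
    """
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values to impute from")

    filled: List[float] = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(float(v))
            continue
        # Rank observed points by how far they are from the missing index.
        nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
        filled.append(sum(val for _, val in nearest) / len(nearest))
    return filled


# Toy series (hypothetical data) with one missing value in the middle.
series = [1.0, 2.0, None, 4.0, 5.0]
imputed = knn_impute(series, k=2)
```

In the setup the abstract describes, a series filled this way would then be used to train a forecasting model, and the imputation method would be judged by that model's accuracy rather than by the filled values themselves.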
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of the genes were selected at random under two types of missingness mechanisms, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from the original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways using ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of the various imputation methods tested. The source code and selected datasets are available at http://profiles.bs.ipm.ir/softwares/imputationmethods/.
Based on the two-dimensional relation table, this paper studies the missing values in the sample data of land prices in Shunde District, Foshan City. GeoDa software was used to eliminate the insignificant factors by stepwise regression analysis; NORM software was adopted to construct the multiple imputation models; and the EM algorithm and the augmentation algorithm were applied to fit multiple linear regression equations and construct five different filled datasets. Statistical analysis is performed on the imputed datasets to calculate the mean and variance of each one, and the weights are determined according to the differences. Finally, comprehensive integration is implemented to obtain the imputation expression for the missing values. Three missing-data cases were tested: the PRICE variable missing at a deletion rate of 5%, the PRICE variable missing at a deletion rate of 10%, and both the PRICE and CBD variables missing. In these cases, the ratio of estimates closer to the true value under the new method versus the traditional multiple imputation method was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. Therefore, the new method clearly outperforms the traditional multiple imputation methods, and the missing values it estimates have some reference value.
Purpose - Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently. Design/methodology/approach - The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), and mean and mode imputation methods in an optimum way. The performance of HIOC has been compared with that of MICE, KNN, and the mean and mode methods. Four classifiers, support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT), have been used to evaluate the performance of the imputation methods. Findings - The results show that HIOC performed efficiently even with a high rate of missing values. It reduced the root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset. Originality/value - The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.
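The RMSE reductions quoted in this abstract are relative improvements of one imputation method over another. A minimal sketch of how such a figure could be computed is given below; the helper names `rmse` and `percent_reduction` and all the numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def rmse(true_values: Sequence[float], imputed_values: Sequence[float]) -> float:
    """Root mean square error between true and imputed values."""
    if len(true_values) != len(imputed_values) or len(true_values) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error measure, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Made-up values: the held-out truth and the output of two imputation methods.
truth = [1.0, 2.0, 3.0, 4.0]
baseline_imputed = [2.0, 3.0, 4.0, 5.0]   # every estimate is off by 1.0
improved_imputed = [1.5, 2.5, 3.5, 4.5]   # every estimate is off by 0.5

baseline_rmse = rmse(truth, baseline_imputed)
improved_rmse = rmse(truth, improved_imputed)
reduction = percent_reduction(baseline_rmse, improved_rmse)
```

With these toy numbers the baseline RMSE is 1.0, the improved RMSE is 0.5, and the reduction is 50%.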
Funding: Supported by the National Natural Science Foundation of China (Grant No. 30800776), the State High-Tech Development Plan of China (Grant No. 2008AA101002), and the Recommend International Advanced Agricultural Science and Technology Plan of China (Grant No. 2011-G2A).
Funding: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant Number 2020R1A6A1A03040583).
Funding: Supported by the School of Biological Sciences of the Institute for Research in Fundamental Sciences (IPM) and by the Institute for Computational Biomedicine of Weill Cornell Medicine.
Funding: This research was financially supported by FDCT No. 005/2018/A1, and also by the Guangdong Provincial Innovation and Entrepreneurship Training Program (Project No. 201713719017) and the College Students Innovation Training Program held by Guangdong University of Science and Technology (Nos. 1711034, 1711080, and 1711088).