The frequent missing values in radar-derived time-series tracks of aerial targets (RTT-AT) lead to significant challenges in subsequent data-driven tasks. However, the majority of imputation research focuses on random missing (RM) patterns, which differ significantly from the common missing patterns of RTT-AT. Methods designed for RM may suffer performance degradation or fail when applied to RTT-AT imputation. Conventional autoregressive deep learning methods are prone to error accumulation and long-term dependency loss. In this paper, a non-autoregressive imputation model is proposed that addresses missing value imputation for two common missing patterns in RTT-AT. Our model consists of two probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. It learns missing values by combining the representations output by the two units, aiming to minimize the difference between the imputed values and their actual values. The PSDMSA units capture temporal dependencies and attribute correlations between time steps, improving imputation quality. The weight fusion unit automatically updates the weights of the output representations from the two units to obtain a more accurate final representation. The experimental results indicate that, despite varying missing rates in the two missing patterns, our model consistently outperforms other methods in imputation performance and exhibits a low frequency of deviations in estimates for specific missing entries. Compared to the state-of-the-art autoregressive deep learning imputation model Bidirectional Recurrent Imputation for Time Series (BRITS), our proposed model reduces mean absolute error (MAE) by 31% to 50%. Additionally, the model trains 4 to 8 times faster than both BRITS and a standard Transformer model on the same dataset. Finally, the ablation experiments demonstrate that the PSDMSA, the weight fusion unit, the cascade network design, and the imputation loss each enhance imputation performance, confirming the efficacy of our design.
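The MAE comparison reported in this abstract can be illustrated with a small sketch. It assumes that imputation error is measured only on entries flagged as missing, which is a common evaluation convention but is not stated in the abstract; the function names and the toy numbers are hypothetical.

```python
def masked_mae(truth, imputed, missing_mask):
    """Mean absolute error over entries whose mask value is truthy (i.e. imputed entries)."""
    errors = [abs(t - p) for t, p, m in zip(truth, imputed, missing_mask) if m]
    if not errors:
        raise ValueError("mask selects no entries")
    return sum(errors) / len(errors)


def relative_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae


# Toy data (hypothetical): entries 1 and 3 were missing and have been imputed.
truth = [1.0, 2.0, 3.0, 4.0]
mask = [0, 1, 0, 1]
baseline_imputed = [1.0, 2.8, 3.0, 3.2]
model_imputed = [1.0, 2.4, 3.0, 3.6]

baseline_mae = masked_mae(truth, baseline_imputed, mask)
model_mae = masked_mae(truth, model_imputed, mask)
reduction = relative_reduction(baseline_mae, model_mae)
```

In this toy case the second imputation halves the error on the masked entries, which corresponds to the upper end of the reduction range quoted above.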
Compositional data, which carry relative information, are important in machine learning and related fields. They are typically recorded as closed data, that is, data that sum to a constant such as 100%. The statistical linear model is the most widely used technique for identifying hidden relationships between underlying random variables of interest, and linear regression is applied in many settings for this purpose. However, data quality is a significant challenge in machine learning, especially when data are missing. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, and recovering them can be costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. Using the current parameter estimate, the expectation (E) step constructs the expected log-likelihood function. The maximization (M) step then finds the parameters that maximize the expected log-likelihood determined in the E step. This study examined how well the EM algorithm performed on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
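The E/M alternation described in this abstract can be sketched for the simplest case: a one-predictor linear regression in which some responses are missing. This is an illustrative simplification, not the study's compositional setup; it tracks only the regression coefficients (not the error variance), and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def em_regression_impute(xs, ys, max_iter=200, tol=1e-10):
    """Alternate between filling missing responses (E step) and refitting (M step)."""
    observed = [y for y in ys if y is not None]
    start = sum(observed) / len(observed)
    # Initial fill: mean of the observed responses.
    filled = [start if y is None else y for y in ys]
    slope, intercept = fit_line(xs, filled)
    for _ in range(max_iter):
        # E step: replace each missing response by its expected value under the current fit.
        filled = [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]
        # M step: refit the regression on the completed data.
        new_slope, new_intercept = fit_line(xs, filled)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        slope, intercept = new_slope, new_intercept
        if converged:
            break
    return slope, intercept, filled


# Toy data (hypothetical): observed points lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, None, 7.0, None, 11.0]
slope, intercept, completed = em_regression_impute(xs, ys)
```

With responses missing only in y, the iteration settles on the fit obtained from the observed pairs alone, so the imputed values fall on that fitted line.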
Genotype imputation has become an indispensable part of genomic data analysis. In recent years, imputation based on a multi-breed reference population has received more attention, but relevant studies in pigs are scarce. In this study, we used the Illumina Porcine SNP50 Bead Chip to investigate how imputation accuracy varies with several influencing factors and compared the imputation performance of four commonly used imputation software programs. The results indicated that imputation accuracy increased as the validation population marker density, the reference population sample size, or the minor allele frequency (MAF) increased. However, imputation accuracy decreased to some extent when the pig reference population was a mixed group of multiple breeds or lines. Considering both imputation accuracy and running time, Beagle 4.1 and FImpute are excellent choices among the four software packages tested. This work visually presents the impacts of these influencing factors on imputation and provides a reference for formulating reasonable imputation strategies in actual pig breeding.
Based on the two-dimensional relation table, this paper studies the missing values in the sample land price data of Shunde District, Foshan City. GeoDa software was used to eliminate insignificant factors by stepwise regression analysis; NORM software was adopted to construct the multiple imputation models; and the EM algorithm and the augmentation algorithm were applied to fit multiple linear regression equations and construct five different filled datasets. Statistical analysis was performed on the imputed datasets to calculate the mean and variance of each, and weights were determined according to the differences. Finally, the datasets were integrated to obtain the imputed expression of the missing values. Three missing-data cases were examined: the PRICE variable missing with a deletion rate of 5%, the PRICE variable missing with a deletion rate of 10%, and both the PRICE and CBD variables missing. In these cases, the ratio of estimates closer to the true value under the new method versus the traditional multiple imputation method was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. Therefore, the new method is clearly better than the traditional multiple imputation methods, and the missing values estimated by the new method have reference value.
How many imputations are sufficient in multiple imputation? The answers given by different researchers vary from as few as 2 to 3 to as many as hundreds. Perhaps no single number of imputations fits all situations. In this study, η, the minimally sufficient number of imputations, was determined from the relationship between m, the number of imputations, and ω, the standard error of imputation variances, using the 2012 National Ambulatory Medical Care Survey (NAMCS) Physician Workflow mail survey. Five variables with various value ranges, variances, and missing data percentages were tested. For all variables tested, ω decreased as m increased. The value of m above which the cost of a further increase in m would outweigh the benefit of reducing ω was taken as η. This method has the potential to be used by anyone to determine the η that fits his or her own data situation.
Multiple imputation compensates for missing data by producing multiple datasets from a regression model and is considered a solution to the long-standing problems of univariate imputation. Univariate imputation fills in data using only the specific column in which the data cell is missing. Multivariate imputation works simultaneously with all variables in all columns, whether missing or observed, and has emerged as a principal method of solving missing data problems. All incomplete datasets analyzed before Multiple Imputation by Chained Equations (MICE) was introduced were misdiagnosed; the results obtained were invalid and should not be relied on to draw reasonable conclusions. This article highlights why multiple imputation is needed and how MICE works, with a particular focus on a cyber-security dataset. Removing missing data from a dataset and replacing it is imperative for analyzing the data and creating prediction models. Therefore, a good imputation technique should recover the missing values, which involves extracting good features. However, the widely used univariate imputation method does not impute missing values reasonably if the values are too large and may thus lead to bias. Therefore, we aim to propose an alternative imputation method that is efficient and removes potential bias after the missing values are handled.
Background: Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation. Results: We genotyped 450 chickens with a 600 K SNP array and sequenced 24 key individuals by whole-genome re-sequencing. Accuracy of imputation from putative 60 K and 600 K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24 X to 144 X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12 X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy than using randomly selected individuals as a reference population for re-sequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1 X to 12 X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but this pattern did not hold for Beagle, which showed the highest accuracy at six-fold coverage for the scenarios used in this study. Conclusions: We comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. With a fixed sequencing cost, however, an optimal imputation strategy can enhance the performance of WGP and GWAS. An optimal imputation strategy should comprehensively take into consideration the size of the reference population, the imputation algorithm, marker density, the population structure of the target population, and the method used to select key individuals. This work sheds additional light on how to design and execute genotype imputation for livestock populations.
Background: Genotyping by sequencing (GBS) still has problems with missing genotypes. Imputation is important for using GBS in genomic predictions, especially at low depths, because of the large number of missing genotypes. Minor allele frequency (MAF) is widely used as a marker data editing criterion for genomic predictions. In this study, three imputation methods (Beagle, IMPUTE2 and FImpute software) based on four MAF editing criteria were investigated with regard to imputation accuracy of missing genotypes and accuracy of genomic predictions, based on simulated data of a livestock population. Results: Four MAF criteria (no MAF limit, MAF ≥ 0.001, MAF ≥ 0.01 and MAF ≥ 0.03) were used for editing marker data before imputation. Beagle, IMPUTE2 and FImpute software were applied to impute the original GBS. Additionally, IMPUTE2 also imputed the expected genotype dosage after genotype correction (GcIM). The reliability of genomic predictions was calculated using GBS and imputed GBS data. The results showed that imputation accuracies were the same for the three imputation methods, except at a sequencing read depth (depth) of 2, where FImpute had a slightly lower imputation accuracy than Beagle and IMPUTE2. GcIM was the best of all the imputations at depth = 4, 5 and 10, but the worst at depth = 2. For genomic prediction, retaining more SNPs with no MAF limit resulted in higher reliability. As the depth increased to 10, the prediction reliabilities approached those obtained using true genotypes at the GBS loci. Beagle and IMPUTE2 had the largest increases in prediction reliability, 5 percentage points, and FImpute gained 3 percentage points at depth = 2. The best prediction was observed at depth = 4, 5 and 10 using GcIM, but the worst prediction was also observed using GcIM at depth = 2. Conclusions: The current study showed that imputation accuracies were relatively low for GBS with low depths and high for GBS with high depths. Imputation resulted in larger gains in the reliability of genomic predictions for GBS with lower depths. These results suggest applying IMPUTE2 based on corrected GBS (GcIM) to improve genomic predictions at higher depths, while FImpute software could be a good alternative for routine imputation.
The problem of missing values has long been studied by researchers working in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis, while others have tried to fill the gaps with plausible values. While the former is simple, the latter guards against information loss. For that purpose, a neighbour-based (KNN) approach has proven more effective than other global estimators. The paper extends this further by introducing a new summarization method to the KNN model. It is the first study that applies the concept of the ordered weighted averaging (OWA) operator to such a problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using different ratios of missing values from 1% to 20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. For the missing rates of 5% and 20%, the best NRMSE scores, averaged across datasets, are 0.65 and 0.69, while the corresponding best scores obtained by the existing techniques included in this study are 0.80 and 0.84, respectively.
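A minimal sketch of how an OWA operator could summarize the values of the K nearest neighbours is given below. The OWA step follows the standard definition (sort the values in descending order, then take a weighted sum with weights that sum to 1). How the paper derives its weights and selects neighbours is not stated in the abstract, so the weight vectors, the Euclidean distance, and the function names here are assumptions.

```python
import math


def owa(values, weights):
    """Ordered weighted average: weights are applied to the values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def knn_owa_impute(target, complete_rows, missing_idx, k, weights):
    """Impute target[missing_idx] by OWA over the k nearest complete rows."""
    def distance(row):
        # Euclidean distance over every column except the missing one.
        return math.sqrt(sum((a - b) ** 2
                             for i, (a, b) in enumerate(zip(target, row))
                             if i != missing_idx))
    neighbours = sorted(complete_rows, key=distance)[:k]
    return owa([row[missing_idx] for row in neighbours], weights)


# Toy data (hypothetical): column 1 of the target row is missing.
target = [1.0, None, 2.0]
rows = [
    [1.0, 5.0, 2.0],
    [1.1, 7.0, 2.1],
    [9.0, 100.0, 9.0],
    [0.9, 6.0, 1.9],
]
uniform = knn_owa_impute(target, rows, missing_idx=1, k=3, weights=[1 / 3] * 3)
top_heavy = knn_owa_impute(target, rows, missing_idx=1, k=3, weights=[0.5, 0.3, 0.2])
```

With uniform weights the OWA reduces to the plain KNN mean; non-uniform weights shift the estimate toward the larger or smaller neighbour values.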
In analyzing data from clinical trials and longitudinal studies, missing values are always a fundamental challenge because they can introduce bias and lead to erroneous statistical inferences. To deal with this challenge, several methods have been developed in the literature to handle missing values; the most commonly used are the complete case method, the mean imputation method, the last observation carried forward (LOCF) method, and the multiple imputation (MI) method. In this paper, we conduct a simulation study to investigate the efficiency of these four typical methods in a longitudinal data setting under missing completely at random (MCAR). We consider three levels of missingness: 5%, 30%, and 50%. From this simulation study, we conclude that the LOCF method has more bias than the other three methods in most situations. The MI method has the least bias and the best coverage probability. Thus, we conclude that the MI method is the most effective imputation method in our MCAR simulation study.
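Two of the single-imputation methods named in this abstract, LOCF and mean imputation, are simple enough to sketch directly. The snippet below is an illustration on one subject's repeated measurements, with missing visits encoded as `None`; the function names are hypothetical.

```python
def locf(series):
    """Last observation carried forward; leading gaps stay None because nothing precedes them."""
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def mean_impute(series):
    """Replace every missing value by the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


# Toy longitudinal measurements for one subject (hypothetical).
visits = [1.0, None, None, 4.0, None]
locf_filled = locf(visits)
mean_filled = mean_impute([1.0, None, 3.0])
```

LOCF freezes a trajectory at its last observed value, which gives one intuition for why it can be biased when the underlying measurement keeps changing over time.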
Time series forecasting has become an important aspect of data analysis and has many real-world applications. However, missing values are often encountered, and they may adversely affect many forecasting tasks. In this study, we evaluate and compare the effects of imputation methods for estimating missing values in a time series. Our approach does not use a simulation to generate pseudo-missing data; instead, it performs imputation on actual missing data and measures the performance of the forecasting model built from the result. In the experiment, several time series forecasting models are trained on different training datasets, each prepared using one imputation method. The performance of the imputation methods is then evaluated by comparing the accuracy of the forecasting models. The results obtained from a total of four experimental cases show that the k-nearest neighbor technique is the most effective in reconstructing missing data and contributes positively to time series forecasting compared with other imputation methods.
This research was an effort to select the best imputation method for missing upper air temperature data over 24 standard pressure levels. We implemented four imputation techniques: inverse distance weighting, bilinear, natural, and nearest interpolation. The performance indicators adopted in this research were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient, and coefficient of determination (R²). We randomly withheld 30% of the 324 total samples and predicted them from the remaining 70%. Although all four interpolation methods performed well for imputing air temperature data (producing RMSE and AME below 1), the bilinear method was the most accurate, with the smallest errors. RMSE for the bilinear method remained below 0.01 on all pressure levels except 1000 hPa, where it was 0.6. The AME was low (<0.1) at all pressure levels with bilinear imputation. A very strong correlation (>0.99) was found between actual and predicted air temperature data with this method. The high coefficient of determination (0.99) obtained with bilinear interpolation indicates the best fit to the surface. We also found similar results for imputation with the natural interpolation method, but scatter plots for each month showed that imputations with this method appear slightly less precise in certain months than those from the bilinear method.
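Of the four techniques compared in this abstract, inverse distance weighting is the easiest to write down. The sketch below assumes planar coordinates and a distance exponent of 2; neither choice is specified in the abstract, and the function name and sample points are hypothetical.

```python
import math


def idw(points, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from (x, y, value) samples."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in points:
        d = math.hypot(x - query[0], y - query[1])
        if d == 0.0:
            # The query coincides with a sample point: return its value directly.
            return value
        weight = 1.0 / d ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Toy samples (hypothetical): two known temperatures on a line.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_estimate = idw(samples, (1.0, 0.0))
at_sample = idw(samples, (0.0, 0.0))
```

A larger `power` makes the estimate depend more strongly on the closest samples, so the exponent is a tuning choice rather than a fixed part of the method.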
Databases for machine learning and data mining often have missing values. Developing effective methods for missing value imputation is a crucially important problem in machine learning and data mining. In this paper, several methods for dealing with missing values in incomplete data are reviewed, and a new method for missing value imputation based on iterative learning is proposed. The proposed method rests on a basic assumption: there exist cause-effect connections among condition attribute values, and the missing values can be induced from known values. In the imputation process, some of the missing values are filled in first and converted to known values, which are then used in the next imputation step. The iterative learning process continues until the incomplete dataset is entirely converted to a complete one. The paper also presents an example to illustrate the framework of iterative learning for missing value imputation.
Background: A novel approach to modelling individual tree growth dynamics is proposed. The approach combines multiple imputation and copula sampling to produce a stochastic individual tree growth and yield projection system. Methods: The Nova Scotia, Canada permanent sample plot network is used as a case study to develop and test the modelling approach. Predictions from this model are compared to predictions from the Acadian variant of the Forest Vegetation Simulator, a widely used statistical individual tree growth and yield model. Results: Diameter and height growth rates were predicted with error rates consistent with those produced using statistical models. Mortality and ingrowth error rates were higher than those observed for diameter and height, but also were within the bounds produced by traditional approaches for predicting these rates. Ingrowth species composition was very poorly predicted. The model was capable of reproducing a wide range of stand dynamic trajectories and in some cases reproduced trajectories that the statistical model was incapable of reproducing. Conclusions: The model has potential to be used as a benchmarking tool for evaluating statistical and process models and may provide a mechanism to separate signal from noise and improve our ability to analyze and learn from large regional datasets that often have underlying flaws in sample design.
Suppose that there are two populations x and y with missing data on both of them, where x has an unknown distribution function F(·) and y has a distribution function G_θ(·) with a probability density function g_θ(·) of known form depending on some unknown parameter θ. Fractional imputation is used to fill in the missing data. The asymptotic distributions of the semi-empirical likelihood ratio statistic are obtained under some mild conditions. Then, empirical likelihood confidence intervals for the differences between x and y are constructed.
Individual tree detection (ITD) and the area-based approach (ABA) are combined to generate tree-lists using airborne LiDAR data. ITD based on the Canopy Height Model (CHM) was applied for overstory trees, while ABA based on nearest neighbor (NN) imputation was applied for understory trees. Our approach is intended to compensate for the weakness of LiDAR data and ITD in estimating understory trees while keeping the strength of ITD in estimating overstory trees at the tree level. We investigated the effects of three parameters on the performance of our proposed approach: smoothing of the CHM, resolution of the CHM, and height cutoff (a specific height that classifies trees into overstory and understory). No single combination of those parameters produced the best performance for estimating stems per hectare, mean tree height, basal area, diameter distribution and height distribution. The trees in the lowest LiDAR height class yielded the largest relative bias and relative root mean squared error. Although ITD and ABA showed limited explanatory power for estimating stems per hectare and basal area, improvements could come from using LiDAR data with higher density, applying better algorithms for ITD, and decreasing distortion of the structure of LiDAR data. Automating the procedure of finding optimal combinations of those parameters is essential to expedite forest management decisions across forest landscapes using remote sensing data.
The atmospheric relative humidity data captured by the AQUA satellite contain missing matrices. To fill such missing values, four popular imputation techniques were tested: Bilinear, Inverse Distance Weighting, Natural Neighbor and Nearest interpolation. Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (R2) and Correlation Coefficient (Corr) were used to check the accuracy of these interpolations. Inverse Distance Weighting and Nearest interpolation proved unsuitable. Natural interpolation gave more accurate results than these two interpolations. Missing values of relative humidity were accurately refilled with Bilinear Interpolation. This interpolation produced an RMSE of ±0.543 for relative humidity at 100, 150, 200, 250, 300, 400 and 500 hPa, while at 600, 700, 850 and 925 hPa the RMSE remained near 1. A perfect fit to the surface and a very strong correlation (near 0.99) were found between actual and imputed relative humidity data with Bilinear Interpolation. Therefore, it was concluded that Bilinear Interpolation is the most accurate and best imputation method for missing values of relative humidity from the 100 to 1000 hPa levels.
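Bilinear interpolation, the method this abstract finds most accurate, estimates a missing grid value from the four surrounding corner values. The sketch below uses the textbook formula on a single rectangular cell; the argument names and the toy corner values are hypothetical and do not come from the study.

```python
def bilinear(x, y, x0, x1, y0, y1, q00, q10, q01, q11):
    """Bilinear interpolation inside the cell [x0, x1] x [y0, y1].

    q00 is the value at (x0, y0), q10 at (x1, y0), q01 at (x0, y1), q11 at (x1, y1).
    """
    tx = (x - x0) / (x1 - x0)
    ty = (y - y0) / (y1 - y0)
    # Interpolate along x on the lower and upper edges, then along y between them.
    lower = q00 * (1 - tx) + q10 * tx
    upper = q01 * (1 - tx) + q11 * tx
    return lower * (1 - ty) + upper * ty


# Toy cell (hypothetical): unit square with corner values 0, 10, 20, 30.
centre = bilinear(0.5, 0.5, 0.0, 1.0, 0.0, 1.0, 0.0, 10.0, 20.0, 30.0)
corner = bilinear(1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 10.0, 20.0, 30.0)
```

At the centre of the cell the result is the average of the four corners, and at a corner it reproduces that corner's value exactly.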
In the current study, an attempt is made to fill missing geopotential height data over Pakistan and to identify the optimum interpolation method. Over the last thirteen years, geopotential height values were missing over Pakistan, and interpolation techniques were used to fill these gaps. The techniques included Bilinear interpolation [BI], Nearest Neighbor [NN], Natural [NI] and Inverse Distance Weighting [IDW]. These imputations were judged on performance parameters that include Root Mean Square Error [RMSE], Mean Absolute Error [MAE], Correlation Coefficient [Corr] and Coefficient of Determination [R2]. The NN and IDW imputations were not precise or accurate. The Natural and Bilinear interpolations fitted the data set closely. A good correlation was found for Natural interpolation imputations, which fit the surface of geopotential height very well. The maximum and minimum RMSE values were ±5.10 m and ±2.28 m, respectively, while the mean absolute error was near 1. The validation of imputation revealed that NI produced more accurate results than BI. It can be concluded that Natural interpolation was the best-suited technique for filling missing geopotential height data from the AQUA satellite.
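The four performance parameters listed in this abstract can be computed as in the sketch below. It uses the usual definitions; in particular, R2 is taken here as 1 minus the ratio of residual to total sum of squares, which is an assumption because the abstract does not say which definition was used. The function names and toy values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def pearson_corr(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy validation set (hypothetical): withheld values and their imputations.
actual = [1.0, 2.0, 3.0, 4.0]
imputed = [1.1, 1.9, 3.2, 3.8]
scores = {
    "MAE": mae(actual, imputed),
    "RMSE": rmse(actual, imputed),
    "Corr": pearson_corr(actual, imputed),
    "R2": r_squared(actual, imputed),
}
```

Comparing interpolation methods then amounts to computing these scores on the same withheld values for each method.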
Non-responses leading to missing data are common in most studies and cause inefficient and biased statistical inferences if ignored. When faced with missing data, many studies employ complete case analysis to estimate the parameters of the model. This, however, compromises the low bias and minimum variance expected of the estimates. Several classical and model-based techniques for imputing missing values have been described in the literature. The Bayesian approach to missingness is deemed superior to the other techniques because it lends itself naturally to missing data settings, where the missing values are treated as unobserved random variables whose distribution depends on the observed data. This paper examines the superiority of Bayesian imputation over Multiple Imputation with Chained Equations (MICE) when estimating logistic panel data models with single fixed effects. The study validates the superiority of conditional maximum likelihood estimates for the nonlinear binary choice logit panel model in the presence of missing observations. A Monte Carlo simulation was designed to determine the magnitude of bias and root mean square errors (RMSE) arising from MICE and full Bayesian imputation. The simulation results show that the conditional maximum likelihood (ML) logit estimator presented in this paper is less biased and more efficient when Bayesian imputation is performed to handle non-responses.
Funding: supported by Graduate Funded Project (No. JY2022A017).
Abstract: The frequent missing values in radar-derived time-series tracks of aerial targets (RTT-AT) lead to significant challenges in subsequent data-driven tasks. However, the majority of imputation research focuses on random missing (RM) patterns, which differ significantly from the common missing patterns of RTT-AT. Methods designed for RM may suffer performance degradation or failure when applied to RTT-AT imputation. Conventional autoregressive deep learning methods are prone to error accumulation and long-term dependency loss. In this paper, a non-autoregressive imputation model is proposed that addresses missing value imputation for two common missing patterns in RTT-AT. Our model consists of two probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. It learns missing values by combining the representations output by the two units, aiming to minimize the difference between the imputed values and their actual values. The PSDMSA units effectively capture temporal dependencies and attribute correlations between time steps, improving imputation quality. The weight fusion unit automatically updates the weights of the output representations from the two units to obtain a more accurate final representation. The experimental results indicate that, despite varying missing rates in the two missing patterns, our model consistently outperforms other methods in imputation performance and exhibits a low frequency of deviations in estimates for specific missing entries. Compared to the state-of-the-art autoregressive deep learning imputation model Bidirectional Recurrent Imputation for Time Series (BRITS), our proposed model reduces mean absolute error (MAE) by 31% to 50%. Additionally, the model trains 4 to 8 times faster than both BRITS and a standard Transformer model trained on the same dataset. Finally, the ablation experiments demonstrate that the PSDMSA, the weight fusion unit, the cascade network design, and the imputation loss enhance imputation performance and confirm the efficacy of our design.
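As a rough illustration of the diagonal masking idea mentioned in the abstract above, the sketch below computes scaled dot-product self-attention in which each time step is barred from attending to itself, so its value must be estimated from the other steps. This is only a minimal sketch under stated assumptions: the function name `diagonal_masked_attention` is hypothetical, the probabilistic sparse component of PSDMSA and the weight fusion unit are not reproduced, and at least two time steps are assumed.

```python
import math


def diagonal_masked_attention(queries, keys, values):
    """Scaled dot-product attention where step t cannot attend to itself.

    Hypothetical helper for illustration; assumes at least two time steps.
    Returns (outputs, weights).
    """
    n = len(queries)
    d = len(queries[0])
    outputs, all_weights = [], []
    for t in range(n):
        # Scores against every step, with the diagonal entry masked to -inf.
        scores = [
            -math.inf if s == t
            else sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(d)
            for s in range(n)
        ]
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]  # masked entry becomes 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        # Weighted sum of value vectors from the other time steps.
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = diagonal_masked_attention(x, x, x)
```

Because the diagonal weight is zero, a step's own (possibly missing) input cannot leak into its reconstruction, which is the motivation usually given for this kind of mask.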
Abstract: Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, that is, data that sums to a constant such as 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique applied in various settings to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested as a solution for situations involving missing data. The EM algorithm iteratively finds the best estimates of parameters in statistical models that depend on variables or data that have not been observed. These are maximum likelihood or maximum a posteriori (MAP) estimates. Using the present estimate as input, the expectation (E) step constructs the expected log-likelihood function. Finding the parameters that maximize the expected log-likelihood, as determined in the E step, is the job of the maximization (M) step. This study examined how well the EM algorithm worked on a simulated compositional dataset with missing observations. It used both robust least squares and ordinary least squares regression techniques. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-Nearest Neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
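The Aitchison distance used as an evaluation measure above can be computed as the Euclidean distance between centred log-ratio (clr) transformed compositions. The following is a minimal sketch, assuming strictly positive parts; the function names `clr` and `aitchison_distance` are illustrative and not taken from the study.

```python
import math


def clr(composition):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# The distance is unchanged when a composition is rescaled (e.g. 1 vs 100%).
d_fraction = aitchison_distance([0.2, 0.3, 0.5], [0.1, 0.6, 0.3])
d_percent = aitchison_distance([20.0, 30.0, 50.0], [10.0, 60.0, 30.0])
```

The scale invariance shown in the usage lines is the reason this distance is preferred over the plain Euclidean distance when comparing true and imputed compositions.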
Funding: Supported by the China Agriculture Research System of MOF and MARA (CARS-35), the National Natural Science Foundation of China (32072696, 31790414 and 31601916), and the Fundamental Research Funds for the Central Universities (2662019PY011).
Abstract: Genotype imputation has become an indispensable part of genomic data analysis. In recent years, imputation based on a multi-breed reference population has received more attention, but relevant studies are scarce in pigs. In this study, we used the Illumina Porcine SNP50 Bead Chip to investigate how imputation accuracy varies with several influencing factors and compared the imputation performance of four commonly used imputation software programs. The results indicated that imputation accuracy increased as the validation population marker density, the reference population sample size, or the minor allele frequency (MAF) increased. However, imputation accuracy decreased to some extent when the pig reference population was a mixed group of multiple breeds or lines. Considering both imputation accuracy and running time, Beagle 4.1 and FImpute are excellent choices among the four software packages tested. This work visually presents the impacts of these influencing factors on imputation and provides a reference for formulating reasonable imputation strategies in actual pig breeding.
Funding: This research was financially supported by FDCT No. 005/2018/A1, the Guangdong Provincial Innovation and Entrepreneurship Training Program (Project No. 201713719017), and the College Students Innovation Training Program held by Guangdong University of Science and Technology (Nos. 1711034, 1711080, and 1711088).
Abstract: Based on the two-dimensional relation table, this paper studies the missing values in the sample data of land prices in Shunde District of Foshan City. GeoDa software was used to eliminate insignificant factors by stepwise regression analysis; NORM software was adopted to construct the multiple imputation models; and the EM algorithm and the augmentation algorithm were applied to fit multiple linear regression equations to construct five different filled datasets. Statistical analysis was performed on the imputed datasets to calculate the mean and variance of each dataset, and the weights were determined according to the differences. Finally, comprehensive integration was implemented to obtain the imputation expression for the missing values. Three missing-data cases were examined: the PRICE variable missing at a deletion rate of 5%, the PRICE variable missing at a deletion rate of 10%, and both the PRICE and CBD variables missing. In these cases, the ratio of estimates closer to the true value for the new method versus the traditional multiple imputation method was 75% to 25%, 62.5% to 37.5%, and 100% to 0%, respectively. Therefore, the new method is clearly better than the traditional multiple imputation methods, and the missing values estimated by the new method have reference value.
Abstract: How many imputations are sufficient in multiple imputation? The answer given by different researchers varies from as few as 2-3 to as many as hundreds. Perhaps no single number of imputations would fit all situations. In this study, η, the minimally sufficient number of imputations, was determined based on the relationship between m, the number of imputations, and ω, the standard error of imputation variances, using the 2012 National Ambulatory Medical Care Survey (NAMCS) Physician Workflow mail survey. Five variables of various value ranges, variances, and missing data percentages were tested. For all variables tested, ω decreased as m increased. The m value above which the cost of a further increase in m would outweigh the benefit of reducing ω was recognized as η. This method has the potential to be used by anyone to determine the η that fits his or her own data situation.
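To make the role of m concrete, the sketch below pools estimates from m imputed datasets using Rubin's rules, the standard way of combining multiple-imputation results. It is an illustrative assumption rather than the study's procedure: the statistic ω defined in the abstract is not computed here, and the name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates: the point estimate from each imputed dataset.
    within_variances: the squared standard error from each imputed dataset.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)   # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The factor (1 + 1/m) shrinks toward 1 as m grows, which is one way to see why additional imputations yield diminishing returns.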
Abstract: Multiple imputation compensates for missing data by producing multiple datasets from regression models and is considered a solution to the long-standing problems of univariate imputation. Univariate imputation imputes data only from the specific column in which the data cell is missing. Multivariate imputation works simultaneously with all variables in all columns, whether missing or observed. It has emerged as a principal method of solving missing data problems. All incomplete datasets analyzed before Multiple Imputation by Chained Equations (MICE) was presented were misdiagnosed; the results obtained were invalid and should not be relied on to yield reasonable conclusions. This article highlights why multiple imputation is needed and how MICE works, with a particular focus on a cyber-security dataset. Removing missing data from a dataset and replacing it is imperative in analyzing the data and creating prediction models. Therefore, a good imputation technique should recover the missingness, which involves extracting the good features. However, the widely used univariate imputation method does not impute missingness reasonably if the values are too large and may thus lead to bias. Therefore, we aim to propose an alternative imputation method that is efficient and removes potential bias after removing the missingness.
Funding: Supported by the National Natural Science Foundation of China (31772556), the China Agricultural Research System (CARS-41-G03), the Science Innovation Project of Guangdong (2015A020209159), the Special Program for Applied Research on Super Computation of the NSFC Guangdong Joint Fund (the second phase) under Grant No. U1501501, and technical support from the National Supercomputer Center in Guangzhou.
Abstract: Background: Genome-wide association studies and genomic predictions are thought to be optimized by using whole-genome sequence (WGS) data. However, sequencing thousands of individuals of interest is expensive. Imputation from SNP panels to WGS data is an attractive and less expensive approach to obtain WGS data. The aims of this study were to investigate the accuracy of imputation and to provide insight into the design and execution of genotype imputation. Results: We genotyped 450 chickens with a 600 K SNP array and sequenced 24 key individuals by whole-genome re-sequencing. Accuracy of imputation from putative 60 K and 600 K array data to WGS data was 0.620 and 0.812 for Beagle, and 0.810 and 0.914 for FImpute, respectively. By increasing the sequencing cost from 24 X to 144 X, the imputation accuracy increased from 0.525 to 0.698 for Beagle and from 0.654 to 0.823 for FImpute. With fixed sequence depth (12 X), increasing the number of sequenced animals from 1 to 24 improved accuracy from 0.421 to 0.897 for FImpute and from 0.396 to 0.777 for Beagle. Using optimally selected key individuals resulted in a higher imputation accuracy compared with using randomly selected individuals as a reference population for re-sequencing. With fixed reference population size (24), imputation accuracy increased from 0.654 to 0.875 for FImpute and from 0.512 to 0.762 for Beagle as the sequencing depth increased from 1 X to 12 X. With a given total cost of genotyping, accuracy increased with the size of the reference population for FImpute, but the pattern did not hold for Beagle, which showed the highest accuracy at sixfold coverage for the scenarios used in this study. Conclusions: We comprehensively investigated the impacts of several key factors on genotype imputation. Generally, increasing sequencing cost gave a higher imputation accuracy. However, with a fixed sequencing cost, an optimal imputation strategy enhances the performance of WGP and GWAS. An optimal imputation strategy should comprehensively take into consideration the size of the reference population, the imputation algorithm, marker density, the population structure of the target population, and the method used to select key individuals. This work sheds additional light on how to design and execute genotype imputation for livestock populations.
Funding: This study was funded by the Genomic Selection in Animals and Plants (GenSAP) research project financed by the Danish Council of Strategic Research (Aarhus, Denmark). Xiao Wang received Ph.D. stipends from the Technical University of Denmark (DTU Bioinformatics and DTU Compute), Denmark, and the China Scholarship Council, China.
Abstract: Background: Genotyping by sequencing (GBS) still has problems with missing genotypes. Imputation is important for using GBS for genomic predictions, especially at low depths, due to the large number of missing genotypes. Minor allele frequency (MAF) is widely used as a marker data editing criterion for genomic predictions. In this study, three imputation methods (Beagle, IMPUTE2 and FImpute software) based on four MAF editing criteria were investigated with regard to imputation accuracy of missing genotypes and accuracy of genomic predictions, based on simulated data of a livestock population. Results: Four MAFs (no MAF limit, MAF ≥ 0.001, MAF ≥ 0.01 and MAF ≥ 0.03) were used for editing marker data before imputation. Beagle, IMPUTE2 and FImpute software were applied to impute the original GBS. Additionally, IMPUTE2 also imputed the expected genotype dosage after genotype correction (GcIM). The reliability of genomic predictions was calculated using GBS and imputed GBS data. The results showed that imputation accuracies were the same for the three imputation methods, except for the data with sequencing read depth (depth) = 2, where FImpute had a slightly lower imputation accuracy than Beagle and IMPUTE2. GcIM was observed to be the best for all of the imputations at depth = 4, 5 and 10, but the worst for depth = 2. For genomic prediction, retaining more SNPs with no MAF limit resulted in higher reliability. As the depth increased to 10, the prediction reliabilities approached those using true genotypes in the GBS loci. Beagle and IMPUTE2 had the largest increases in prediction reliability of 5 percentage points, and FImpute gained 3 percentage points at depth = 2. The best prediction was observed at depth = 4, 5 and 10 using GcIM, but the worst prediction was also observed using GcIM at depth = 2. Conclusions: The current study showed that imputation accuracies were relatively low for GBS with low depths and high for GBS with high depths. Imputation resulted in larger gains in the reliability of genomic predictions for GBS with lower depths. These results suggest applying IMPUTE2 based on a corrected GBS (GcIM) to improve genomic predictions at higher depths, and that FImpute software could be a good alternative for routine imputation.
Funding: This work is funded by the Newton Institutional Links 2020-21 project 623718881, jointly by the British Council and the National Research Council of Thailand (www.britishcouncil.org). The corresponding author is the project PI.
Abstract: The problem of missing values has long been studied by researchers working in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many attempts show improvements made by excluding samples with missing information from the analysis process, while others have tried to fill the gaps with possible values. While the former is simple, the latter safeguards against information loss. For that, a neighbour-based (KNN) approach has proven more effective than other global estimators. The paper extends this further by introducing a new summarization method to the KNN model. It is the first study that applies the concept of the ordered weighted averaging (OWA) operator to such a problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using different ratios of missing values from 1% to 20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. Specific to the missing rates of 5% and 20%, the best NRMSE scores, averaged across datasets, are 0.65 and 0.69, while the corresponding scores obtained by the existing techniques included in this study are 0.80 and 0.84, respectively.
Abstract: In analyzing data from clinical trials and longitudinal studies, the issue of missing values is always a fundamental challenge, since missing data can introduce bias and lead to erroneous statistical inferences. To deal with this challenge, several methods have been developed in the literature to handle missing values, of which the most commonly used are the complete case method, the mean imputation method, the last observation carried forward (LOCF) method, and the multiple imputation (MI) method. In this paper, we conduct a simulation study to investigate the efficiency of these four typical methods in a longitudinal data setting under missing completely at random (MCAR). We categorize missingness into three cases, from a lower percentage of 5% to higher percentages of 30% and 50% missingness. From this simulation study, we conclude that the LOCF method has more bias than the other three methods in most situations. The MI method has the least bias with the best coverage probability. Thus, we conclude that the MI method is the most effective imputation method in our MCAR simulation study.
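Two of the single-imputation methods compared above are simple enough to state in a few lines. The sketch below is illustrative only: it assumes a series stored as a Python list with `None` marking missing entries, and the function names `mean_impute` and `locf_impute` are hypothetical rather than taken from the study.

```python
def mean_impute(series):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [value for value in series if value is not None]
    fill = sum(observed) / len(observed)
    return [fill if value is None else value for value in series]


def locf_impute(series):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


visits = [1.0, None, 3.0, None, None, 5.0]
print(mean_impute(visits))  # [1.0, 3.0, 3.0, 3.0, 3.0, 5.0]
print(locf_impute(visits))  # [1.0, 1.0, 3.0, 3.0, 3.0, 5.0]
```

LOCF freezes a subject at the last observed value, so any trend over time is flattened, which is consistent with the larger bias reported for it in the abstract.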
Funding: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant Number 2020R1A6A1A03040583).
Abstract: Time series forecasting has become an important aspect of data analysis and has many real-world applications. However, undesirable missing values are often encountered, which may adversely affect many forecasting tasks. In this study, we evaluate and compare the effects of imputation methods for estimating missing values in a time series. Our approach does not include a simulation to generate pseudo-missing data, but instead performs imputation on actual missing data and measures the performance of the forecasting model created from it. In an experiment, therefore, several time series forecasting models are trained using different training datasets, each prepared with a different imputation method. Subsequently, the performance of the imputation methods is evaluated by comparing the accuracy of the forecasting models. The results obtained from a total of four experimental cases show that the k-nearest neighbor technique is the most effective in reconstructing missing data and contributes positively to time series forecasting compared with other imputation methods.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 30800776), the State High-Tech Development Plan of China (Grant No. 2008AA101002), and the Recommend International Advanced Agricultural Science and Technology Plan of China (Grant No. 2011-G2A).
Abstract: This research was an effort to select the best imputation method for missing upper air temperature data over 24 standard pressure levels. We implemented four imputation techniques: inverse distance weighting, bilinear, natural and nearest interpolation. The performance indicators adopted for these techniques were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient and coefficient of determination (R²). We randomly withheld 30% of the 324 total samples and predicted them from the remaining 70%. Although all four interpolation methods seem good for imputing air temperature data (producing RMSE and AME below 1), the bilinear method was the most accurate, with the least error. RMSE for the bilinear method remained below 0.01 at all pressure levels except 1000 hPa, where it was 0.6. AME was low (below 0.1) at all pressure levels with bilinear imputation. A very strong correlation (above 0.99) was found between actual and predicted air temperature data with this method. The high coefficient of determination (0.99) for the bilinear interpolation method indicates the best fit to the surface. We also found similar results for imputation with the natural interpolation method, but after investigating scatter plots for each month, imputations with this method appear slightly less precise in certain months than those of the bilinear method.
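Of the four techniques named above, inverse distance weighting is the simplest to write down: the estimate at a target location is a weighted average of the known values, with weights proportional to 1/distance^p. The sketch below is a minimal illustration under assumptions not stated in the abstract, namely planar coordinates, Euclidean distance, and a power of 2; the name `idw_interpolate` is hypothetical.

```python
import math


def idw_interpolate(known_points, target, power=2.0):
    """Inverse distance weighting estimate at `target`.

    known_points: list of ((x, y), value) pairs with observed values.
    target: (x, y) location whose value is missing.
    """
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in known_points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with an observed point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


stations = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint_estimate = idw_interpolate(stations, (1.0, 0.0))  # 15.0
```

Raising `power` makes the estimate follow the nearest observation more closely, so the method approaches nearest interpolation as the power grows.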
Abstract: Databases for machine learning and data mining often have missing values. How to develop an effective method for missing value imputation is a crucially important problem in the field of machine learning and data mining. In this paper, several methods for dealing with missing values in incomplete data are reviewed, and a new method for missing value imputation based on iterative learning is proposed. The proposed method rests on a basic assumption: there exist cause-effect connections among condition attribute values, and the missing values can be induced from known values. In the process of missing value imputation, some of the missing values are filled in first and converted to known values, which are then used in the next step of imputation. The iterative learning process continues until the incomplete dataset is entirely converted to a complete dataset. The paper also presents an example to illustrate the framework of iterative learning for missing value imputation.
Abstract: Background: A novel approach to modelling individual tree growth dynamics is proposed. The approach combines multiple imputation and copula sampling to produce a stochastic individual tree growth and yield projection system. Methods: The Nova Scotia, Canada permanent sample plot network is used as a case study to develop and test the modelling approach. Predictions from this model are compared to predictions from the Acadian variant of the Forest Vegetation Simulator, a widely used statistical individual tree growth and yield model. Results: Diameter and height growth rates were predicted with error rates consistent with those produced using statistical models. Mortality and ingrowth error rates were higher than those observed for diameter and height, but also were within the bounds produced by traditional approaches for predicting these rates. Ingrowth species composition was very poorly predicted. The model was capable of reproducing a wide range of stand dynamic trajectories and in some cases reproduced trajectories that the statistical model was incapable of reproducing. Conclusions: The model has potential to be used as a benchmarking tool for evaluating statistical and process models and may provide a mechanism to separate signal from noise and improve our ability to analyze and learn from large regional datasets that often have underlying flaws in sample design.
Funding: The NSF (10661003) of China; SRF for ROCS, SEM ([2004]527); the NSF (0728092) of Guangxi; Innovation Project of Guangxi Graduate Education ([2006]40).
Abstract: Suppose that there are two populations x and y with missing data on both of them, where x has a distribution function F(·) which is unknown and y has a distribution function Gθ(·) with a probability density function gθ(·) of known form depending on some unknown parameter θ. Fractional imputation is used to fill in missing data. The asymptotic distributions of the semi-empirical likelihood ratio statistic are obtained under some mild conditions. Then, empirical likelihood confidence intervals on the differences of x and y are constructed.
Abstract: Individual tree detection (ITD) and the area-based approach (ABA) are combined to generate tree-lists using airborne LiDAR data. ITD based on the Canopy Height Model (CHM) was applied for overstory trees, while ABA based on nearest neighbor (NN) imputation was applied for understory trees. Our approach is intended to compensate for the weakness of LiDAR data and ITD in estimating understory trees, while keeping the strength of ITD in estimating overstory trees at the tree level. We investigated the effects of three parameters on the performance of our proposed approach: smoothing of CHM, resolution of CHM, and height cutoff (a specific height that classifies trees into overstory and understory). There was no single combination of those parameters that produced the best performance for estimating stems per ha, mean tree height, basal area, diameter distribution and height distribution. The trees in the lowest LiDAR height class yielded the largest relative bias and relative root mean squared error. Although ITD and ABA showed limited explanatory power to estimate stems per hectare and basal area, there could be improvements from methods such as using LiDAR data with higher density, applying better algorithms for ITD and decreasing distortion of the structure of LiDAR data. Automating the procedure of finding optimal combinations of those parameters is essential to expedite forest management decisions across forest landscapes using remote sensing data.
Abstract: The atmospheric relative humidity data captured by the AQUA satellite contain missing values. To fill these missing values, four very popular imputation techniques were tested: Bilinear, Inverse Distance Weighting, Natural Neighbor and Nearest interpolation. Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (R2) and Correlation Coefficient (Corr) were used to check the accuracy of these interpolations. Inverse Distance Weighting and Nearest interpolation proved unsuitable. Natural interpolation gave more accurate results than these two interpolations. Missing values of relative humidity were accurately refilled with Bilinear interpolation. This interpolation produced an RMSE of ±0.543 for relative humidity at 100, 150, 200, 250, 300, 400 and 500 hPa, while at 600, 700, 850 and 925 hPa the RMSE remained near 1. A perfect fit to the surface and a very strong correlation (value near 0.99) were found between actual and imputed relative humidity data with Bilinear interpolation. Therefore, it was concluded that Bilinear interpolation is the most accurate and best imputation method for missing values of relative humidity from the 100 to 1000 hPa levels.
Abstract: In the current study, an attempt is made to fill missing geopotential height data over Pakistan and to identify the optimum interpolation method. Over the last thirteen years, geopotential height values were missing over Pakistan. Interpolation techniques were used to fill these gaps. The techniques included Bilinear interpolation [BI], Nearest Neighbor [NN], Natural [NI] and Inverse Distance Weighting [IDW]. These imputations were judged on the basis of performance parameters, which include Root Mean Square Error [RMSE], Mean Absolute Error [MAE], Correlation Coefficient [Corr] and Coefficient of Determination [R2]. The NN and IDW interpolation imputations were not precise or accurate. The Natural Neighbor and Bilinear interpolations fitted the dataset very closely. A good correlation was found for the Natural Neighbor interpolation imputations, which fit the surface of geopotential height well. The maximum and minimum root mean square error values were ±5.10 m and ±2.28 m, respectively. However, the mean absolute error was near 1. The validation of imputation revealed that NN interpolation produced more accurate results than BI. It can be concluded that Natural interpolation was the best-suited interpolation technique for filling missing geopotential height data from the AQUA satellite.
Abstract: Non-responses leading to missing data are common in most studies and cause inefficient and biased statistical inferences if ignored. When faced with missing data, many studies choose to employ the complete case analysis approach to estimate the parameters of the model. This, however, compromises the expected low bias and minimum variance of the estimates. Several classical and model-based techniques for imputing missing values have been described in the literature. The Bayesian approach to missingness is deemed superior to the other techniques because it lends itself naturally to missing data settings, where the missing values are treated as unobserved random variables whose distribution depends on the observed data. This paper examines the superiority of Bayesian imputation over Multiple Imputation with Chained Equations (MICE) when estimating logistic panel data models with single fixed effects. The study validates the superiority of conditional maximum likelihood estimates for the nonlinear binary choice logit panel model in the presence of missing observations. A Monte Carlo simulation was designed to determine the magnitude of bias and root mean square errors (RMSE) arising from MICE and full Bayesian imputation. The simulation results show that the conditional maximum likelihood (ML) logit estimator presented in this paper is less biased and more efficient when Bayesian imputation is performed to address non-responses.