Funding: Supported by the National Natural Science Foundation of China (No. 61871400) and the Natural Science Foundation of Jiangsu Province of China (No. BK20171401).
Abstract: In wireless sensor networks (WSNs), the performance of related applications is highly dependent on the quality of the collected data. Unfortunately, missing data are almost inevitable during data acquisition and transmission. Existing methods often rely on prior information, such as low-rank characteristics or spatiotemporal correlation, when recovering missing WSN data. In realistic application scenarios, however, it is very difficult to obtain such prior information from incomplete data sets. We therefore aim to recover missing WSN data effectively without depending on prior information. By designing a measurement matrix that captures the positions of the missing data, together with a sparse representation matrix, a compressive sensing (CS) based missing-data recovery model is established. We then design a comparison standard for selecting the best sparse representation basis and introduce the average cross-correlation to examine the rationality of the established model. Furthermore, an improved fast matching pursuit algorithm is proposed to solve the model. Simulation results show that the proposed method can effectively recover missing WSN data.
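As a rough illustration of this recovery pipeline, the sketch below reconstructs a signal that is sparse in a DCT dictionary from a subset of its samples using orthogonal matching pursuit. The DCT basis, the sparsity level, and scikit-learn's OMP solver are illustrative assumptions; the paper selects its own representation basis by a comparison standard and solves with an improved fast matching pursuit variant.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Sparse representation basis: a normalized DCT-II dictionary (illustrative choice).
n = 256
t = np.arange(n)
k = np.arange(n)
Psi = np.cos(np.pi * (2 * t[:, None] + 1) * k[None, :] / (2 * n))
Psi /= np.linalg.norm(Psi, axis=0)

# Synthesize a sensor trace that is exactly 8-sparse in Psi.
rng = np.random.default_rng(0)
coef = np.zeros(n)
coef[rng.choice(n, size=8, replace=False)] = rng.standard_normal(8)
x = Psi @ coef

# Measurement matrix = identity rows at surviving positions; 40% of readings are missing.
observed = np.sort(rng.choice(n, size=int(0.6 * n), replace=False))
y = x[observed]          # incomplete readings actually received
A = Psi[observed, :]     # Phi @ Psi, the CS sensing operator

# Greedy sparse recovery (a stand-in for the paper's improved fast matching pursuit).
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=8, fit_intercept=False).fit(A, y)
x_hat = Psi @ omp.coef_
print("relative recovery error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```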
Abstract: The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently affected by high dimensionality and noise, yet most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy samples under the norm. First, a model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator, and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The results show that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimation method.
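To make the hard-thresholding idea concrete, here is a minimal sketch on complete, noise-free data: keep sample-covariance entries whose magnitude exceeds a threshold and zero out the rest. The threshold rate sqrt(log p / n) (up to an ad hoc constant) is the standard one in the sparse-covariance literature; the paper's estimator additionally corrects the generalized sample covariance for missingness and noise, which this toy omits.

```python
import numpy as np

def hard_threshold_cov(X, lam):
    """Hard-thresholding estimator: zero out off-diagonal sample-covariance
    entries with magnitude below lam; the diagonal is never thresholded."""
    S = np.cov(X, rowvar=False)
    T = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

rng = np.random.default_rng(1)
p, n = 50, 200
X = rng.standard_normal((n, p))            # true covariance = identity (maximally sparse)
lam = 2.0 * np.sqrt(np.log(p) / n)         # rate-driven threshold, constant chosen ad hoc
Sigma_hat = hard_threshold_cov(X, lam)
print("off-diagonal entries kept:", int(np.count_nonzero(Sigma_hat) - p))
```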
Abstract: Missing data presents a significant challenge in statistical analysis and machine learning, often resulting in biased outcomes and diminished efficiency. This comprehensive review investigates various imputation techniques, categorizing them into three primary approaches: deterministic methods, probabilistic models, and machine learning algorithms. Traditional techniques, including mean or mode imputation, regression imputation, and last observation carried forward, are evaluated alongside more contemporary methods such as multiple imputation, expectation-maximization, and deep learning strategies. The strengths and limitations of each approach are outlined. Key considerations for selecting appropriate methods, based on data characteristics and research objectives, are discussed. The importance of evaluating imputation's impact on subsequent analyses is emphasized. This synthesis of recent advancements and best practices provides researchers with a robust framework for effectively handling missing data, thereby improving the reliability of empirical findings across diverse disciplines.
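As a small taste of the deterministic-versus-model-based distinction drawn above, the sketch below compares mean imputation with scikit-learn's IterativeImputer (a regression-based, EM-flavored imputer) on synthetic correlated data with cells removed completely at random; all data and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(2)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
X = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=500)
mask = rng.random(X.shape) < 0.2           # 20% of cells missing completely at random
X_miss = X.copy()
X_miss[mask] = np.nan

X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)    # deterministic baseline
X_iter = IterativeImputer(random_state=0).fit_transform(X_miss)  # regression-based, EM-flavored

for name, Xi in (("mean", X_mean), ("iterative", X_iter)):
    rmse = np.sqrt(np.mean((Xi[mask] - X[mask]) ** 2))
    print(f"{name} imputation RMSE on missing cells: {rmse:.3f}")
```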
Funding: Supported by the National Natural Science Foundation of China (No. 81973705).
Abstract: Background: Missing data frequently occur in clinical studies. With the development of precision medicine, there is increased interest in N-of-1 trials, and Bayesian models are one of the main statistical methods for analyzing their data. This simulation study aimed to compare two statistical methods for handling missing values of quantitative data in Bayesian N-of-1 trials. Methods: Simulated N-of-1 trial data with different coefficients of autocorrelation, effect sizes, and missing ratios were generated with the SAS 9.1 system. The missing values were filled using mean filling and regression filling, respectively, under the different coefficients of autocorrelation, effect sizes, and missing ratios with SPSS 25.0 software. Bayesian models were built to estimate the posterior means with WinBUGS 14 software. Results: When the missing ratio is relatively small, e.g., 5%, missing values have relatively little effect on the results. Therapeutic effects may be underestimated when the coefficient of autocorrelation increases and no filling is used. However, they may be overestimated when mean or regression filling is used, and the results after mean filling are closer to the actual effect than those after regression filling. For a moderate missing ratio, the estimated effect after mean filling is closer to the actual effect than that after regression filling. When a large missing ratio (20%) occurs, missing data can lead to significant underestimation of the effect; in this case, the estimated effect after regression filling is closer to the actual effect than that after mean filling. Conclusion: Missing data can affect the therapeutic effects estimated with Bayesian models in N-of-1 trials. The present study suggests that mean filling can be used when the missing ratio is ≤ 10%; otherwise, regression filling may be preferable.
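The two filling rules compared above are easy to state in code. The sketch below fills a toy outcome series both ways: with the observed mean, and with predictions from a regression fitted on the observed points. Regressing on the time index is an assumption made here for illustration; the study's regression covariates are not specified in the abstract.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
t = np.arange(60)
y = 5.0 + 0.05 * t + rng.standard_normal(60)   # one quantitative N-of-1 outcome series
miss = rng.random(60) < 0.10                    # roughly 10% of observations missing

y_mean = y.copy()
y_mean[miss] = y[~miss].mean()                  # mean filling

reg = LinearRegression().fit(t[~miss, None], y[~miss])
y_reg = y.copy()
y_reg[miss] = reg.predict(t[miss, None])        # regression filling (on the time index)
print("filled values (mean):", np.round(y_mean[miss], 2))
print("filled values (regression):", np.round(y_reg[miss], 2))
```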
Abstract: In wireless sensor networks, missing sensor data are inevitable due to the inherent characteristics of such networks, and this causes many difficulties in various applications. To solve the problem, the missing data should be estimated as accurately as possible. In this paper, a k-nearest neighbor based missing data estimation algorithm is proposed that exploits the temporal and spatial correlation of sensor data. It adopts a linear regression model to describe the spatial correlation of sensor data among different sensor nodes, and utilizes the data from multiple neighbor nodes to estimate the missing data jointly rather than independently, so that a stable and reliable estimation performance can be achieved. Experimental results on two real-world datasets show that the proposed algorithm can estimate the missing data accurately.
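For a quick feel of k-NN-style estimation on correlated sensor streams, the sketch below uses scikit-learn's KNNImputer on five synthetic nodes tracking the same field. Note this is only a loose stand-in: KNNImputer matches similar rows (time steps), whereas the paper fits per-node linear regressions and combines multiple neighbor nodes jointly.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns = five sensor nodes tracking the same field, rows = time steps (synthetic).
rng = np.random.default_rng(4)
base = np.cumsum(rng.standard_normal(300))
X = np.column_stack([base + 0.1 * rng.standard_normal(300) for _ in range(5)])
mask = rng.random(X.shape) < 0.1
X_miss = X.copy()
X_miss[mask] = np.nan

# Each missing reading is filled from the k rows most similar on the observed columns.
X_hat = KNNImputer(n_neighbors=5).fit_transform(X_miss)
print("MAE on missing readings:", np.mean(np.abs(X_hat[mask] - X[mask])))
```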
Funding: Supported by the State Key Program of the National Natural Science Foundation of China (Grant No. 60835004), the Natural Science Foundation of Jiangsu Province, China (Grant No. BK2009727), the Natural Science Foundation of Higher Education Institutions of Jiangsu Province, China (Grant No. 10KJB510004), and the National Natural Science Foundation of China (Grant No. 61075028).
Abstract: On the assumption that random interruptions in the observation process are modeled by a sequence of independent Bernoulli random variables, we first generalize two kinds of nonlinear filtering methods with random interruption failures in the observation, based on the extended Kalman filter (EKF) and the unscented Kalman filter (UKF), shortened here to GEKF and GUKF, respectively. A nonlinear filtering model is then established by using radial basis function neural network (RBFNN) prototypes, with the network weights as the state equation and the output of the RBFNN as the observation equation. Finally, we treat the filtering problem under missing observed data as a special case of nonlinear filtering with random intermittent failures by setting each missing datum to zero, without needing to pre-estimate the missing data, and use the GEKF-based RBFNN and the GUKF-based RBFNN to predict a ground radioactivity time series with missing data. Experimental results demonstrate that the predictions of the GUKF-based RBFNN accord well with the real ground radioactivity time series, while the predictions of the GEKF-based RBFNN diverge.
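A linear sketch of the Bernoulli-interruption idea: a standard Kalman filter that runs its time update every tick but applies the measurement update only when the arrival indicator is 1. The random-walk model and noise levels below are illustrative assumptions, and the paper's GEKF/GUKF handle nonlinear models that this linear toy does not.

```python
import numpy as np

def kalman_with_dropouts(y, gamma, F, H, Q, R, x0, P0):
    """Linear Kalman filter that always runs the time update but applies the
    measurement update only when the Bernoulli indicator gamma[t] is 1."""
    x, P = x0.copy(), P0.copy()
    estimates = []
    for t in range(len(y)):
        x = F @ x                              # predict
        P = F @ P @ F.T + Q
        if gamma[t]:                           # observation arrived: correct
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (y[t] - H @ x)
            P = (np.eye(len(x)) - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)

# Scalar random walk observed through noise, with ~20% of measurements lost.
rng = np.random.default_rng(5)
x_true = np.cumsum(rng.standard_normal(100))
gamma = rng.random(100) > 0.2
y = (x_true + rng.standard_normal(100))[:, None]
est = kalman_with_dropouts(y, gamma, F=np.eye(1), H=np.eye(1), Q=np.eye(1),
                           R=np.eye(1), x0=np.zeros(1), P0=np.eye(1))
print("RMSE vs truth:", np.sqrt(np.mean((est[:, 0] - x_true) ** 2)))
```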
Abstract: This research was an effort to select the best imputation method for missing upper-air temperature data over 24 standard pressure levels. We implemented four imputation techniques: inverse distance weighting and bilinear, natural, and nearest-neighbor interpolation. The performance indicators adopted in this research were the root mean square error (RMSE), absolute mean error (AME), correlation coefficient, and coefficient of determination (R²). We randomly withheld 30% of the 324 total samples and predicted them from the remaining 70%. Although all four interpolation methods seem adequate for imputing air temperature data (producing RMSE and AME below 1), the bilinear method was the most accurate, with the smallest errors. The RMSE for the bilinear method remained below 0.01 at all pressure levels except 1000 hPa, where it was 0.6. The AME remained low (below 0.1) at all pressure levels with bilinear imputation. A very strong correlation (above 0.99) was found between the actual and predicted air temperature data with this method, and the high coefficient of determination (0.99) indicates the best fit to the surface. We found similar results for the natural interpolation method, but after inspecting scatter plots for each month, imputations with this method appeared slightly less accurate in certain months than the bilinear method.
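The hold-out design above is easy to replicate with SciPy's scattered-data interpolation; griddata's "linear" method plays the role of bilinear interpolation on scattered points. The toy temperature field below is an assumption, not the paper's data.

```python
import numpy as np
from scipy.interpolate import griddata

# Toy temperature field over a (station coordinate, pressure level) plane.
rng = np.random.default_rng(6)
pts = rng.random((324, 2))
temp = 15.0 - 60.0 * pts[:, 1] + 2.0 * np.sin(4.0 * pts[:, 0])

# Hold out 30% of the samples and impute them from the remaining 70%.
test = rng.random(324) < 0.3
for method in ("nearest", "linear", "cubic"):
    est = griddata(pts[~test], temp[~test], pts[test], method=method)
    rmse = np.sqrt(np.nanmean((est - temp[test]) ** 2))  # nanmean: points outside the hull
    print(f"{method}: RMSE = {rmse:.4f}")
```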
Funding: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant Number 2020R1A6A1A03040583).
Abstract: Time series forecasting has become an important aspect of data analysis and has many real-world applications. However, undesirable missing values are often encountered, which may adversely affect many forecasting tasks. In this study, we evaluate and compare the effects of imputation methods for estimating missing values in a time series. Our approach does not include a simulation to generate pseudo-missing data; instead, we perform imputation on actual missing data and measure the performance of the forecasting model created from the result. In our experiment, several time series forecasting models are therefore trained on different training datasets, each prepared with a different imputation method. The performance of the imputation methods is then evaluated by comparing the accuracy of the forecasting models. The results obtained from a total of four experimental cases show that the k-nearest neighbor technique is the most effective at reconstructing missing data and contributes positively to time series forecasting compared with the other imputation methods.
Abstract: The effect of missing data on phylogenetic methods is a potentially important issue in our attempts to reconstruct the Tree of Life. If missing data are truly problematic, then it may be unwise to include species in an analysis that lack data for some characters (incomplete taxa) or to include characters that lack data for some species. Given the difficulty of obtaining data from all characters for all taxa (e.g., fossils), missing data might seriously impede efforts to reconstruct a comprehensive phylogeny that includes all species. Fortunately, recent simulations and empirical analyses suggest that missing data cells are not themselves problematic, and that incomplete taxa can be accurately placed as long as the overall number of characters in the analysis is large. However, these studies have so far only been conducted on parsimony, likelihood, and neighbor-joining methods. Although Bayesian phylogenetic methods have become widely used in recent years, the effects of missing data on Bayesian analysis have not been adequately studied. Here, we conduct simulations to test whether Bayesian analyses can accurately place incomplete taxa despite extensive missing data. In agreement with previous studies of other methods, we find that Bayesian analyses can accurately reconstruct the position of highly incomplete taxa (i.e., 95% missing data), as long as the overall number of characters in the analysis is large. These results suggest that highly incomplete taxa can be safely included in many Bayesian phylogenetic analyses.
Funding: Partially supported by the National Natural Science Foundation of China under Grants No. 61571044, No. 61620106002, No. 61473041, No. 11590772, and No. 61640012, and by the Inner Mongolia Natural Science Foundation under Grant No. 2017MS(LH)0602.
Abstract: The quality of a multichannel audio signal may be reduced by missing data, which must be recovered before use. The data sets of multichannel audio can be quite large and have more than two axes of variation, such as channel, frame, and feature. To recover missing audio data, we propose a low-rank tensor completion method that is a higher-order generalization of matrix completion. First, a multichannel audio signal with missing data is modeled by a third-order tensor. Next, tensor completion is formulated as a convex optimization problem by defining the trace norm of the tensor, and an augmented Lagrange multiplier method is used to solve the constrained optimization problem. Finally, the missing data are filled in by alternating iteration with tensor computations. Experiments were conducted to evaluate the method's effectiveness on a 5.1-channel audio signal. The results show that the proposed method outperforms state-of-the-art methods. Moreover, subjective listening tests with MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) indicate that tensor completion yields better audio quality.
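For intuition, here is the matrix special case of trace-norm completion, solved with the classic singular value thresholding (SVT) iteration; the paper generalizes this to a third-order tensor by defining the tensor trace norm and using an augmented Lagrange multiplier scheme. The rank-6 toy matrix and the SVT tuning heuristics (from Cai, Candès, and Shen's SVT paper) are assumptions.

```python
import numpy as np

def svt_complete(M_obs, mask, iters=300):
    """Singular value thresholding for low-rank completion."""
    tau = 5.0 * np.sqrt(M_obs.size)        # shrinkage level (SVT-paper heuristic)
    step = 1.2 / mask.mean()               # step size scaled by the sampling fraction
    Y = np.zeros_like(M_obs)
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt   # nuclear-norm shrinkage
        Y += step * mask * (M_obs - X)                   # re-enforce observed entries
    return X

rng = np.random.default_rng(7)
A = rng.standard_normal((80, 6)) @ rng.standard_normal((6, 120))  # rank-6 toy "audio" matrix
mask = rng.random(A.shape) < 0.5                                  # half the cells observed
A_hat = svt_complete(mask * A, mask.astype(float))
err = np.linalg.norm((A_hat - A)[~mask]) / np.linalg.norm(A[~mask])
print("relative error on missing cells:", err)
```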
Funding: Supported by the State Key Program for Basic Research of China (No. 2007CB816003) and the Open Item of the State Key Laboratory of Numerical Modeling for Atmospheric Sciences and Geophysical Fluid Dynamics of China.
Abstract: A novel interval quartering algorithm (IQA) is proposed to overcome the insufficiency of conventional singular spectrum analysis (SSA) iterative interpolation in selecting parameters, namely the number of principal components and the embedding dimension. Based on the improved SSA iterative interpolation, interpolation tests and a comparative analysis are carried out on daily outgoing longwave radiation data. The results show that the IQA can find globally optimal parameters for an error curve with local oscillations and has the advantage of fast computation. The improved interpolation method is effective for interpolating missing data.
Abstract: Missing data are always an issue in community-based longitudinal studies, calling into question the representativeness of samples and raising the possibility of bias in the conclusions the research generates. This may be due to the difficulty of implementing random sampling procedures in these studies and/or the inherent difficulty of sampling hard-to-reach segments of the population being studied. In fact, the ability to accurately study hard-to-reach populations in light of the potential bias created by missing data remains an open question. In this study, missing data are defined as both failure to interview potential research participants identified in the sampling frame and failure to retain enrolled research participants longitudinally. Using the sample from the Mobile Youth Survey, a multiple-cohort, longitudinal study of adolescents living in highly impoverished neighborhoods in Mobile, Alabama, we examined sample representativeness and dropout to determine whether missing data led to a nonrepresentative, and therefore biased, sample. Results indicate that even though random procedures were not strictly used to draw the sample, (a) the sample appears to be largely representative of the population that was studied, and (b) attrition is largely uncorrelated with the characteristics of those who dropped out. This suggests that it is possible to validly study hard-to-reach populations in community settings.
Abstract: In this study, we investigate the effects of missing data when estimating HIV/TB co-infection. We revisit the concept of missing data and examine three available approaches for dealing with missingness. The main objective is to identify the best method for correcting for missing data in the TB/HIV co-infection setting. We employ both empirical data analysis and an extensive simulation study to examine the effects of missing data on accuracy, sensitivity, specificity, and training and test error for the different approaches. The novelty of this work hinges on the use of modern statistical learning algorithms when treating missingness. In the empirical analysis, imputations were performed for both the HIV data and the TB-HIV co-infection data, with the missing values imputed using the different approaches. In the simulation study, sets of 0% (complete case), 10%, 30%, 50%, and 80% of the data were drawn randomly and replaced with missing values. Results show that complete cases alone gave a co-infection rate (95% confidence interval) of 29% (25%, 33%), the weighted method 27% (23%, 31%), the likelihood-based approach 26% (24%, 28%), and the multiple imputation approach 21% (20%, 22%). In conclusion, MI remains the best approach for dealing with missing data, and failure to apply it results in overestimation of the HIV/TB co-infection rate by 8%.
Abstract: Due to ethical and logistical concerns, it is common for data monitoring committees to periodically monitor accruing clinical trial data to assess the safety, and possibly the efficacy, of a new experimental treatment. When formalized, monitoring is typically implemented using group sequential methods. In some cases, regulatory agencies have required that primary trial analyses be based solely on the judgment of an independent review committee (IRC). IRC assessments can complicate trial monitoring given the time lag typically associated with receiving assessments from the IRC. This results in a missing data problem wherein a surrogate measure of response may provide useful information for interim decisions and future monitoring strategies. In this paper, we present statistical tools that are helpful for monitoring a group sequential clinical trial with missing IRC data. We illustrate the proposed methodology in the case of binary endpoints under various missingness mechanisms, including assessments missing completely at random and missingness that depends on the IRC's measurement.
Abstract: The receiver operating characteristic (ROC) curve has been widely used in scientific research fields. After applying random hot deck imputation, we propose a smoothed empirical likelihood ratio statistic for the ROC curve with missing data. Its asymptotic distribution is a scaled chi-square distribution, and empirical likelihood confidence intervals for ROC curves are constructed. The simulation study shows that the proposed interval estimates perform well in terms of coverage probability for different sample sizes and response rates.
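Random hot deck imputation itself is a one-liner: each missing value is replaced by a random draw from the observed "donor" values. A minimal sketch follows (toy data, uniform donor selection assumed):

```python
import numpy as np

def random_hot_deck(x, rng):
    """Replace each missing value with a uniform random draw from the observed donors."""
    x = np.array(x, dtype=float)
    miss = np.isnan(x)
    x[miss] = rng.choice(x[~miss], size=miss.sum(), replace=True)
    return x

rng = np.random.default_rng(8)
scores = [0.9, np.nan, 0.4, 0.7, np.nan, 0.2]
print(random_hot_deck(scores, rng))
```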
Abstract: In his 1987 classic book on multiple imputation (MI), Rubin used the fraction of missing information, γ, to define the relative efficiency (RE) of MI as RE = (1 + γ/m)^(−1/2), where m is the number of imputations, leading to the conclusion that a small m (≤ 5) would be sufficient for MI. However, evidence has been accumulating that many more imputations are needed. Why would the apparently sufficient m deduced from the RE actually be too small? The answer may lie with γ. In this research, γ was determined at fractions of missing data (δ) of 4%, 10%, 20%, and 29% using the 2012 Physician Workflow Mail Survey of the National Ambulatory Medical Care Survey (NAMCS). The γ values were strikingly small, ranging in order of magnitude from 10^(−6) to 0.01. As δ increased, γ usually increased but sometimes decreased. How the data were analysed had the dominating effect on γ, overshadowing the effect of δ. The results suggest that it is impossible to predict γ from δ and that it may not be appropriate to use the γ-based RE to determine a sufficient m.
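The formula is easy to evaluate directly, which makes the paper's point vivid: when γ is tiny, RE is essentially 1 for any m, so the RE criterion cannot distinguish m = 3 from m = 100. A quick check, with the γ grid chosen here for illustration:

```python
def relative_efficiency(gamma, m):
    """Rubin's RE of m-imputation MI relative to m = infinity: (1 + gamma/m) ** -0.5."""
    return (1.0 + gamma / m) ** -0.5

for gamma in (0.01, 0.3, 0.9):
    res = {m: round(relative_efficiency(gamma, m), 4) for m in (3, 5, 20, 100)}
    print(f"gamma = {gamma}: {res}")
```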
Abstract: This paper simultaneously investigates variable selection and imputation estimation for a semiparametric partially linear varying-coefficient model in the case where responses are missing for cluster data. A commonly used approach to dealing with missing data is complete-case analysis; here, the idea of complete-case data is combined with shrinkage estimation on the different clusters. In order to avoid biased results and improve estimation efficiency, this article introduces the Group Least Absolute Shrinkage and Selection Operator (Group Lasso) into the semiparametric model; that is, the method combines local polynomial smoothing with the Least Absolute Shrinkage and Selection Operator. It can therefore conduct nonparametric estimation and variable selection in a computationally efficient manner. Under the same criterion, the parametric estimators are also obtained. Additionally, the nonparametric and parametric estimators are derived for each cluster, and the weighted average across clusters is computed as the final estimator. Moreover, the large-sample properties of the estimators are derived.
Abstract: The analysis of spatially correlated binary data observed on lattices is an interesting topic that catches the attention of scholars in many scientific fields, including epidemiology, medicine, agriculture, biology, geology, and geography. To overcome the difficulties encountered in fitting the autologistic regression model to such data via Bayesian and/or Markov chain Monte Carlo (MCMC) techniques, the Gaussian latent variable model has been enrolled in the methodology. However, assuming a normal distribution for the latent random variable may not be realistic; wrong normality assumptions might bias the parameter estimates and affect the accuracy of the results and inferences. This calls for more flexible prior distributions for the latent variable in spatial models. A review of the recent literature in spatial statistics shows an increasing tendency toward models involving skew distributions, especially skew-normal ones. In this study, skew-normal latent variable modeling was developed for Bayesian analysis of spatially correlated binary data acquired on uncorrelated lattices. The proposed methodology was applied to inspect the spatial dependency and related factors of tooth caries occurrence in a sample of students of Yasuj University of Medical Sciences, Yasuj, Iran. The results indicated that the skew-normal latent variable model was valid and provided a decent fit to the caries data.
Abstract: In this paper, we focus on a type of inverse problem in which the data are expressed as an unknown function of the sought and unknown model function (or of its discretized representation as a model parameter vector). In particular, we deal with situations in which training data are not available, so the unknown functional relationship between the data and the model function (or parameter vector) cannot be modeled with a Gaussian process of appropriate dimensionality. A Bayesian method based on state space modelling is advanced instead. Within this framework, the likelihood is expressed in terms of the probability density function (pdf) of the state space variable, and the sought model parameter vector is embedded within the domain of this pdf. As the measurable vector lives only inside an identified sub-volume of the system state space, the pdf of the state space variable is projected onto the space of the measurables, and the likelihood is written in terms of this projected state space density; the final form of the likelihood is obtained after convolution with the distribution of measurement errors. Application-motivated vague priors are invoked, and the posterior probability density of the model parameter vectors, given the data, is computed. Inference is performed by taking posterior samples with adaptive MCMC. The method is illustrated on synthetic as well as real galactic data.
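Posterior sampling of the kind described above can be illustrated with a bare-bones random-walk Metropolis sampler; the paper uses an adaptive MCMC variant, and the toy log-posterior below is an assumption made only to keep the sketch self-contained.

```python
import numpy as np

def metropolis(log_post, x0, steps=5000, scale=0.5, seed=0):
    """Bare-bones random-walk Metropolis sampler over a given log-posterior."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = []
    for _ in range(steps):
        prop = x + scale * rng.standard_normal(x.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept with prob min(1, ratio)
            x, lp = prop, lp_prop
        chain.append(x.copy())
    return np.array(chain)

# Toy posterior: an isotropic bivariate standard normal.
chain = metropolis(lambda v: -0.5 * np.sum(v ** 2), x0=np.zeros(2))
print("posterior mean ~", chain.mean(axis=0), " sd ~", chain.std(axis=0))
```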
Abstract: The absence of some data values in any observed dataset has been a real hindrance to achieving valid results in statistical research. This paper aimed at the widespread missing data problem faced by analysts and statisticians in academia and professional environments. Some data-driven methods were studied to obtain accurate data. Projects that rely heavily on data face this missing data problem, and since machine learning models are only as good as the data used to train them, the missing data problem has a real impact on the solutions developed for real-world problems. Therefore, in this dissertation, there is an attempt to solve this problem using different mechanisms. This is done by testing the effectiveness of both traditional and modern data imputation techniques, determining the loss of statistical power when these different approaches are used to tackle the missing data problem. At the end of this research dissertation, it should be easy to establish which methods are best for handling the research problem. It is recommended that Multivariate Imputation by Chained Equations (MICE) is the best approach to dealing with MAR missingness.
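scikit-learn's IterativeImputer with sample_posterior=True gives a MICE-flavored sketch of the recommended approach: draw several imputations by chained regressions with posterior-style noise, then pool an estimate across them. The bivariate toy data, the MAR mechanism, and the choice of m = 5 are assumptions here; the dissertation presumably used a dedicated MICE package.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(9)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=400)
X_miss = X.copy()
mar = (X[:, 0] > 0) & (rng.random(400) < 0.5)  # missingness driven by the observed column: MAR
X_miss[mar, 1] = np.nan

# m = 5 chained-equations imputations with posterior-style draws, then pool the estimates.
imputations = [IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X_miss)
               for i in range(5)]
pooled = np.mean([imp[:, 1].mean() for imp in imputations])
print("pooled estimate of the column-2 mean:", round(pooled, 3))
```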